Types

                              By
        
                      Jack W. Crenshaw, Ph.D.
        
                           26 May 1990                     
        
*****************************************************************
*                                                               *
*                        COPYRIGHT NOTICE                       *
*                                                               *
*   Copyright (C) 1990 Jack W. Crenshaw. All rights reserved.   *
*                                                               *
*****************************************************************

Introduction

In the last instalment (Part XIII: Procedures) I mentioned that in that part and this one, we would cover the two features that tend to separate the toy language from a real, usable one. We covered procedure calls in that instalment. Many of you have been waiting patiently, since August '89, for me to drop the other shoe. Well, here it is.

In this instalment, we'll talk about how to deal with different data types. As I did in the last segment, I will not incorporate these features directly into the TINY compiler at this time. Instead, I'll be using the same approach that has worked so well for us in the past: using only fragments of the parser and single-character tokens. As usual, this allows us to get directly to the heart of the matter without having to wade through a lot of unnecessary code. Since the major problems in dealing with multiple types occur in the arithmetic operations, that's where we'll concentrate our focus.

A few words of warning: First, there are some types that I will not be covering in this instalment. Here we will only be talking about the simple, predefined types. We won't even deal with arrays, pointers or strings in this instalment; I'll be covering them in the next few. PPS: These are not covered later.

Second, we also will not discuss user-defined types. That will not come until much later, for the simple reason that I still haven't convinced myself that user-defined types belong in a language named KISS. In later instalments, I do intend to cover at least the general concepts of user-defined types, records, etc., just so that the series will be complete. But whether or not they will be included as part of KISS is still an open issue. I am open to comments or suggestions on this question.

Finally, I should warn you: what we are about to do can add considerable extra complication to both the parser and the generated code. Handling variables of different types is straightforward enough. The complexity comes in when you add rules about conversion between types. In general, you can make the compiler as simple or as complex as you choose to make it, depending upon the way you define the type-conversion rules. Even if you decide not to allow any type conversions (as in Ada, for example) the problem is still there, and is built into the mathematics. When you multiply two short numbers, for example, you can get a long result.

I've approached this problem very carefully, in an attempt to Keep It Simple. But we can't avoid the complexity entirely. As has so often has happened, we end up having to trade code quality against complexity, and as usual I will tend to opt for the simplest approach.

What's coming next?

Before diving into the tutorial, I think you'd like to know where we are going from here ... especially since it's been so long since the last instalment.

I have not been idle in the meantime. What I've been doing is reorganizing the compiler itself into units. One of the problems I've encountered is that as we've covered new areas and thereby added features to the TINY compiler, it's been getting longer and longer. I realized a couple of instalments back that this was causing trouble, and that's why I've gone back to using only compiler fragments for the last instalment and this one. The problem is that it just seems dumb to have to reproduce the code for, say, processing boolean exclusive OR's, when the subject of the discussion is parameter passing.

The obvious way to have our cake and eat it, too, is to break up the compiler into separately compilable modules, and of course the unit is an ideal vehicle for doing this. This allows us to hide some fairly complex code (such as the full arithmetic and boolean expression parsing) into a single unit, and just pull it in whenever it's needed. In that way, the only code I'll have to reproduce in these instalments will be the code that actually relates to the issue under discussion.

I've also been toying with Turbo 5.5, which of course includes the Borland object-oriented extensions to Pascal. I haven't decided whether to make use of these features, for two reasons. First of all, many of you who have been following this series may still not have 5.5, and I certainly don't want to force anyone to have to go out and buy a new compiler just to complete the series. Secondly, I'm not convinced that the O-O extensions have all that much value for this application. We've been having some discussions about that in CompuServe's CLM forum, and so far we've not found any compelling reason to use O-O constructs. This is another of those areas where I could use some feedback from you readers. Anyone want to vote for Turbo 5.5 and O-O?

In any case, after the next few instalments in the series, the plan is to upload to you a complete set of units, and complete functioning compilers as well. The plan, in fact, is to have three compilers: One for a single-character version of TINY (to use for our experiments), one for TINY and one for KISS. I've pretty much isolated the differences between TINY and KISS, which are these:

  • TINY will support only two data types: The character and the 16-bit integer. I may also try to do something with strings, since without them a compiler would be pretty useless. KISS will support all the usual simple types, including arrays and even floating point.
  • TINY will only have two control constructs, the IF and the WHILE. KISS will support a very rich set of constructs, including one we haven't discussed here before ... the CASE.
  • KISS will support separately compilable modules.

For the programs I plan to upload, all the code generation has been carefully encapsulated into a single unit, so that any enterprising student should be able to easily re-target to any other processor. This task is "left as an exercise for the student".

But enough talk. Let's get on with the study of types. As I said earlier, we'll do this one as we did in the last instalment: by performing experiments using single-character tokens.

The Symbol Table

It should be apparent that, if we're going to deal with variables of different types, we're going to need some place to record what those types are. The obvious vehicle for that is the symbol table, and we've already used it that way to distinguish, for example, between local and global variables, and between variables and procedures.

The symbol table structure for single-character tokens is particularly simple, and we've used it several times before. To deal with it, we'll steal some procedures that we've used before.

First, we need to declare the symbol table itself:

{ Variable Declarations }
var 
  Look : char;              { Lookahead Character }
  ST : array['A' .. 'Z'] of char;   {  *** ADD THIS LINE ***}

Next, we need to make sure it's initialized as part of procedure Init:

{ Initialize }
procedure Init;
var
  i : char;
begin
  for i := 'A' to 'Z' do
    ST[i] := '?';
  GetChar;
end;

We don't really need the next procedure, but it will be helpful for debugging. All it does is to dump the contents of the symbol table:

{ Dump the symbol table }
procedure DumpTable;
var 
  i : char;
begin
  for i := 'A' to 'Z' do
    WriteLn(i, ' ', ST[i]);
end;  

It really doesn't matter much where you put this procedure ... I plan to cluster all the symbol table routines together, so I put mine just after the error reporting procedures.

If you're the cautious type (as I am), you might want to begin with a test program that does nothing but initializes, then dumps the table. Just to be sure that we're all on the same wavelength here, I'm reproducing the entire program below, complete with the new procedures. Note that this version includes support for white space:

program Types01;
{$Apptype Console}
uses
  SysUtils;

{ Constant declarations }
const
  TAB = ^I;
  CR  = ^M;
  LF  = ^J;

{ Variable declarations }
var
  Look : char;              { Lookahead Character }
  ST : array['A' .. 'Z'] of char;

{ Read new character from input stream }
procedure GetChar;
begin
  Read(Look);
end;

{ Report an error }
procedure Error(s : string);
begin
  WriteLn;
  WriteLn(^G, 'Error: ', s, '.');
  ReadLn;
  ReadLn;
end;

{ Report error and halt }
procedure Abort(s : string);
begin
  Error(s);
  Halt;
end;

{ Report what was expected }
procedure Expected(s : string);
begin
  Abort(s + ' Expected');
end;

{ Dump the symbol table }
procedure DumpTable;
var i : char;
begin
  for i := 'A' to 'Z' do
    WriteLn(i, ' ', ST[i]);
end;

{ Recognize an alpha character }
function IsAlpha(c : char): boolean;
begin
  IsAlpha := UpCase(c) in ['A' .. 'Z'];
end;

{ Recognize a decimal digit }
function IsDigit(c : char) : boolean;
begin
  IsDigit := c in ['0' .. '9'];
end;

{ Recognize an alphanumeric character }
function IsAlNum(c : char) : boolean;
begin
  IsAlNum := IsAlpha(c) or IsDigit(c);
end;

{ Recognize an addop }
function IsAddop(c : char) : boolean;
begin
  IsAddop := c in ['+', '-'];
end;

{ Recognize a mulop }
function IsMulop(c : char) : boolean;
begin
  IsMulop := c in ['*', '/'];
end;

{ Recognize a Boolean orop }
function IsOrop(c : char) : boolean;
begin
  IsOrop := c in ['|', '~'];
end;

{ Recognize a relop }
function IsRelop(c : char): boolean;
begin
  IsRelop := c in ['=', '#', '<', '>'];
end;

{ Recognize white space }
function IsWhite(c : char) : boolean;
begin
  IsWhite := c in [' ', TAB];
end;

{ Skip over leading white space }
procedure SkipWhite;
begin
  while IsWhite(Look) do
    GetChar;
end;

{ Skip over an end-of-line }
procedure Fin;
begin
  if Look = CR then
    begin
      GetChar;
      if Look = LF then
        GetChar;
   end;
end;

{ Match a specific input character }
procedure Match(x : char);
begin
  if Look = x then
    GetChar
  else Expected('''' + x + '''');
    SkipWhite;
end;

{ Get an identifier }
function GetName : char;
begin
  if not IsAlpha(Look) then
    Expected('Name');
  GetName := UpCase(Look);
  GetChar;
  SkipWhite;
end;

{ Get a number }
function GetNum : char;
begin
  if not IsDigit(Look) then
    Expected('Integer');
  GetNum := Look;
  GetChar;
  SkipWhite;
end;

{ Output a string with tab }
procedure Emit(s : string);
begin
  Write(TAB, s);
end;

{ Output a string with tab and CRLF }
procedure EmitLn(s : string);
begin
  Emit(s);
  WriteLn;
end;

{ Initialize }
procedure Init;
var
  i : char;
begin
  for i := 'A' to 'Z' do
    ST[i] := '?';
  GetChar;
  SkipWhite;
end;

{ Main program }
begin
  Init;
  DumpTable;
  ReadLn;
  ReadLn;
end.

OK, run this program and press Enter. You should get a (very fast) printout of all the letters of the alphabet (potential identifiers), each followed by a question mark. Not very exciting, but it's a start.

Of course, in general we only want to see the types of the variables that have been defined. We can eliminate the others by modifying DumpTable with an IF test. Change the loop to read:

 for i := 'A' to 'Z' do
   if ST[i] <> '?' 
     then
       WriteLn(i, ' ', ST[i]);   
 

Now, run the program again. What did you get?

Well, that's even more boring than before! There was no output at all, since at this point none of the names have been declared. We can spice things up a bit by inserting some statements declaring some entries in the main program. Try these:

ST['A'] := 'a';
ST['P'] := 'b';
ST['X'] := 'c';  

This time, when you run the program, you should get an output showing that the symbol table is working.

Adding Entries

Of course, writing to the table directly is pretty poor practice, and not one that will help us much later. What we need is a procedure to add entries to the table. At the same time, we know that we're going to need to test the table, to make sure that we aren't redeclaring a variable that's already in use (easy to do with only 26 choices!). To handle all this, enter the following new procedures:

{ Report type of a variable }
function TypeOf(N : char): char;
begin
  TypeOf := ST[N];
end;

{ Report if a variable is in the table }
function InTable(N : char): boolean;
begin
  InTable := TypeOf(N) <> '?';
end; 

{ Check for a duplicate variable name }
procedure CheckDup(N : char);
begin
  if InTable(N) then 
    Abort('Duplicate Name ' + N);
end;

{ Add entry to table }
procedure AddEntry(N, T : char);
begin
  CheckDup(N);
  ST[N] := T;
end;
Now change the three lines in the main program to read:
AddEntry('A', 'a');
AddEntry('P', 'b');
AddEntry('X', 'c');  
and run the program again. Did it work? Then we have the symbol table routines needed to support our work on types. In the next section, we'll actually begin to use them.

Allocating Storage

In other programs like this one, including the TINY compiler itself, we have already addressed the issue of declaring global variables, and the code generated for them. Let's build a vestigial version of a "compiler" here, whose only function is to allow us declare variables. Remember, the syntax for a declaration is:

        <data decl> ::= VAR <identifier>  

Again, we can lift a lot of the code from previous programs. The following are stripped-down versions of those procedures. They are greatly simplified since I have eliminated niceties like variable lists and initializers. In procedure Alloc, note that the new call to AddEntry will also take care of checking for duplicate declarations:

{ Allocate storage for a variable }
procedure Alloc(N : char);
begin
  AddEntry(N, 'v');
  WriteLn('//Allocate storage for ', N);
end;

{ Parse and translate a data declalration }
procedure Decl;
begin
  Match('v');
  Alloc(GetName);
end;

{ Parse and translate global declarations }
procedure TopDecls;
begin
  while Look <> '.' do 
    begin
      case Look of
        'v' : Decl;
      else 
        Abort('Unrecognized Keyword ' + Look);
      end;
      Fin;
    end;
end;

Now, in the main program, add a call to TopDecls and run the program. Try allocating a few variables, and note the resulting code generated. This is old stuff for you, so the results should look familiar. Note from the code for TopDecls that the program is ended by a terminating period.

While you're at it, try declaring two variables with the same name, and verify that the parser catches the error.

Declaring Types

Allocating storage of different sizes is as easy as modifying procedure TopDecls to recognize more than one keyword. There are a number of decisions to be made here, in terms of what the syntax should be, etc., but for now I'm going to duck all the issues and simply declare by executive fiat that our syntax will be:
      <data decl> ::= <typename>  <identifier>  
where:
        <typename> ::= BYTE | WORD | LONG  

We can create the code to take care of these declarations with only slight modifications. In the routines below, note that I've separated the code generation parts of Alloc from the logic parts. This is in keeping with our desire to encapsulate the machine-dependent part of the compiler.

{ Generate code for allocation of a variable }
procedure AllocVar(N, T : char);
var
  VarType : String;
begin
  case T of
    'B' : VarType := 'shortint';
    'W' : VarType := 'smallint';
    'L' : VarType := 'integer';
  end;
  WriteLn('var ', N, ' :', TAB, VarType, ';');
end;

{ Allocate storage for a variable }
procedure Alloc(N, T : char);
begin
  AddEntry(N, T);
  AllocVar(N, T);
end;

{ Parse and translate a data declaration }
procedure Decl;
var 
  Typ : char;
begin
  Typ := GetName;
  Alloc(GetName, Typ);
end;

{ Parse and translate global declarations }
procedure TopDecls;
begin
  while Look <> '.' do 
    begin
      case Look of
        'b', 'w', 'l' : Decl;
      else 
        Abort('Unrecognized Keyword ' + Look);
      end;
      Fin;
    end;
end;  

Make the changes shown to these procedures, and give the thing a try. Use the single characters 'b', 'w', and 'l' for the keywords (they must be lower case, for now). You will see that in each case, we are allocating the proper storage size. Note from the dumped symbol table that the sizes are also recorded for later use. What later use? Well, that's the subject of the rest of this instalment.

Assignments

Now that we can declare variables of different sizes, it stands to reason that we ought to be able to do something with them. For our first trick, let's just try loading them into our working register, EAX. We also want to continue to encapsulate the machine-dependent stuff. The MOVSX instruction moves with sign extension, so that a negative number in a shortint or smallint variable will be stored correctly in the 32-bit register. The load procedure looks like this:

{ Load a variable to primary register }      
procedure LoadVar(Name, Typ : char);
begin
  if Typ = 'L' then
    EmitLn('MOV EAX, ' + Name)
  else
    EmitLn('MOVSX EAX, ' + Name);
end;

Note that the routine is strictly a code generators; it has no error-checking or other logic. To complete the picture, we need one more layer of software that provides these functions.

First of all, we need to make sure that the type we are dealing with is a loadable type. This sounds like a job for another recognizer:

{ Recognize a legal variable type }
function IsVarType(c : char) : boolean;
begin
  IsVarType := c in ['B', 'W', 'L'];
end;  

Next, it would be nice to have a routine that will fetch the type of a variable from the symbol table, while checking it to make sure it's valid:

{ Get a variable type from the symbol talble }
function VarType(Name : char) : char;
var 
  Typ : char;
begin
  Typ := TypeOf(Name);
  if not IsVarType(Typ) then 
    Abort('Identifier ' + Name + ' is not a variable');
  VarType := Typ;
end;  

Armed with these tools, a procedure to cause a variable to be loaded becomes trivial:

 { Load a variable to the primary register }
procedure Load(Name : char);
begin  
  LoadVar(Name, VarType(Name));
end;  
 

(Note to the concerned: I know, I know, all this is all very inefficient. In a production program, we probably would take steps to avoid such deep nesting of procedure calls. Don't worry about it. This is an exercise, remember? It's more important to get it right and understand it, than it is to make it get the wrong answer, quickly. If you get your compiler completed and find that you're unhappy with the speed, feel free to come back and hack the code to speed it up!)

It would be a good idea to test the program at this point. Since we don't have a procedure for dealing with assignments yet, we used the main program:
{ Main program }
begin
  Init;
  TopDecls;
  Load('A');
  Load('B');
  Load('C');
  ReadLn;
  ReadLn;
end.
with the input  wa lb bc.

You can play around with this, and try different combinations of declarations to see how the errors are handled.

I'm sure you won't be surprised to learn that storing variables is a lot like loading them. The necessary procedures are shown next:

{ Store primary register in variable }      
procedure StoreVar(Name, Typ : char);
begin
  case Typ of  
    'L' : EmitLn('MOV ' + Name + ', EAX');
    'W' : EmitLn('MOV ' + Name + ', AX');
    'B' : EmitLn('MOV ' + Name + ', AL');
  end;
end;

{ Store a variable from the primary register }
procedure Store(Name : char);
begin  
  StoreVar(Name, VarType(Name));
end;

You can test this one the same way as the loads.

Now, of course, it's a rather small step to use these to handle assignment statements. What we'll do is to create a special version of procedure Block that supports only assignment statements, and also a special version of Expression that only supports single variables as legal expressions. Here they are:
{ Parse and translate an expression }
procedure Expression;
var 
  Name : char;
begin
  Load(GetName);
end;

{ Parse and translate an assignment statement }
procedure Assignment;
var 
  Name : char;
begin
  Name := GetName;
  Match('=');
  Expression;
  Store(Name);
end;

{ Parse and translate a block of statements }
procedure Block;
begin
  while Look <> '.' do 
    begin
      Assignment;
      Fin;
    end;
end;  

(It's worth noting that, if anything, the new procedures that permit us to manipulate types are, if anything, even simpler and cleaner than what we've seen before. This is mostly thanks to our efforts to encapsulate the code generator procedures.)

There is one small, nagging problem. Before, we used the Pascal terminating period to get us out of procedure TopDecls. This is now the wrong character ... it's used to terminate Block. In previous programs, we've used the BEGIN symbol (abbreviated 'b') to get us out. But that is now used as a type symbol.

The solution, while somewhat of a kludge, is easy enough. We'll use an upper case 'B' to stand for the BEGIN. So change the character in the WHILE loop within TopDecls, from '.' to 'B', and everything will be fine.

Now, we can complete the task by changing the main program to read:

{ Main program }
begin
  Init;
  TopDecls;
  Match('B');
  Fin;
  Block;
  DumpTable;
  ReadLn;
  ReadLn;
end.  

(Note that I've had to sprinkle a few calls to Fin around to get us out of Newline troubles.)

OK, run this program. Try the input:

     ba        { byte a }   *** DON'T TYPE THE COMMENTS!!! ***
     wb        { word b }
     lc        { long c }
     B         { begin  }
     a=a
     a=b
     a=c
     b=a
     b=b
     b=c
     c=a
     c=b
     c=c
     .

You should get code generated that declares each variable. For each assignment, you should get code that loads and stores a variable.

Save the program as Types02 in preparation for assigning literal values to variables.

Before we do anything with variables of different types, even if it's just to copy them, we have to face up to the issue of type conversion or coercion. It is not the most easy part of a compiler. Most of the bugs I have seen in production compilers have had to do with errors in type conversion for some obscure combination of arguments. As usual, there is a trade-off between compiler complexity and the potential quality of the generated code, and as usual, we will take the path that keeps the compiler simple. I think you'll find that, with this approach, we can keep the potential complexity in check rather nicely.

The Coward's Way Out

Before we get into the details (and potential complexity) of type conversion, I'd like you to see that there is one super-simple way to solve the problem: simply promote every variable to a long integer when we load it!

PPS: We will not take this further because we could instead simply revert to the older versions with no choice of variable type.

A More Reasonable Solution

Is there still a relatively easy way to get data conversion? Can we still Keep It Simple?

Yes, indeed. All we have to do is to make the conversion at the other end ... that is, we convert on the way out, when the data is stored, rather than on the way in.

PPS: We are omitting this tricky section, at least for the time being, because the remaining instalments revert to a single integer type. Consult the original tutorial if you are planning to use more than one type of number in your translator.

Literal Arguments

Sharp-eyed readers might have noticed, though, that we don't even have a proper form of a simple factor yet, because we don't allow for loading literal constants, only variables. Let's fix that now.

To begin with, we'll need another GetNum function. We've seen several versions of this, some returning only a single character, some a string, and some an integer. The one needed here will return an integer, so that it can handle anything we throw at it. Note that no type information is returned here: GetNum doesn't concern itself with how the number will be used:

{ Get a number }
function GetNum : LongInt;
var
  Val : LongInt;
begin
  if not IsDigit(Look) then 
    Expected('Integer');
  Val := 0;
  while IsDigit(Look) do 
    begin
      Val := 10 * Val + Ord(Look) - Ord('0');
      GetChar;
    end;
  GetNum := Val;
  SkipWhite;
end;        
      

Now, when dealing with literal data, we have one small problem. With variables, we know what type things should be because they've been declared to be that type. We have no such type information for literals. When the programmer says, "-1", does that mean a byte, word, or integer version? We have no clue. The obvious thing to do would be to use the largest type possible, i.e. an integer. But that's a bad idea, because when we get to more complex expressions, we'll find that it will cause every expression involving literals to be promoted to long (integer), as well.

PPS: We understand that promotion is not a bad thing nowadays, because the registers are designed to work with integers. We will stick to the use of the full EAX register where possible (rather than the AX and AL portions depending on the type). Loading a constant becomes trivial:

{ Load a Constant to the Primary Register }
procedure LoadConst(N : LongInt);
var
  temp : string;
begin
  Str(N, temp);
  EmitLn('MOV EAX, ' + temp);
end;

Now we can modify procedure Expression to accommodate the two possible kinds of factor:

{ Parse and translate an expression }
procedure Expression;
begin
  if IsAlpha(Look) then
    Load(GetName)
  else
    LoadConst(GetNum);
end;    
   

(Wow, that sure didn't hurt too bad! Just a few extra lines do the job.)

OK, compile this code into your program and give it a try. You'll see that it now works for either variables or constants as valid expressions. Save the program as Types03 in preparation for more expressions.

Additive Expressions

If you've been following this series from the beginning, I'm sure you know what's coming next. We'll expand the form for an expression to handle first additive expressions, then multiplicative, then general expressions with parentheses.

The nice part is that we already have a pattern for dealing with these more complex expressions.

PPS: You can rename our existing Expression to Term, as before, and create the changed version of Expression. You can go back to previous programs such as Parse03 and use existing procedures for Expression, Add and Subtract, then proceed as before to introduce multiplication and division etc. Consult the original Part XIV to see the corresponding functions to return the type of integer and ways of inter-converting integer types on the 68k. We will skip ahead to the concluding discussion.

Beginning to Wind Down

At last, in this instalment, we've learned how to deal with variables (and literals) of different types. As you can see, it hasn't been too tough. In fact, in some ways most of the code looks even more simple than it does in earlier programs. Only the multiplication and division operators require a little thinking and planning.

The main concept that made things easy was that of converting procedures such as Expression into functions that return the type of the result. Once this was done, we were able to retain the same general structure of the compiler.

I won't pretend that we've covered every single aspect of the issue. I conveniently ignored unsigned arithmetic. From what we've done, I think you can see that to include them adds no new challenges, just extra possibilities to test for.

I've also ignored the logical operators And, Or, etc. It turns out that these are pretty easy to handle. All the logical operators are bitwise operations, so they are symmetric and therefore work in the same fashion as PopAdd. There is one difference, however: if it is necessary to extend the word length for a logical variable, the extension should be done as an unsigned number. Floating point numbers, again, are straightforward to handle ... just a few more procedures to be added to the run-time library, or perhaps instructions for a maths chip.

Perhaps more importantly, I have also skirted the issue of type checking, as opposed to conversion. In other words, we've allowed for operations between variables of all combinations of types. In general this will not be true ... certainly you don't want to add an integer, for example, to a string. Most languages also don't allow you to mix up character and integer variables.

Again, there are really no new issues to be addressed in this case. We are already checking the types of the two operands ... much of this checking gets done in procedures like SameType. It's pretty straightforward to include a call to an error handler, if the types of the two operands are incompatible.

In the general case, we can think of every single operator as being handled by a different procedure, depending upon the type of the two operands. This is straightforward, though tedious, to implement simply by implementing a jump table with the operand types as indices. In Pascal, the equivalent operation would involve nested Case statements. Some of the called procedures could then be simple error routines, while others could effect whatever kind of conversion we need. As more types are added, the number of procedures goes up by a square-law rule, but that's still not an unreasonably large number of procedures.

What we've done here is to collapse such a jump table into far fewer procedures, simply by making use of symmetry and other simplifying rules.

To Coerce Or Not To Coerce

In case you haven't gotten this message yet, it sure appears that TINY and KISS will probably not be strongly typed languages, since I've allowed for automatic mixing and conversion of just about any type. Which brings up the next issue:

Is this really what we want to do?

The answer depends on what kind of language you want, and the way you'd like it to behave. What we have not addressed is the issue of when to allow and when to deny the use of operations involving different data types. In other words, what should be the semantics of our compiler? Do we want automatic type conversion for all cases, for some cases, or not at all?

Let's pause here to think about this a bit more. To do so, it will help to look at a bit of history.

FORTRAN II supported only two simple data types: Integer and Real. It allowed implicit type conversion between real and integer types during assignment, but not within expressions. All data items (including literal constants) on the right-hand side of an assignment statement had to be of the same type. That made things pretty easy ... much simpler than what we've had to do here.

This was changed in FORTRAN IV to support "mixed-mode" arithmetic. If an expression had any real data items in it, they were all converted to reals and the expression itself was real. To round out the picture, functions were provided to explicitly convert from one type to the other, so that you could force an expression to end up as either type.

This led to two things: code that was easier to write, and code that was less efficient. That's because sloppy programmers would write expressions with simple constants like 0 and 1 in them, which the compiler would dutifully compile to convert at execution time. Still, the system worked pretty well, which would tend to indicate that implicit type conversion is a Good Thing.

C is also a weakly typed language, though it supports a larger number of types. C won't complain if you try to add a character to an integer, for example. Partly, this is helped by the C convention of promoting every char to integer when it is loaded, or passed through a parameter list. This simplifies the conversions quite a bit. In fact, in subset C compilers that don't support long or float types, we end up back where we were in our earlier, simple-minded first try: every variable has the same representation, once loaded into a register. Makes life pretty easy!

The ultimate language in the direction of automatic type conversion is PL/I. This language supports a large number of data types, and you can mix them all freely. If the implicit conversions of FORTRAN seemed good, then those of PL/I should have been Heaven, but it turned out to be more like Hell! The problem was that with so many data types, there had to be a large number of different conversions, and a correspondingly large number of rules about how mixed operands should be converted. These rules became so complex that no one could remember what they were! A lot of the errors in PL/I programs had to do with unexpected and unwanted type conversions. Too much of a Good Thing can be bad for you!

Pascal, on the other hand, is a language which is "strongly typed", which means that in general you can't mix types, even if they differ only in name, and yet have the same base type! Niklaus Wirth made Pascal strongly typed to help keep programmers out of trouble, and the restrictions have indeed saved many a programmer from himself, because the compiler kept him from doing something dumb. Better to find the bug in compilation rather than the debug phase. The same restrictions can also cause frustration when you really want to mix types, and they tend to drive an ex-C-programmer up the wall.

Even so, Pascal does permit some implicit conversions. You can assign an integer to a real value. You can also mix integer and real types in expressions of type Real. The integers will be automatically coerced to real, just as in FORTRAN (and with the same hidden cost in run-time overhead).

You can't, however, convert the other way, from real to integer, without applying an explicit conversion function, Trunc. The theory here is that, since the numerical value of a real number is necessarily going to be changed by the conversion (the fractional part will be lost), you really shouldn't do it in "secret".

In the spirit of strong typing, Pascal will not allow you to mix Char and Integer variables, without applying the explicit coercion functions Chr and Ord.

Turbo Pascal also includes the types Byte, Word, and LongInt. The first two are basically the same as unsigned integers. In Turbo, these can be freely intermixed with variables of type Integer, and Turbo will automatically handle the conversion. There are run-time checks, though, to keep you from overflowing or otherwise getting the wrong answer. Note that you still can't mix Byte and Char types, even though they are stored internally in the same representation.

The ultimate in a strongly-typed language is Ada, which allows no implicit type conversions at all, and also will not allow mixed-mode arithmetic. Jean Ichbiah's position is that conversions cost execution time, and you shouldn't be allowed to build in such cost in a hidden manner. By forcing the programmer to explicitly request a type conversion, you make it more apparent that there could be a cost involved.

I have been using another strongly-typed language, a delightful little language called Whimsical, by John Spray. Although Whimsical is intended as a systems programming language, it also requires explicit conversion every time. There are never any automatic conversions, even the ones supported by Pascal.

This approach does have certain advantages: The compiler never has to guess what to do: the programmer always tells it precisely what he wants. As a result, there tends to be a more nearly one-to-one correspondence between source code and compiled code, and John's compiler produces very tight code.

On the other hand, I sometimes find the explicit conversions to be a pain. If I want, for example, to add one to a character, or AND it with a mask, there are a lot of conversions to make. If I get it wrong, the only error message is "Types are not compatible". As it happens, John's particular implementation of the language in his compiler doesn't tell you exactly which types are not compatible ... it only tells you which line the error is in.

I must admit that most of my errors with this compiler tend to be errors of this type, and I've spent a lot of time with the Whimsical compiler, trying to figure out just where in the line I've offended it. The only real way to fix the error is to keep trying things until something works.

So what should we do in TINY and KISS? For the first one, I have the answer: TINY will support only the types Char and Integer, and we'll use the C trick of promoting Chars to Integers internally. That means that the TINY compiler will be much simpler than what we've already done. Type conversion in expressions is sort of moot, since none will be required! Since longwords will not be supported, we also won't need the MUL32 and DIV32 run-time routines, nor the logic to figure out when to call them. I like it!

KISS, on the other hand, will support the type Long.

Should it support both signed and unsigned arithmetic? For the sake of simplicity I'd rather not. It does add quite a bit to the complexity of type conversions. Even Niklaus Wirth has eliminated unsigned (Cardinal) numbers from his new language Oberon, with the argument that 32-bit integers should be long enough for anybody, in either case.

But KISS is supposed to be a systems programming language, which means that we should be able to do whatever operations that can be done in assembler. Since the Intel supports both flavours of integers, I guess KISS should, also. We've seen that logical operations need to be able to extend integers in an unsigned fashion, so the unsigned conversion procedures are required in any case.

Conclusion

That wraps up our session on types. Sorry you had to wait so long for it, but hope you feel that it was worth the wait.

In the next few instalments, we'll extend the simple types to include arrays and pointers, and we'll have a look at what to do about strings. (PPS: Arrays, pointers and strings are not covered.) That should pretty well wrap up the mainstream part of the series. After that, I'll give you the new versions of the TINY and KISS compilers, and then we'll start to look at optimization issues.

See you then.

*****************************************************************
*                                                               *
*                        COPYRIGHT NOTICE                       *
*                                                               *
*   Copyright (C) 1990 Jack W. Crenshaw. All rights reserved.   *
*                                                               *
*****************************************************************
Programming - a skill for life!

PPS introduction to Let's build a compiler! by Jack Crenshaw