Brian Robert Callahan

academic, developer, with an eye towards a brighter techno-social life



[prev]
[next]

2021-09-06
Let's write a self-hosting compiler, part 1: Reading in source code

All source code for this blog post can be found here.

We begin a new journey today. It is time to use our compiler, written in C accepting PL/0. What we want to do now is write a new compiler, written in PL/0 accepting PL/0. Because the compiler we have understands PL/0, we can use it until we are at the point where our new compiler is good enough to fully compile itself. Once that happens, we can discard our current compiler and from then on use the new compiler to compile itself.

But we're not going to get there in a day. We will take it, as always, one step at a time. For today, we'll talk out the plan for our self-hosting compiler and code up enough to where we have a program written in PL/0 that can read in source code and print out what it read in.

A plan of action

The quickest way to get to the interesting parts of a self-hosting compiler will be to directly translate (as best we can) the C implementation to PL/0. Because let's be honest: now that we've written a lexer and a parser, those are not particularly interesting problems for us anymore. We still have to write the PL/0 versions, but doing so won't much teach us anything new. We will also translate the C code generator to start out so we can most quickly see our self-hosting compiler actually self-hosting. But after that the real fun begins. We will add additional syntax and remove the C code generator in favor of QBE, which we reviewed in our previous blog post.

PL/0 grammar

As a reminder for ourselves, here is the PL/0 grammar as found in a comment in our first compiler. We'll need this soon for our front end:

/*
 * pl0c -- PL/0 compiler.
 *
 * program	= block "." .
 * block	= [ "const" ident "=" number { "," ident "=" number } ";" ]
 *		  [ "var" ident [ array ] { "," ident [ array ] } ";" ]
 *		  { "forward" ident ";" }
 *		  { "procedure" ident ";" block ";" } statement .
 * statement	= [ ident ":=" expression
 *		  | "call" ident
 *		  | "begin" statement { ";" statement } "end"
 *		  | "if" condition "then" statement [ "else" statement ]
 *		  | "while" condition "do" statement
 *		  | "readInt" [ "into" ] ident
 *		  | "readChar" [ "into" ] ident
 *		  | "writeInt" expression
 *		  | "writeChar" expression
 *		  | "writeStr" ( ident | string )
 *		  | "exit" expression ] .
 * condition	= "odd" expression
 *		| expression ( comparator ) expression .
 * expression	= [ "+" | "-" | "not" ] term { ( "+" | "-" | "or" ) term } .
 * term		= factor { ( "*" | "/" | "mod" | "and" ) factor } .
 * factor	= ident
 *		| number
 *		| "(" expression ")" .
 * comparator	= "=" | "#" | "<" | ">" | "<=" | ">=" | "<>" .
 * array	= "size" number .
 */

Reading in input

For today, let's write a program that reads in source code and writes it out, so that we know things work.

Additionally, to prepare ourselves for writing our front end, I am also going to translate all our #defines from C to PL/0. That would be to keep the same names but mark them as const in PL/0. I am also going to add some data structures that we know we'll need, like a symbol table, buffers for token strings, and a line number. Let's write it out and we can discuss:

{
  pl0c -- PL/0 compiler written in PL/0.
}

const
  CHECK_LHS	= 0,
  CHECK_RHS	= 1,
  CHECK_CALL	= 2,

  TOK_IDENT	= 'I',
  TOK_NUMBER	= 'N',
  TOK_CONST	= 'C',
  TOK_VAR	= 'V',
  TOK_PROCEDURE	= 'P',
  TOK_CALL	= 'c',
  TOK_BEGIN	= 'B',
  TOK_END	= 'E',
  TOK_IF	= 'i',
  TOK_THEN	= 'T',
  TOK_ELSE	= 'e',
  TOK_WHILE	= 'W',
  TOK_DO	= 'D',
  TOK_ODD	= 'O',
  TOK_WRITEINT	= 'w',
  TOK_WRITECHAR	= 'H',
  TOK_WRITESTR	= 'S',
  TOK_READINT	= 'R',
  TOK_READCHAR	= 'h',
  TOK_INTO	= 'n',
  TOK_SIZE	= 's',
  TOK_EXIT	= 'X',
  TOK_AND	= '&',
  TOK_OR	= '|',
  TOK_NOT	= '~',
  TOK_DOT	= '.',
  TOK_EQUAL	= '=',
  TOK_COMMA	= ',',
  TOK_SEMICOLON	= ';',
  TOK_ASSIGN	= ':',
  TOK_HASH	= '#',
  TOK_LTHAN	= '<',
  TOK_GTHAN	= '>',
  TOK_LTHANE	= '{',
  TOK_GTHANE	= '}',
  TOK_PLUS	= '+',
  TOK_MINUS	= '-',
  TOK_MULTIPLY	= '*',
  TOK_DIVIDE	= '/',
  TOK_MODULO	= '%',
  TOK_LPAREN	= '(',
  TOK_RPAREN	= ')',
  TOK_LBRACK	= '[',
  TOK_RBRACK	= ']',
  TOK_STRING	= '"'
;

var
  raw size 1048576, { 8 MB, can hold files up to 1 MB in size }
  symtab size 1048576, { 8 MB, can hold up to 30840 symbols }
  token size 32, { 31 characters + NUL }
  str size 32, { For cmpstr }
  symtabcnt, { To keep track of number of symtab entries }
  ret, { Return code for procedures that need one }
  line { line number }
;

{ Misc. functions }

procedure error;
begin
  writeStr 'pl0c: error: ';
  writeInt line;
  writeStr ': '
end;

procedure readin;
  var ch, i;
begin
  i := 0;
  readChar into ch;
  while ch # -1 do
  begin
    raw[i] := ch;
    i := i + 1;
    if i = 1048576 then { File too big! }
    begin
      call error;
      writeStr 'file too big\n';
      exit 1
    end;
    readChar into ch
  end;
end;

{ Lexer }

{ Code generator }

{ Parser }

procedure parse;
begin
  writeStr raw
end;

{ Main }

begin
  call readin;
  call parse
end.

If you remember, our arrays are always of sizeof(long), or 8 on my OpenBSD/amd64 machine. So even though we can only hold source code files 1 MB in size, it currently takes us 8 MB to store it. Once we have a self-hosting compiler, we will add the packed reserved word, which will allow us to store these as bytes. But we're not there yet. Alternatively, you could discard the raw buffer entirely and just use readChar as you need the next character from the source code.

Speaking of reading in the source code: from where is source code being read? We do not have any mechanism for opening and closing files in our code. We will be using the fact that readChar always reads from stdin and we will have to use the shell to redirect files to stdin. Therefore, when using our self-hosting compiler, we will need to invoke it as follows:

$ pl0c < file.pl0 > output.c

or

$ pl0c < file.pl0 | cc -o file -x c -

For me for now, this is fine. We will eventually want to have real file handling. But that will have to come after we are successfully self-hosting.

Error handling

Because PL/0 does not allow passing parameters into procedures, when we hit an error, we will call an error procedure that prints the line number, then we need to write the pertinent error message and exit. Some day we may want to permit parameters to procedures, but again, not today.

Testing it out

In the spirit of self-hosting, let's end the day with seeing if our code so far can read itself in and print itself out. On my machine, I need this invocation:

$ pl0c new.pl0 | cc -DHAVE_STRTONUM -o new -x c -
$ ./new < new.pl0

And sure enough, we see our own source code.

On the next episode: re-writing the lexer

We're going to be going through the same process as we did the first time. The lexer is next.

Top

RSS