Brian Robert Callahan
academic, developer, with an eye towards a brighter techno-social life
All source code for this blog post can be found here.
We begin a new journey today. It is time to use our compiler, written in C accepting PL/0. What we want to do now is write a new compiler, written in PL/0 accepting PL/0. Because the compiler we have understands PL/0, we can use it until we are at the point where our new compiler is good enough to fully compile itself. Once that happens, we can discard our current compiler and from then on use the new compiler to compile itself.
But we're not going to get there in a day. We will take it, as always, one step at a time. For today, we'll talk out the plan for our self-hosting compiler and code up enough to where we have a program written in PL/0 that can read in source code and print out what it read in.
The quickest way to get to the interesting parts of a self-hosting compiler will be to directly translate (as best we can) the C implementation to PL/0. Because let's be honest: now that we've written a lexer and a parser, those are not particularly interesting problems for us anymore. We still have to write the PL/0 versions, but doing so won't much teach us anything new. We will also translate the C code generator to start out so we can most quickly see our self-hosting compiler actually self-hosting. But after that the real fun begins. We will add additional syntax and remove the C code generator in favor of QBE, which we reviewed in our previous blog post.
As a reminder for ourselves, here is the PL/0 grammar as found in a comment in our first compiler. We'll need this soon for our front end:
/* * pl0c -- PL/0 compiler. * * program = block "." . * block = [ "const" ident "=" number { "," ident "=" number } ";" ] * [ "var" ident [ array ] { "," ident [ array ] } ";" ] * { "forward" ident ";" } * { "procedure" ident ";" block ";" } statement . * statement = [ ident ":=" expression * | "call" ident * | "begin" statement { ";" statement } "end" * | "if" condition "then" statement [ "else" statement ] * | "while" condition "do" statement * | "readInt" [ "into" ] ident * | "readChar" [ "into" ] ident * | "writeInt" expression * | "writeChar" expression * | "writeStr" ( ident | string ) * | "exit" expression ] . * condition = "odd" expression * | expression ( comparator ) expression . * expression = [ "+" | "-" | "not" ] term { ( "+" | "-" | "or" ) term } . * term = factor { ( "*" | "/" | "mod" | "and" ) factor } . * factor = ident * | number * | "(" expression ")" . * comparator = "=" | "#" | "<" | ">" | "<=" | ">=" | "<>" . * array = "size" number . */
For today, let's write a program that reads in source code and writes it out, so that we know things work.
Additionally, to prepare ourselves for writing our front end, I am also going to translate all our #define
s from C to PL/0. That would be to keep the same names but mark them as const
in PL/0. I am also going to add some data structures that we know we'll need, like a symbol table, buffers for token strings, and a line number. Let's write it out and we can discuss:
{ pl0c -- PL/0 compiler written in PL/0. } const CHECK_LHS = 0, CHECK_RHS = 1, CHECK_CALL = 2, TOK_IDENT = 'I', TOK_NUMBER = 'N', TOK_CONST = 'C', TOK_VAR = 'V', TOK_PROCEDURE = 'P', TOK_CALL = 'c', TOK_BEGIN = 'B', TOK_END = 'E', TOK_IF = 'i', TOK_THEN = 'T', TOK_ELSE = 'e', TOK_WHILE = 'W', TOK_DO = 'D', TOK_ODD = 'O', TOK_WRITEINT = 'w', TOK_WRITECHAR = 'H', TOK_WRITESTR = 'S', TOK_READINT = 'R', TOK_READCHAR = 'h', TOK_INTO = 'n', TOK_SIZE = 's', TOK_EXIT = 'X', TOK_AND = '&', TOK_OR = '|', TOK_NOT = '~', TOK_DOT = '.', TOK_EQUAL = '=', TOK_COMMA = ',', TOK_SEMICOLON = ';', TOK_ASSIGN = ':', TOK_HASH = '#', TOK_LTHAN = '<', TOK_GTHAN = '>', TOK_LTHANE = '{', TOK_GTHANE = '}', TOK_PLUS = '+', TOK_MINUS = '-', TOK_MULTIPLY = '*', TOK_DIVIDE = '/', TOK_MODULO = '%', TOK_LPAREN = '(', TOK_RPAREN = ')', TOK_LBRACK = '[', TOK_RBRACK = ']', TOK_STRING = '"' ; var raw size 1048576, { 8 MB, can hold files up to 1 MB in size } symtab size 1048576, { 8 MB, can hold up to 30840 symbols } token size 32, { 31 characters + NUL } str size 32, { For cmpstr } symtabcnt, { To keep track of number of symtab entries } ret, { Return code for procedures that need one } line { line number } ; { Misc. functions } procedure error; begin writeStr 'pl0c: error: '; writeInt line; writeStr ': ' end; procedure readin; var ch, i; begin i := 0; readChar into ch; while ch # -1 do begin raw[i] := ch; i := i + 1; if i = 1048576 then { File too big! } begin call error; writeStr 'file too big\n'; exit 1 end; readChar into ch end; end; { Lexer } { Code generator } { Parser } procedure parse; begin writeStr raw end; { Main } begin call readin; call parse end.
If you remember, our arrays are always of sizeof(long)
, or 8 on my OpenBSD/amd64 machine. So even though we can only hold source code files 1 MB in size, it currently takes us 8 MB to store it. Once we have a self-hosting compiler, we will add the packed
reserved word, which will allow us to store these as bytes. But we're not there yet. Alternatively, you could discard the raw
buffer entirely and just use readChar
as you need the next character from the source code.
Speaking of reading in the source code: from where is source code being read? We do not have any mechanism for opening and closing files in our code. We will be using the fact that readChar
always reads from stdin
and we will have to use the shell to redirect files to stdin
. Therefore, when using our self-hosting compiler, we will need to invoke it as follows:
$ pl0c < file.pl0 > output.c
or
$ pl0c < file.pl0 | cc -o file -x c -
For me for now, this is fine. We will eventually want to have real file handling. But that will have to come after we are successfully self-hosting.
Because PL/0 does not allow passing parameters into procedures, when we hit an error, we will call an error
procedure that prints the line number, then we need to write the pertinent error message and exit. Some day we may want to permit parameters to procedures, but again, not today.
In the spirit of self-hosting, let's end the day with seeing if our code so far can read itself in and print itself out. On my machine, I need this invocation:
$ pl0c new.pl0 | cc -DHAVE_STRTONUM -o new -x c - $ ./new < new.pl0
And sure enough, we see our own source code.
We're going to be going through the same process as we did the first time. The lexer is next.