Brian Robert Callahan

academic, developer, with an eye towards a brighter techno-social life

2021-08-18
Let's write a compiler, part 5: A code generator

All source code for this blog post can be found here.

A PL/0 code validator, the combination of a lexer and parser into a complete front end for a compiler, is surprisingly useful. But it's also not nearly as useful as a complete PL/0 compiler is, since a complete compiler will do everything a validator can do, plus output a new representation of that code that is hopefully closer to machine code than when we started. Or, if not closer to machine code, in a representation that can be given to a compiler we already have for another language.

We have such a language and compiler sitting around: C. Makes sense, as the compiler itself is written in C and we've been using a C compiler to compile our compiler. Having our PL/0 compiler target C will make a lot of things easy on us for sure, but it will come at the cost of us skirting some very interesting challenges. With that said, I am a fan of each of these blog posts having us come away completing something whose output we can use immediately. A C generating back end will let us do that today. We can always come back later and write an assembly generating back end later, too. Also, it's our first compiler so I'm OK with making something simple that works. Besides, we should leave at least some interesting problems for all those compiler textbooks! At least for now.

After today, we will be able to generate correct (though, perhaps not so good looking) C code for all of the PL/0 code in our grammar as it exists right now. Sadly, that does not include things like basic input and output. We'll tackle that tomorrow because it'll end up being a little bit different than the code generation we will deal with today.

Code organization

In the compiler source code itself, I am going to place the code generator between the lexer and the parser. The lexer does not talk to the code generator, if you remember our action chart from the first post in this series. And the code generator never talks to the lexer or the parser. The only communication is from parser to code generator.

Additionally, I am going to prefix all the code generator functions that the parser uses with cg_ so that we can make reading through the parser code a little easier by making it clear which functions relate to the work of parsing and which do not.

Writing to stdout

All the code generator functions will call a function named aout() at least once. This function is the only function that does the work of writing to stdout. It looks like this:

/*
 * Code generator.
 */

static void
aout(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	(void) vfprintf(stdout, fmt, ap);
	va_end(ap);
}

It's basically a shorthand for (void) fprintf(stdout, "%s", ...); because we are going to use aout() a whole lot and it's shorter to type that way. This is the only function fully internal to the code generator. All the other functions we write today will be prefixed with cg_ and thus called by the parser.

Our first code generation

The first function I'd like to write we will call cg_end(). It gets called when we know parsing is complete and successful, meaning that we have a valid PL/0 program. It just prints a comment:

static void
cg_end(void)
{

	aout("/* PL/0 compiler %s */\n", PL0C_VERSION);
}

It makes it easy to see at a glance if compilation was successful, since you'll either see this comment or you won't. I also added a #define PL0C_VERSION "1.0.0" to the top of the source code. We will increment this number as we further develop the compiler in future episodes. The cg_end() function gets called at the very end of the parser:

static void
parse(void)
{

	next();
	block();
	expect(TOK_DOT);

	if (type != 0)
		error("extra tokens at end of file");

	cg_end();
}

The workflow of writing a code generator

Now we need to follow the parser. We are going to imagine the compiler has been given a PL/0 program that uses every possible branch, every possible statement, in our parser. As we follow the parser as it works through this imaginary program, our workflow for writing the code generator will roughly look like this:

+-------+
| Start |
+-------+
    |                 No
    +-+---------------------------------+
      |                                 |
      V                                 |
+-----------+    +----------+      ------------
| Execute   |    | Store    |     / Ready to   \
| parser    |--->|          |--->|              |
| statement |    | metadata |     \ emit code? /
+-----------+    +----------+      ------------
      ^                                 |
      |                                 | Yes
      |                                 |
      |            ---------            V
      |    No     / End of  \     +-----------+
      +----------|           |<---| Emit code |
                  \ parser? /     +-----------+
                   ---------
                       |
                       |Yes +-----+
                       +--->| End |
                            +-----+

And that's missing some nuance as well, but it's a good rough approximation. In some places during the parsing, we will discard some metadata. But if we stick to thinking about this workflow, we will find that most of the coding we need to do will arrive at clear spots. And then we can point out the special cases as needed.

Generating code for constants

Following the parser from the top, the first real code we deal with is in block(). And within block(), the first thing we deal with is constants. This makes sense, since we are really just following the PL/0 grammar. The first thing we do is check to see if we have a TOK_CONST and if we do, we enter the constant declaration section. Let's presume that happens. We then run expect(TOK_CONST); to consume the TOK_CONST and have the lexer fetch and return the next token. This should be a TOK_IDENT, the name of the constant.

And it is here that we'll encounter a little catch-22. We have enough information right now to output some code. But we don't want to call expect(TOK_IDENT); just yet, or else we'll discard this token for the next one. But we also don't want to output any code if we don't have a TOK_IDENT. So the general pattern I am going to follow is right before the expect(TOK_IDENT);, we'll do a "code generator check" to ensure that we have the correct token type and if we do, then we'll call the code generator to generate the code we need.

In the case that we don't have a TOK_IDENT here, this "code generator check" will fail, and then the expect(TOK_IDENT); immediately after will also fail and generate an error. So all will remain correct.

You could say this is a little inefficient. But the goal here isn't to win any efficiency awards. It's to be correct. Besides, I do not think anyone will have a problem with the speed at which this compiler will be able to compile PL/0 code. And if they do, these extra simple integer comparisons won't be the culprit.

Let's see what this looks like in action:

	if (type == TOK_CONST) {
		expect(TOK_CONST);
		if (type == TOK_IDENT)
			cg_const();
		expect(TOK_IDENT);
		expect(TOK_EQUAL);
		expect(TOK_NUMBER);
		while (type == TOK_COMMA) {
			expect(TOK_COMMA);
			if (type == TOK_IDENT)
				cg_const();
			expect(TOK_IDENT);
			expect(TOK_EQUAL);
			expect(TOK_NUMBER);
		}
		expect(TOK_SEMICOLON);
	}

Remember to put the cg_const() call in both spots because you can have multiple constants.

Now, what code do we know we can output here? We know we can (and must) output the C version of a constant declaration! However, we don't yet know what the constant will be initialized to. That's OK. We don't need to know the entire statement before outputting C code. We just need to be in a position where there is a single, unambiguous, possibility for what the C code output could be. Once we're in that position, we should write out that C code, exactly that C code, and nothing but that C code. We can always return to the code generator later to finish the statement if it isn't complete.

What C code makes sense here? The cg_const() function should write a constant declaration without the number the constant will be initialized to. That looks like this:

static void
cg_const(void)
{

	aout("const long %s=", token);
}

As always, this is not the only way things could be done. This is simply for me the most straightforward way to do it.

All our C variables will be longs. Why? Because that's what the lexer validates all numbers to be. You might also see that the code formatting for our outputted C code is quite poor. There's really little win for us here to be nitpicky about the style of the C code the compiler outputs as that will just add lots of extra complexity for very little benefit. If you want the C code to be formatted nicely, we can do something like this:

$ pl0c file.pl0 | clang-format

Presuming you have clang installed on your system, of course. If not, you can use some other code formatter.

Let's finish up generating code for constants. Same idea here: we have all the information we need to finish this line of C code once we have consumed the equals sign. Because now, if it is valid PL/0 code, we have a number. We'll do the same thing, right before we expect(TOK_NUMBER);, we'll do a "code generator check" for TOK_NUMBER and if true output the number.

All together, that looks like this:

	if (type == TOK_CONST) {
		expect(TOK_CONST);
		if (type == TOK_IDENT)
			cg_const();
		expect(TOK_IDENT);
		expect(TOK_EQUAL);
		if (type == TOK_NUMBER) {
			cg_symbol();
			cg_semicolon();
		}
		expect(TOK_NUMBER);
		while (type == TOK_COMMA) {
			expect(TOK_COMMA);
			if (type == TOK_IDENT)
				cg_const();
			expect(TOK_IDENT);
			expect(TOK_EQUAL);
			if (type == TOK_NUMBER) {
				cg_symbol();
				cg_semicolon();
			}
			expect(TOK_NUMBER);
		}
		expect(TOK_SEMICOLON);
	}

I used two code generation functions to complete the C output: cg_symbol() and cg_semicolon(). The cg_semicolon() statement is used wherever we need a semicolon. To prevent very long lines and at least give us some semblance of sane C code output, it prints a semicolon followed by a newline:

static void
cg_semicolon(void)
{

	aout(";\n");
}

And cg_symbol() is a catch all for printing whatever the current symbol happens to be. For TOK_IDENT and TOK_NUMBER, that's the token string. For everything else, it's the C equivalent of the PL/0 token:

static void
cg_symbol(void)
{

	switch (type) {
	case TOK_IDENT:
	case TOK_NUMBER:
		aout("%s", token);
		break;
	case TOK_BEGIN:
		aout("{\n");
		break;
	case TOK_END:
		aout(";\n}\n");
		break;
	case TOK_IF:
		aout("if(");
		break;
	case TOK_THEN:
	case TOK_DO:
		aout(")");
		break;
	case TOK_ODD:
		aout("(");
		break;
	case TOK_WHILE:
		aout("while(");
		break;
	case TOK_EQUAL:
		aout("==");
		break;
	case TOK_COMMA:
		aout(",");
		break;
	case TOK_ASSIGN:
		aout("=");
		break;
	case TOK_HASH:
		aout("!=");
		break;
	case TOK_LESSTHAN:
		aout("<");
		break;
	case TOK_GREATERTHAN:
		aout(">");
		break;
	case TOK_PLUS:
		aout("+");
		break;
	case TOK_MINUS:
		aout("-");
		break;
	case TOK_MULTIPLY:
		aout("*");
		break;
	case TOK_DIVIDE:
		aout("/");
		break;
	case TOK_LPAREN:
		aout("(");
		break;
	case TOK_RPAREN:
		aout(")");
	}
}

There are some odd-looking ones in that list, e.g., TOK_ODD, but we'll talk about them when we get to them.

Here, we want to output the current symbol, since we know it is a number. Then we want to output a semicolon to finish the C statement.

We are now generating correct C code for any and all const sections.

A symbol table

Not so fast though. While we are generating correct C code for const sections, we are missing a really important aspect of what a compiler does: it must enforce some semantic understanding of the source code it reads. We can think of the two problems as such: syntax checks only that the token are arranged in a form that satisfies the grammar; semantics checks that the meaning of these tokens in this arrangement is understandable.

A quick demonstration in English:

"The professor a lecture." The syntax rules of English require a verb in this sentence, yet none are present. Parsing this sentence should raise a syntax error.
"The pinball machine watered a lecture." This sentence satisfies the syntax rules of English but has no meaning. A semantic check should flag this as not understandable.

The first piece of semantic understanding we will tackle is knowing what symbols have been declared and thus are available to be referenced, ensuring that no symbols with identical names exist at the same scope (depth level, as we called it), and making adjustments to that understanding as we read in new constant and variable declarations and enter and exit procedures.

Yes, if we wanted to be really sloppy, we could defer all this knowledge to the C compiler. But that's really bad style, and won't work when we return later and write an assembly back end. We should write a compiler that only ever outputs correct C code. And so, we will need to have a symbol table.

Our symbol table will need to keep track of the following information: the name of the symbol, if the symbol represents a TOK_CONST, TOK_VAR, or TOK_PROCEDURE, and the depth level of the symbol (to allow for variable shadowing). It will need to add constants and variables as it encounters them. It will also need to add procedures as it encounters them. However, the symbol table will need to destroy all the local constants and variables inside each procedure when the procedure is exited by the parser. This is because you ought to be able to reuse the same local variable names within different procedures, since those variables live in different scopes. But the symbol table must never forget the names of the procedures themselves, since that could lead to a situation where two procedures have the same name and it would be impossible to disambiguate a call to that name.

Ugh, complicated.

Let's try to uncomplicate things as much as we can. First, here is some ASCII art of the life of a symbol table to depict what I'm getting at. In order, we see the symbol table at the beginning of execution, after all globals have been declared, after a procedure and its locals have been declared, and at the end of processing a procedure:

+---------------+ +---------------+ +---------------+ +---------------+
| Sentinel      | | Sentinel      | | Sentinel      | | Sentinel      |
+---------------+ +---------------+ +---------------+ +---------------+
                  | Global consts | | Global consts | | Global consts |
                  +---------------+ +---------------+ +---------------+
                  | Global vars   | | Global vars   | | Global vars   |
                  +---------------+ +---------------+ +---------------+
                                    | Procedure     | | Procedure     |
                                    +---------------+ +---------------+
                                    | Local consts  |
                                    +---------------+
                                    | Local vars    |
                                    +---------------+

Here is the basic data structure for a symbol table entry:

struct symtab {
	int depth;
	int type;
	char *name;
	struct symtab *next;
};
static struct symtab *head;

We can get away with a singly linked list here. It isn't the fastest data structure we could use, but again, it will be correct and that's our goal here today. We also start the list with a head pointer.

I am going to initialize the symbol table with a sentinel before we even enter the parser, which will always be the first node in the list. This sentinel will represent the "main" procedure within which all our procedures are nested. I'll make it literally a procedure named main; it will prevent a procedure with that name from being created, causing issues when the C compiler takes over. It also prevents us from having to do a "is the symbol table empty or not" dance whenever we add symbols. We'll know the symbol table always has at least one symbol in it:

static void
initsymtab(void)
{
	struct symtab *new;

	if ((new = malloc(sizeof(struct symtab))) == NULL)
		error("malloc failed");

	new->depth = 0;
	new->type = TOK_PROCEDURE;
	new->name = "main";
	new->next = NULL;

	head = new;
}

OK, now we're also ready to add a function to add symbols the compiler finds to the symbol table:

static void
addsymbol(int type)
{
	struct symtab *curr, *new;

	curr = head;
	while (1) {
		if (!strcmp(curr->name, token)) {
			if (curr->depth == (depth - 1))
				error("duplicate symbol: %s", token);
		}

		if (curr->next == NULL)
			break;

		curr = curr->next;
	}

	if ((new = malloc(sizeof(struct symtab))) == NULL)
		error("malloc failed");

	new->depth = depth - 1;
	new->type = type;
	if ((new->name = strdup(token)) == NULL)
		error("malloc failed");
	new->next = NULL;

	curr->next = new;
}

We read through the entire linked list. If there is a symbol name match, we check to make sure the depth level isn't the same. If it is, that's a duplicate symbol error. You'll notice we always do this depth - 1 thing; that's because in block(), we increment the depth level immediately after the check to make sure we don't have nested procedures and so the depth variable is always one level higher than we're currently at.

Once we're at the last item in the list and checked that there's no duplicate, we create a new symbol. We record the token string, whether this identifier had a TOK_CONST, TOK_VAR, or TOK_PROCEDURE preceding it, and the depth level. Because we are at the end of the list, there is no next for this symbol. We then attach this new symbol to the next of the current last symbol.

Finally, we need a function to destroy symbols. Like I mentioned, we don't want to destroy all symbols. We only want to destroy the local TOK_CONST and TOK_VAR symbols for the procedure the parser just left. We want to leave all the TOK_PROCEDURE symbols. At the end of compilation, our symbol table should be a list beginning with the sentinel, followed by global constants and variables, and then listing all the TOK_PROCEDURE names in the source code file.

Since the TOK_PROCEDURE symbol is always the first symbol recorded and its local TOK_CONST and TOK_VAR symbols come after, we can jump to the end of the list and check to see if this symbol is a TOK_PROCEDURE. If it is not, we should destroy it, as it's a local TOK_CONST or TOK_VAR. We should keep destroying symbols until we find a TOK_PROCEDURE at the end of the list:

static void
destroysymbols(void)
{
	struct symtab *curr, *prev;

again:
	curr = head;
	while (curr->next != NULL) {
		prev = curr;
		curr = curr->next;
	}

	if (curr->type != TOK_PROCEDURE) {
		free(curr->name);
		free(curr);
		prev->next = NULL;
		goto again;
	}
}

Again, if you're allergic to goto, you can place this in a loop.

You'll note that we never fully clean up after ourselves with the linked list. The compiler is such a short-lived program a typical strategy is to just let the operating system clean up after us when the compiler exits.

Really finishing constant sections

OK, now we can place addsymbol() into the parser and be finished with constant sections for good. Here is, finally, our completed constant section code:

	if (type == TOK_CONST) {
		expect(TOK_CONST);
		if (type == TOK_IDENT) {
			addsymbol(TOK_CONST);
			cg_const();
		}
		expect(TOK_IDENT);
		expect(TOK_EQUAL);
		if (type == TOK_NUMBER) {
			cg_symbol();
			cg_semicolon();
		}
		expect(TOK_NUMBER);
		while (type == TOK_COMMA) {
			expect(TOK_COMMA);
			if (type == TOK_IDENT) {
				addsymbol(TOK_CONST);
				cg_const();
			}
			expect(TOK_IDENT);
			expect(TOK_EQUAL);
			if (type == TOK_NUMBER) {
				cg_symbol();
				cg_semicolon();
			}
			expect(TOK_NUMBER);
		}
		expect(TOK_SEMICOLON);
	}

Code generation for variable sections

We can use everything we just learned to write the code generation for constant sections and symbol table adding to add the necessary bits to process variable sections:

	if (type == TOK_VAR) {
		expect(TOK_VAR);
		if (type == TOK_IDENT) {
			addsymbol(TOK_VAR);
			cg_var();
		}
		expect(TOK_IDENT);
		while (type == TOK_COMMA) {
			expect(TOK_COMMA);
			if (type == TOK_IDENT) {
				addsymbol(TOK_VAR);
				cg_var();
			}
			expect(TOK_IDENT);
		}
		expect(TOK_SEMICOLON);
		cg_crlf();
	}

One new code generation function which is just for style: cg_crlf() which adds a newline to the output C code:

static void
cg_crlf(void)
{

	aout("\n");
}

It only gets used in this one place but I would like to enforce the coding style where aout() is fully internal to the code generator, hence this wrapper function.

Here is the code for cg_var() which looks surprisingly similar to the cg_const() function we just wrote:

static void
cg_var(void)
{

	aout("long %s;\n", token);
}

Unlike constants, which must be initialized, variables may not be initialized. So we can write a complete C statement here. And that's all we need to do. We have completed code generation for all constant and variable sections! We're making good progress.

Code generation for procedures

Things will begin to get more difficult. We have this strange situation from a C perspective where in PL/0 all the procedures technically live inside what we might consider to be the main() function, which isn't a thing in C. Also, the main procedure itself has no name. We'll need to have some way of tracking whether or not we are in a named procedure or in the unnamed main procedure. We'll create a new global variable named proc to keep track of that information.

Other than that, the routine is nearly the same as we did for constants and variables:

	while (type == TOK_PROCEDURE) {
		proc = 1;

		expect(TOK_PROCEDURE);
		if (type == TOK_IDENT) {
			addsymbol(TOK_PROCEDURE);
			cg_procedure();
		}
		expect(TOK_IDENT);
		expect(TOK_SEMICOLON);

		block();

		expect(TOK_SEMICOLON);

		proc = 0;

		destroysymbols();
	}

	if (proc == 0)
		cg_procedure();

Remember: at the end of each procedure we need to destroy all the TOK_CONST and TOK_VAR symbols local to this procedure. We also need to reset the global proc variable back to 0 because at some point we will enter that main() block and need to deal with that.

The code to generate a function in C looks like this:

static void
cg_procedure(void)
{

	if (proc == 0) {
		aout("int\n");
		aout("main(int argc, char *argv[])\n");
	} else {
		aout("void\n");
		aout("%s(void)\n", token);
	}

	aout("{\n");
}

Finally, I'll end each C function with a little epilogue that closes things up and if we're in main() adds a nice return 0; line:

static void
cg_epilogue(void)
{

	aout(";");

	if (proc == 0)
		aout("return 0;");

	aout("\n}\n\n");
}

We'll probably have a bunch of extra semicolons in our output C code. We have to account for the fact that in Pascal-like languages, the semicolon is a statement separator but in C-like languages, the semicolon is the statement terminator. Empty statements are legal in C and do nothing, so they don't detract from the correctness of our output C code so I am not worried about liberal production of empty statements in the code generator.

All together now, here is our finished block() function that will generate correct C code for all constant, variable, and procedure declarations:

static void
block(void)
{

	if (depth++ > 1)
		error("nesting depth exceeded");

	if (type == TOK_CONST) {
		expect(TOK_CONST);
		if (type == TOK_IDENT) {
			addsymbol(TOK_CONST);
			cg_const();
		}
		expect(TOK_IDENT);
		expect(TOK_EQUAL);
		if (type == TOK_NUMBER) {
			cg_symbol();
			cg_semicolon();
		}
		expect(TOK_NUMBER);
		while (type == TOK_COMMA) {
			expect(TOK_COMMA);
			if (type == TOK_IDENT) {
				addsymbol(TOK_CONST);
				cg_const();
			}
			expect(TOK_IDENT);
			expect(TOK_EQUAL);
			if (type == TOK_NUMBER) {
				cg_symbol();
				cg_semicolon();
			}
			expect(TOK_NUMBER);
		}
		expect(TOK_SEMICOLON);
	}

	if (type == TOK_VAR) {
		expect(TOK_VAR);
		if (type == TOK_IDENT) {
			addsymbol(TOK_VAR);
			cg_var();
		}
		expect(TOK_IDENT);
		while (type == TOK_COMMA) {
			expect(TOK_COMMA);
			if (type == TOK_IDENT) {
				addsymbol(TOK_VAR);
				cg_var();
			}
			expect(TOK_IDENT);
		}
		expect(TOK_SEMICOLON);
		cg_crlf();
	}

	while (type == TOK_PROCEDURE) {
		proc = 1;

		expect(TOK_PROCEDURE);
		if (type == TOK_IDENT) {
			addsymbol(TOK_PROCEDURE);
			cg_prologue();
		}
		expect(TOK_IDENT);
		expect(TOK_SEMICOLON);

		block();

		expect(TOK_SEMICOLON);

		proc = 0;

		destroysymbols();
	}

	if (proc == 0)
		cg_prologue();

	statement();

	cg_epilogue();

	if (--depth < 0)
		error("nesting depth fell below 0");
}

Code generation for statements

We're doing great. Let's keep up the momentum and dive into code generation for statements. If we can get this done, we'll be finished with our compiler!

This will take a little more planning, as there will be some interesting problems we will need to solve. The first problem we need to solve is a semantic one: the grammar says an assignment statement is an identifier followed by an assignment operator followed by an expression. But an identifier can represent a constant, a variable, or a procedure. Only for variables does assignment make sense. Good thing our symbol table records that information!

Left-hand side versus right-hand side

Whenever an identifier is referenced in a statement, we need to know two things. First, we need to know if that identifier was previously declared since we can only reference identifiers that were previously declared. We must always know this no matter where in the statement the identifier is found. We also need to know if this identifier makes semantic sense for its location. That is roughly divided into the left-hand side and the right-hand side of the assignment operator. On the left-hand side, we can only have identifiers declared as TOK_VAR. On the right-hand side, we can have either a TOK_VAR or a TOK_CONST but not a TOK_PROCEDURE. We also need to make sure that if we have a call statement, the identifier that follows is a TOK_PROCEDURE. I am going to create three new #defines: CHECK_LHS, CHECK_RHS, and CHECK_CALL that we can use as a parameter to a new function called symcheck() that does this semantic checking for us. The function looks like this:

/*
 * Semantics.
 */

static void
symcheck(int check)
{
	struct symtab *curr, *ret = NULL;

	curr = head;
	while (curr != NULL) {
		if (!strcmp(token, curr->name))
			ret = curr;
		curr = curr->next;
	}

	if (ret == NULL)
		error("undefined symbol: %s", token);

	switch (check) {
	case CHECK_LHS:
		if (ret->type != TOK_VAR)
			error("must be a variable: %s", token);
		break;
	case CHECK_RHS:
		if (ret->type == TOK_PROCEDURE)
			error("must not be a procedure: %s", token);
		break;
	case CHECK_CALL:
		if (ret->type != TOK_PROCEDURE)
			error("must be a procedure: %s", token);
	}
}

It's also important that we check the correct identifier if there is a local identifier shadowing a global identifier. That's why we iterate down the whole linked list even if we find a match.

Returning to code generation for statements

Now that we have semantic checking in place, we can return to code generation for statements. Let's start at the top and move on down. First is assignment statements. We want to check that we have a left-hand side identifier, output that symbol to our C code, output the C assignment operator, then deal with an expression, which we'll worry about later. In code:

	switch (type) {
	case TOK_IDENT:
		symcheck(CHECK_LHS);
		cg_symbol();
		expect(TOK_IDENT);
		if (type == TOK_ASSIGN)
			cg_symbol();
		expect(TOK_ASSIGN);
		expression();
		break;

Next is a call statement. We need to check to ensure the identifier is a TOK_PROCEDURE and we'll probably need a new code generator function to handle calls. In code:

	case TOK_CALL:
		expect(TOK_CALL);
		if (type == TOK_IDENT) {
			symcheck(CHECK_CALL);
			cg_call();
		}
		expect(TOK_IDENT);
		break;

As mentioned, we need a new code generator function for cg_call():

static void
cg_call(void)
{

	aout("%s();\n", token);
}

Next is a begin ... end statement. Both begin and end are handled in cg_symbol() so no new code generator functions to write. We also need to remember that the semicolon is the statement separator and output a semicolon when we encounter one:

	case TOK_BEGIN:
		cg_symbol();
		expect(TOK_BEGIN);
		statement();
		while (type == TOK_SEMICOLON) {
			cg_semicolon();
			expect(TOK_SEMICOLON);
			statement();
		}
		if (type == TOK_END)
			cg_symbol();
		expect(TOK_END);
		break;

Next is an if ... then statement. Like begin and end, if and then are handled in cg_symbol() so no new code generator functions. Like expression(), we'll defer worrying about condition() until we get there:

	case TOK_IF:
		cg_symbol();
		expect(TOK_IF);
		condition();
		if (type == TOK_THEN)
			cg_symbol();
		expect(TOK_THEN);
		statement();
		break;

Last is the while ... do statement. It is nearly identical to an if ... then statement:

	case TOK_WHILE:
		cg_symbol();
		expect(TOK_WHILE);
		condition();
		if (type == TOK_DO)
			cg_symbol();
		expect(TOK_DO);
		statement();
		break;

All together:

static void
statement(void)
{

	switch (type) {
	case TOK_IDENT:
		symcheck(CHECK_LHS);
		cg_symbol();
		expect(TOK_IDENT);
		if (type == TOK_ASSIGN)
			cg_symbol();
		expect(TOK_ASSIGN);
		expression();
		break;
	case TOK_CALL:
		expect(TOK_CALL);
		if (type == TOK_IDENT) {
			symcheck(CHECK_CALL);
			cg_call();
		}
		expect(TOK_IDENT);
		break;
	case TOK_BEGIN:
		cg_symbol();
		expect(TOK_BEGIN);
		statement();
		while (type == TOK_SEMICOLON) {
			cg_semicolon();
			expect(TOK_SEMICOLON);
			statement();
		}
		if (type == TOK_END)
			cg_symbol();
		expect(TOK_END);
		break;
	case TOK_IF:
		cg_symbol();
		expect(TOK_IF);
		condition();
		if (type == TOK_THEN)
			cg_symbol();
		expect(TOK_THEN);
		statement();
		break;
	case TOK_WHILE:
		cg_symbol();
		expect(TOK_WHILE);
		condition();
		if (type == TOK_DO)
			cg_symbol();
		expect(TOK_DO);
		statement();
		break;
	}
}

Code generation for conditions

Now we can write the code generation for conditions. The odd one out is TOK_ODD. That will need a new code generator function. But the arrangement is interesting. The TOK_ODD symbol is handled in cg_symbol() but it just opens a parentheses for the expression to live inside. The interesting part of an odd condition comes at the end:

static void
condition(void)
{

	if (type == TOK_ODD) {
		cg_symbol();
		expect(TOK_ODD);
		expression();
		cg_odd();
	} else {
		expression();

		switch (type) {
		case TOK_EQUAL:
		case TOK_HASH:
		case TOK_LESSTHAN:
		case TOK_GREATERTHAN:
			cg_symbol();
			next();
			break;
		default:
			error("invalid conditional");
		}

		expression();
	}
}

The cg_odd() function needs to output code that tests if the expression is even or odd. The quick and easy way to do that is to use expression & 1 to check if the last bit is 1 or 0. That yields this function:

static void
cg_odd(void)
{

	aout(")&1");
}

Code generation for expressions

All an expression needs to do is write out a + or a - at the right time:

static void
expression(void)
{

	if (type == TOK_PLUS || type == TOK_MINUS) {
		cg_symbol();
		next();
	}

	term();

	while (type == TOK_PLUS || type == TOK_MINUS) {
		cg_symbol();
		next();
		term();
	}
}

Code generation for terms

Like expressions, all terms need to do is write out * or / at the right time:

static void
term(void)
{

	factor();

	while (type == TOK_MULTIPLY || type == TOK_DIVIDE) {
		cg_symbol();
		next();
		factor();
	}
}

Code generation for factors

Lastly is factors. A factor can either be a right-hand side identifier, a number, or an expression wrapped in parentheses:

static void
factor(void)
{

	switch (type) {
	case TOK_IDENT:
		symcheck(CHECK_RHS);
		/* Fallthru */
	case TOK_NUMBER:
		cg_symbol();
		next();
		break;
	case TOK_LPAREN:
		cg_symbol();
		expect(TOK_LPAREN);
		expression();
		if (type == TOK_RPAREN)
			cg_symbol();
		expect(TOK_RPAREN);
	}
}

We did it! That's all the functions. We have a completed code generator, and by extension a completed compiler!

Why this works

While we did solve a lot of interesting problems to get here, there are even more problems to solve if we were to generate assembly code. If we were generating assembly, we would need to solve interesting problems such as converting infix notation into postfix notation for assembly output. We got lucky: it turns out that a valid PL/0 expression is also a valid C expression so we were able to just write out what we saw when we saw it.

If we were compiling to assembly, we would also need to consider questions about variable referencing, using a stack, efficient register usage, and perhaps some rudimentary optimization.

These are interesting problems well worth learning how to solve. We can return to them later. We did a lot today, and I think it's time to put away the text editor until tomorrow.

All together now

Here is our complete compiler:

#include <sys/stat.h>

#include <ctype.h>
#include <fcntl.h>
#include <limits.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PL0C_VERSION	"1.0.0"

#define CHECK_LHS	0
#define CHECK_RHS	1
#define CHECK_CALL	2

#define TOK_IDENT	'I'
#define TOK_NUMBER	'N'
#define TOK_CONST	'C'
#define TOK_VAR		'V'
#define TOK_PROCEDURE	'P'
#define TOK_CALL	'c'
#define TOK_BEGIN	'B'
#define TOK_END		'E'
#define TOK_IF		'i'
#define TOK_THEN	'T'
#define TOK_WHILE	'W'
#define TOK_DO		'D'
#define TOK_ODD		'O'
#define TOK_DOT		'.'
#define TOK_EQUAL	'='
#define TOK_COMMA	','
#define TOK_SEMICOLON	';'
#define TOK_ASSIGN	':'
#define TOK_HASH	'#'
#define TOK_LESSTHAN	'<'
#define TOK_GREATERTHAN	'>'
#define TOK_PLUS	'+'
#define TOK_MINUS	'-'
#define TOK_MULTIPLY	'*'
#define TOK_DIVIDE	'/'
#define TOK_LPAREN	'('
#define TOK_RPAREN	')'

/*
 * pl0c -- PL/0 compiler.
 *
 * program	= block "." .
 * block	= [ "const" ident "=" number { "," ident "=" number } ";" ]
 *		  [ "var" ident { "," ident } ";" ]
 *		  { "procedure" ident ";" block ";" } statement .
 * statement	= [ ident ":=" expression
 *		  | "call" ident
 *		  | "begin" statement { ";" statement } "end"
 *		  | "if" condition "then" statement
 *		  | "while" condition "do" statement ] .
 * condition	= "odd" expression
 *		| expression ( "=" | "#" | "<" | ">" ) expression .
 * expression	= [ "+" | "-" ] term { ( "+" | "-" ) term } .
 * term		= factor { ( "*" | "/" ) factor } .
 * factor	= ident
 *		| number
 *		| "(" expression ")" .
 */

static char *raw, *token;
static int depth, proc, type;
static size_t line = 1;

struct symtab {
	int depth;
	int type;
	char *name;
	struct symtab *next;
};
static struct symtab *head;

static void expression(void);

/*
 * Misc. functions.
 */

static void
error(const char *fmt, ...)
{
	va_list ap;

	(void) fprintf(stderr, "pl0c: error: %lu: ", line);

	va_start(ap, fmt);
	(void) vfprintf(stderr, fmt, ap);
	va_end(ap);

	(void) fputc('\n', stderr);

	exit(1);
}

static void
readin(char *file)
{
	int fd;
	struct stat st;

	if (strrchr(file, '.') == NULL)
		error("file must end in '.pl0'");

	if (!!strcmp(strrchr(file, '.'), ".pl0"))
		error("file must end in '.pl0'");

	if ((fd = open(file, O_RDONLY)) == -1)
		error("couldn't open %s", file);

	if (fstat(fd, &st) == -1)
		error("couldn't get size");

	if ((raw = malloc(st.st_size + 1)) == NULL)
		error("malloc failed");

	if (read(fd, raw, st.st_size) != st.st_size)
		error("couldn't read %s", file);
	raw[st.st_size] = '\0';

	(void) close(fd);
}

/*
 * Lexer.
 */

static void
comment(void)
{
	int ch;

	while ((ch = *raw++) != '}') {
		if (ch == '\0')
			error("unterminated comment");

		if (ch == '\n')
			++line;
	}
}

static int
ident(void)
{
	char *p;
	size_t i, len;

	p = raw;
	while (isalnum(*raw) || *raw == '_')
		++raw;

	len = raw - p;

	--raw;

	free(token);

	if ((token = malloc(len + 1)) == NULL)
		error("malloc failed");

	for (i = 0; i < len; i++)
		token[i] = *p++;
	token[i] = '\0';

	if (!strcmp(token, "const"))
		return TOK_CONST;
	else if (!strcmp(token, "var"))
		return TOK_VAR;
	else if (!strcmp(token, "procedure"))
		return TOK_PROCEDURE;
	else if (!strcmp(token, "call"))
		return TOK_CALL;
	else if (!strcmp(token, "begin"))
		return TOK_BEGIN;
	else if (!strcmp(token, "end"))
		return TOK_END;
	else if (!strcmp(token, "if"))
		return TOK_IF;
	else if (!strcmp(token, "then"))
		return TOK_THEN;
	else if (!strcmp(token, "while"))
		return TOK_WHILE;
	else if (!strcmp(token, "do"))
		return TOK_DO;
	else if (!strcmp(token, "odd"))
		return TOK_ODD;

	return TOK_IDENT;
}

static int
number(void)
{
	const char *errstr;
	char *p;
	size_t i, j = 0, len;

	p = raw;
	while (isdigit(*raw) || *raw == '_')
		++raw;

	len = raw - p;

	--raw;

	free(token);

	if ((token = malloc(len + 1)) == NULL)
		error("malloc failed");

	for (i = 0; i < len; i++) {
		if (isdigit(*p))
			token[j++] = *p;
		++p;
	}
	token[j] = '\0';

	(void) strtonum(token, 0, LONG_MAX, &errstr);
	if (errstr != NULL)
		error("invalid number: %s", token);

	return TOK_NUMBER;
}

static int
lex(void)
{

again:
	/* Skip whitespace.  */
	while (*raw == ' ' || *raw == '\t' || *raw == '\n') {
		if (*raw++ == '\n')
			++line;
	}

	if (isalpha(*raw) || *raw == '_')
		return ident();

	if (isdigit(*raw))
		return number();

	switch (*raw) {
	case '{':
		comment();
		goto again;
	case '.':
	case '=':
	case ',':
	case ';':
	case '#':
	case '<':
	case '>':
	case '+':
	case '-':
	case '*':
	case '/':
	case '(':
	case ')':
		return (*raw);
	case ':':
		if (*++raw != '=')
			error("unknown token: ':%c'", *raw);

		return TOK_ASSIGN;
	case '\0':
		return 0;
	default:
		error("unknown token: '%c'", *raw);
	}

	return 0;
}

/*
 * Code generator.
 */

static void
aout(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	(void) vfprintf(stdout, fmt, ap);
	va_end(ap);
}

static void
cg_call(void)
{

	aout("%s();\n", token);
}

static void
cg_const(void)
{

	aout("const long %s=", token);
}

static void
cg_crlf(void)
{

	aout("\n");
}

static void
cg_end(void)
{

	aout("/* PL/0 compiler %s */\n", PL0C_VERSION);
}

static void
cg_epilogue(void)
{

	aout(";");

	if (proc == 0)
		aout("return 0;");

	aout("\n}\n\n");
}

static void
cg_odd(void)
{

	aout(")&1");
}

static void
cg_procedure(void)
{

	if (proc == 0) {
		aout("int\n");
		aout("main(int argc, char *argv[])\n");
	} else {
		aout("void\n");
		aout("%s(void)\n", token);
	}

	aout("{\n");
}

static void
cg_semicolon(void)
{

	aout(";\n");
}

static void
cg_symbol(void)
{

	switch (type) {
	case TOK_IDENT:
	case TOK_NUMBER:
		aout("%s", token);
		break;
	case TOK_BEGIN:
		aout("{\n");
		break;
	case TOK_END:
		aout(";\n}\n");
		break;
	case TOK_IF:
		aout("if(");
		break;
	case TOK_THEN:
	case TOK_DO:
		aout(")");
		break;
	case TOK_ODD:
		aout("(");
		break;
	case TOK_WHILE:
		aout("while(");
		break;
	case TOK_EQUAL:
		aout("==");
		break;
	case TOK_COMMA:
		aout(",");
		break;
	case TOK_ASSIGN:
		aout("=");
		break;
	case TOK_HASH:
		aout("!=");
		break;
	case TOK_LESSTHAN:
		aout("<");
		break;
	case TOK_GREATERTHAN:
		aout(">");
		break;
	case TOK_PLUS:
		aout("+");
		break;
	case TOK_MINUS:
		aout("-");
		break;
	case TOK_MULTIPLY:
		aout("*");
		break;
	case TOK_DIVIDE:
		aout("/");
		break;
	case TOK_LPAREN:
		aout("(");
		break;
	case TOK_RPAREN:
		aout(")");
	}
}

static void
cg_var(void)
{

	aout("long %s;\n", token);
}

/*
 * Semantics.
 */

static void
symcheck(int check)
{
	struct symtab *curr, *ret = NULL;

	curr = head;
	while (curr != NULL) {
		if (!strcmp(token, curr->name))
			ret = curr;
		curr = curr->next;
	}

	if (ret == NULL)
		error("undefined symbol: %s", token);

	switch (check) {
	case CHECK_LHS:
		if (ret->type != TOK_VAR)
			error("must be a variable: %s", token);
		break;
	case CHECK_RHS:
		if (ret->type == TOK_PROCEDURE)
			error("must not be a procedure: %s", token);
		break;
	case CHECK_CALL:
		if (ret->type != TOK_PROCEDURE)
			error("must be a procedure: %s", token);
	}
}

/*
 * Parser.
 */

static void
next(void)
{

	type = lex();
	++raw;
}

static void
expect(int match)
{

	if (match != type)
		error("syntax error");

	next();
}

static void
addsymbol(int type)
{
	struct symtab *curr, *new;

	curr = head;
	while (1) {
		if (!strcmp(curr->name, token)) {
			if (curr->depth == (depth - 1))
				error("duplicate symbol: %s", token);
		}

		if (curr->next == NULL)
			break;

		curr = curr->next;
	}

	if ((new = malloc(sizeof(struct symtab))) == NULL)
		error("malloc failed");

	new->depth = depth - 1;
	new->type = type;
	if ((new->name = strdup(token)) == NULL)
		error("malloc failed");
	new->next = NULL;

	curr->next = new;
}

static void
destroysymbols(void)
{
	struct symtab *curr, *prev;

again:
	curr = head;
	while (curr->next != NULL) {
		prev = curr;
		curr = curr->next;
	}

	if (curr->type != TOK_PROCEDURE) {
		free(curr->name);
		free(curr);
		prev->next = NULL;
		goto again;
	}
}

static void
initsymtab(void)
{
	struct symtab *new;

	if ((new = malloc(sizeof(struct symtab))) == NULL)
		error("malloc failed");

	new->depth = 0;
	new->type = TOK_PROCEDURE;
	new->name = "main";
	new->next = NULL;

	head = new;
}

static void
factor(void)
{

	switch (type) {
	case TOK_IDENT:
		symcheck(CHECK_RHS);
		/* Fallthru */
	case TOK_NUMBER:
		cg_symbol();
		next();
		break;
	case TOK_LPAREN:
		cg_symbol();
		expect(TOK_LPAREN);
		expression();
		if (type == TOK_RPAREN)
			cg_symbol();
		expect(TOK_RPAREN);
	}
}

static void
term(void)
{

	factor();

	while (type == TOK_MULTIPLY || type == TOK_DIVIDE) {
		cg_symbol();
		next();
		factor();
	}
}

static void
expression(void)
{

	if (type == TOK_PLUS || type == TOK_MINUS) {
		cg_symbol();
		next();
	}

	term();

	while (type == TOK_PLUS || type == TOK_MINUS) {
		cg_symbol();
		next();
		term();
	}
}

static void
condition(void)
{

	if (type == TOK_ODD) {
		cg_symbol();
		expect(TOK_ODD);
		expression();
		cg_odd();
	} else {
		expression();

		switch (type) {
		case TOK_EQUAL:
		case TOK_HASH:
		case TOK_LESSTHAN:
		case TOK_GREATERTHAN:
			cg_symbol();
			next();
			break;
		default:
			error("invalid conditional");
		}

		expression();
	}
}

static void
statement(void)
{

	switch (type) {
	case TOK_IDENT:
		symcheck(CHECK_LHS);
		cg_symbol();
		expect(TOK_IDENT);
		if (type == TOK_ASSIGN)
			cg_symbol();
		expect(TOK_ASSIGN);
		expression();
		break;
	case TOK_CALL:
		expect(TOK_CALL);
		if (type == TOK_IDENT) {
			symcheck(CHECK_CALL);
			cg_call();
		}
		expect(TOK_IDENT);
		break;
	case TOK_BEGIN:
		cg_symbol();
		expect(TOK_BEGIN);
		statement();
		while (type == TOK_SEMICOLON) {
			cg_semicolon();
			expect(TOK_SEMICOLON);
			statement();
		}
		if (type == TOK_END)
			cg_symbol();
		expect(TOK_END);
		break;
	case TOK_IF:
		cg_symbol();
		expect(TOK_IF);
		condition();
		if (type == TOK_THEN)
			cg_symbol();
		expect(TOK_THEN);
		statement();
		break;
	case TOK_WHILE:
		cg_symbol();
		expect(TOK_WHILE);
		condition();
		if (type == TOK_DO)
			cg_symbol();
		expect(TOK_DO);
		statement();
		break;
	}
}

static void
block(void)
{

	if (depth++ > 1)
		error("nesting depth exceeded");

	if (type == TOK_CONST) {
		expect(TOK_CONST);
		if (type == TOK_IDENT) {
			addsymbol(TOK_CONST);
			cg_const();
		}
		expect(TOK_IDENT);
		expect(TOK_EQUAL);
		if (type == TOK_NUMBER) {
			cg_symbol();
			cg_semicolon();
		}
		expect(TOK_NUMBER);
		while (type == TOK_COMMA) {
			expect(TOK_COMMA);
			if (type == TOK_IDENT) {
				addsymbol(TOK_CONST);
				cg_const();
			}
			expect(TOK_IDENT);
			expect(TOK_EQUAL);
			if (type == TOK_NUMBER) {
				cg_symbol();
				cg_semicolon();
			}
			expect(TOK_NUMBER);
		}
		expect(TOK_SEMICOLON);
	}

	if (type == TOK_VAR) {
		expect(TOK_VAR);
		if (type == TOK_IDENT) {
			addsymbol(TOK_VAR);
			cg_var();
		}
		expect(TOK_IDENT);
		while (type == TOK_COMMA) {
			expect(TOK_COMMA);
			if (type == TOK_IDENT) {
				addsymbol(TOK_VAR);
				cg_var();
			}
			expect(TOK_IDENT);
		}
		expect(TOK_SEMICOLON);
		cg_crlf();
	}

	while (type == TOK_PROCEDURE) {
		proc = 1;

		expect(TOK_PROCEDURE);
		if (type == TOK_IDENT) {
			addsymbol(TOK_PROCEDURE);
			cg_procedure();
		}
		expect(TOK_IDENT);
		expect(TOK_SEMICOLON);

		block();

		expect(TOK_SEMICOLON);

		proc = 0;

		destroysymbols();
	}

	if (proc == 0)
		cg_procedure();

	statement();

	cg_epilogue();

	if (--depth < 0)
		error("nesting depth fell below 0");
}

static void
parse(void)
{

	next();
	block();
	expect(TOK_DOT);

	if (type != 0)
		error("extra tokens at end of file");

	cg_end();
}

/*
 * Main.
 */

int
main(int argc, char *argv[])
{

	if (argc != 2) {
		(void) fputs("usage: pl0c file.pl0\n", stderr);
		exit(1);
	}

	readin(argv[1]);

	initsymtab();

	parse();

	return 0;
}

On the next episode: Input and output

As I've said a few times now, we have no input or output. The grammar didn't provide that for us. We'll do it ourselves. This will require us to extend all three parts of the compiler and as such is a good learning exercise for extending compilers.

Brian Robert Callahan

[prev]

[next]

2021-08-18
Let's write a compiler, part 5: A code generator

Code organization

Writing to stdout

Our first code generation

The workflow of writing a code generator

Generating code for constants

A symbol table

Really finishing constant sections

Code generation for variable sections

Code generation for procedures

Code generation for statements

Left-hand side versus right-hand side

Returning to code generation for statements

Code generation for conditions

Code generation for expressions

Code generation for terms

Code generation for factors

Why this works

All together now

On the next episode: Input and output

Brian Robert Callahan

[prev]

[next]

2021-08-18Let's write a compiler, part 5: A code generator

Code organization

Writing to stdout

Our first code generation

The workflow of writing a code generator

Generating code for constants

A symbol table

Really finishing constant sections

Code generation for variable sections

Code generation for procedures

Code generation for statements

Left-hand side versus right-hand side

Returning to code generation for statements

Code generation for conditions

Code generation for expressions

Code generation for terms

Code generation for factors

Why this works

All together now

On the next episode: Input and output

2021-08-18
Let's write a compiler, part 5: A code generator