Brian Robert Callahan

academic, developer, with an eye towards a brighter techno-social life



[prev]
[next]

2021-08-26
Let's write a compiler, part 8: Strings, forward references, and conclusion

All source code for this blog post can be found here.

This is it. After today, we will have a completed PL/0 compiler that should happily compile any PL/0 code in our dialect. Today, we will wrap things up by supporting string literals and forward declarations of procedures. Supporting string literals will greatly reduce our programming effort for the self-hosting compiler. Forward declarations will allow us to solve the problem of procedures that are mutually recursive, which we will need to be able to write our self-hosting compiler. After we are finished writing code for today, we'll reflect on what we've managed to achieve and set our sights for the next blog series on taking this compiler and using it as a bootstrap compiler for our soon-to-be new goal of writing a self-hosting compiler.

String literals

String literals will be used only in a single context in this compiler: writeStr 'This is a string.\n'. We could forgo string literals entirely for using arrays, such as:

array[0] = 'T';
array[1] = 'h';
array[2] = 'i';
array[3] = 's';
array[4] = ' ';
array[5] = 'i';
array[6] = 's';
array[7] = ' ';
array[8] = 'a';
array[9] = ' ';
array[10] = 's';
array[11] = 't';
array[12] = 'r';
array[13] = 'i';
array[14] = 'n';
array[15] = 'g';
array[16] = '.';
array[17] = '\n';
writeStr array

But as we can see, that's a lot more work for read-only strings. So let's implement string literals. We'll defer the question of assigning strings directly to arrays for our self-hosting compiler later.

We'll start by #define TOK_STRING '"' in our list of lexer token types. We then need to teach our lexer about string literals. But first, we need to teach ourselves about strings in Pascal.

Strings in Pascal

Unlike C, strings in Pascal are slighly more complicated. We're going to say that string literals are ASCII characters encased in single quotes, ' ... '. If we need to write a single quote character itself, you write two single quotes next to each other, ''. You can write escape characters as in C, for example, \n for a newline and \0 for a NUL-byte.

Lexing strings

The lexer needs to be able to spit out the C version of a Pascal string. That way we can have the code generator drop the C string into the output source code without needing to do any additional work. I also overcomplicated things in the lexer by detecting if a string is of length 1 and converting those into C character literals. This allows us to do things like writeChar '\n' at the expense of not being able to do writeStr '\n'. I think that's fine and we can defer further nuance to the self-hosting compiler.

Here is what the string processing routine looks like:

static int
string(void)
{
	char *p;
	size_t i, len, mod = 0;

	p = ++raw;

restart:
	while (*raw != '\'') {
		if (*raw == '\n' || *raw == '\0')
			error("unterminated string");
		if (*raw++ == '"')
			++mod;
	}
	if (*++raw == '\'') {
		++raw;
		goto restart;
	}

	--raw;

	len = raw - p + mod;

	if (len < 1)
		error("impossible string");

	free(token);

	if ((token = malloc(len + 3)) == NULL)
		error("malloc failed");

	token[0] = '"';
	for (i = 1; i <= len; i++) {
		if (*p == '\'') {
			token[i++] = '\\';
			++p;
		} else if (*p == '"') {
			token[i++] = '\\';
		}
		token[i] = *p++;
	}
	token[i++] = '"';
	token[i] = '\0';

	if (len == 1 || (len == 2 && token[1] == '\\')) {
		token[0] = '\'';
		if (len == 1)
			token[2] = '\'';
		else
			token[3] = '\'';

		return TOK_NUMBER;
	}

	return TOK_STRING;
}

That's a lot. And I definitely overcomplicated things by turning single-character strings into character literals. Some notes:

  1. If we have a double quote character, ", we need to elongate the length of the C string by 1 because in a C string, the double quote character needs to be escaped, \", whereas in Pascal strings it does not.
  2. We need to check for two single quotes in a row, '', and not terminate string processing if we do.

Now, in the parser, specifically in statement processing, we need to teach the writeStr statement to do the right thing if its next token is a TOK_STRING:

	case TOK_WRITESTR:
		expect(TOK_WRITESTR);
		if (type == TOK_IDENT || type == TOK_STRING) {
			if (type == TOK_IDENT)
				symcheck(CHECK_LHS);
			cg_writestr();

			if (type == TOK_IDENT)
				expect(TOK_IDENT);
			else
				expect(TOK_STRING);
		} else {
			error("writeStr takes an array or a string");
		}

		break;

Lastly, we need to teach the code generator to handle string literals correctly:

static void
cg_writestr(void)
{
	struct symtab *curr, *ret;

	if (type == TOK_IDENT) {
		curr = head;
		while (curr != NULL) {
			if (!strcmp(curr->name, token))
				ret = curr;
			curr = curr->next;
		}

		if (ret == NULL)
			error("undefined symbol: %s", token);

		if (ret->size == 0)
			error("writeStr requires an array");

		aout("__writestridx = 0;\n");
		aout("while(%s[__writestridx]!='\\0'&&__writestridx<%ld)\n",
		    token, ret->size);
		aout("(void)fputc((unsigned char)%s[__writestridx++],stdout);\n",
		    token);
	} else {
		aout("(void)fprintf(stdout, %s);\n", token);
	}
}

Forward references

In our recursive descent parser, we needed a forward declaration in order to use expression() in factor(). We are likely going to have the same general structure for our recursive descent parser written in PL/0 in our self-hosting compiler, so let's take the time now to support forward declarations. As always, we'll start with #define TOK_FORWARD 'F' to add a new token type for the forward reserved word. Next, we need to teach ident() about this new reserved word.

Once we've done that, we need to figure out where in our grammar forward references should live. I am going to choose to have a new section, between the variable section and the procedure section, that will be the forward declaration section. Forward declaration syntax will look like this:

forward proc1;
forward proc2;

That is to say, each forward declaration will be a procedure name preceeded by the forward reserved word and followed by a semicolon. No forward proc1, proc2; syntax will be allowed. In code, we will put this in the parser in block() in between the variable declaration section and the procedure section. It looks like this:

	while (type == TOK_FORWARD) {
		expect(TOK_FORWARD);
		if (type == TOK_IDENT) {
			addsymbol(TOK_FORWARD);
			cg_forward();
		}
		expect(TOK_IDENT);
		expect(TOK_SEMICOLON);
	}

Now we need to figure out what to do with these forward declarations so that they are usable to us. Very clearly, we are adding the symbol to the symbol table, as we have addsymbol(TOK_FORWARD) in our forward declaration section code. But! We need to do something different than we have done to this point. While we are in the forward delcaration section, we will simply record forward declarations as identifier names of type TOK_FORWARD. The different thing we need to do happens when we later reach the actual procedure with this name. Because as things stand now, the compiler will report an error since it now will have two symbols of the same name at the same depth level, 0. So we will employ a little trick to deal with this: if the token type of the symbol we are currently adding is a TOK_PROCEDURE and the current symbol we're checking in the symbol table has the same name but is of type TOK_FORWARD, then we will ignore the multiple definitions and just keep going. In code, that looks like this:

static void
addsymbol(int type)
{
	struct symtab *curr, *new;

	curr = head;
	while (1) {
		if (!strcmp(curr->name, token)) {
			if (curr->depth == (depth - 1)) {
				if (curr->type == TOK_FORWARD &&
				    type == TOK_PROCEDURE) {
					/* Don't error out!  */
				} else {
					error("duplicate symbol: %s", token);
				}
			}
		}

		if (curr->next == NULL)
			break;

		curr = curr->next;
	}

	if ((new = malloc(sizeof(struct symtab))) == NULL)
		error("malloc failed");

	new->depth = depth - 1;
	new->type = type;
	new->size = 0;
	if ((new->name = strdup(token)) == NULL)
		error("malloc failed");
	new->next = NULL;

	curr->next = new;
}

So yes, the forward references will stick around in the symbol table for the whole of the compilation process, but that's fine with me.

Now, to be able to reference forward declared procedures before you reach the procedure itself, we modify symcheck and change every place with TOK_PROCEDURE with TOK_PROCEDURE || TOK_FORWARD:

static void
symcheck(int check)
{
	struct symtab *curr, *ret = NULL;

	curr = head;
	while (curr != NULL) {
		if (!strcmp(token, curr->name))
			ret = curr;
		curr = curr->next;
	}

	if (ret == NULL)
		error("undefined symbol: %s", token);

	switch (check) {
	case CHECK_LHS:
		if (ret->type != TOK_VAR)
			error("must be a variable: %s", token);
		break;
	case CHECK_RHS:
		if (ret->type == TOK_PROCEDURE || ret->type == TOK_FORWARD)
			error("must not be a procedure: %s", token);
		break;
	case CHECK_CALL:
		if (ret->type != TOK_PROCEDURE && ret->type != TOK_FORWARD)
			error("must be a procedure: %s", token);
	}
}

And finally, we need to add a new function to the code generator to output a C translation of forward declarations:

static void
cg_forward(void)
{

	aout("static void %s(void);", token);
}

Conclusion

Modulo inevitable bugs and desires of others to include this or that extra item, this compiler is finished. We have enough here to write a self-hosting PL/0 compiler. And for those who choose not to join us on that journey, you have a good enough compiler to write significant programs (at least, I'm considering a compiler a significant program and I know you can write one of those with this compiler). And if you won't be joining us on the self-hosting compiler journey, you are in a good position to add all the additional features to this compiler that you'd like.

If at any time you'd like to review our finished compiler, you may do so here.

Thanks for coming with me on this journey. I hope you've come to see that compilers do not have to be scary black boxes. Like everything else, the work can be broken down into small, discrete units of work that can be tackled one by one, slowly building up to a completed compiler. And we saw that some pieces of the compiler, taken standalone, can be used to build useful utilities of their own: such as a lexer being the central piece of a code formatter and the front end as a whole being able to be used as a code validator.

There are still plenty of unique challenges for us ahead. It is time to take the lessons learned in this series and turn our attention to a new series writing a self-hosting compiler.

Top

RSS