Brian Robert Callahan

academic, developer, with an eye towards a brighter techno-social life



[prev]
[next]

2021-08-22
Let's write a compiler, part 7: Arrays

All source code for this blog post can be found here.

Arrays are an important feature for our language and compiler. Not defined in the original PL/0 specification but later added in its successor language, Oberon-0, arrays allow us to carve out pre-defined regions of memory of any arbitrary size to be used as we see fit. Because the original PL/0 language has no concept of arrays, we are free to define our own syntax for them.

These are our challenges for today: choosing a syntax for arrays, implementing that syntax, and allowing array identifiers to be used just the same as the TOK_VAR identifiers we already have.

A syntax for declaring and referencing arrays

We can make up any syntax we want for arrays. In C, arrays are declared with the int array[size]; syntax, where size is some value that makes sense. In Pascal, array declarations are slightly more complex. Both use an array[expression] syntax for referencing arrays. I like that, let's keep it. I'll also keep C-style arrays: that is, the first index will be 0 and the last index will be size - 1. I don't think that's ideal. I think the traditional Pascal index numbering beginning at 1 "fits" better with PL/0 but it's needless extra work in our code generator that, if we get it wrong, could have baffling runtime consequences.

So that leaves us with declaration syntax. I am going to pick something very different that might upset some people: var array size <number>, where number is some number greater than zero. I want to keep arrays declared in the variable section, to allow mixing and matching arrays and regular variables. I also wanted to do something different for declaration syntax. Arrays will have sizes, regular variables won't. So for example, var a size 10, b; will give you a 10-cell array named a and a regular variable named b. You can reference the cells using a[0] ... a[9]. We'll have to make sure the size is greater than zero.

Teaching our lexer some new token

First thing we'll need to do as usual is to teach the lexer about some new tokens. We'll need three: #define TOK_SIZE 's', #define TOK_LBRACK '[', and #define TOK_RBRACK ']'. I trust that this point you know how to add a check for the reserved word size in ident(). The brackets should just return their characters like all the other single-character tokens.

Teaching our parser about arrays

Now we need to have the parser recognize arrays. Both declarations and references. For declarations, we will need to modify the variable section portion of the block() function. After the identifier, if there is a size reserved word immediately after, we are dealing with an array. We need to consume the size token and then we had better have a number immediately after that. We'll need to do some sanity checking that the size is at least 1. And we can stub out the code generator portion for now as well. It looks like this:

	if (type == TOK_VAR) {
		expect(TOK_VAR);
		if (type == TOK_IDENT) {
			addsymbol(TOK_VAR);
			cg_var();
		}
		expect(TOK_IDENT);
		if (type == TOK_SIZE) {
			expect(TOK_SIZE);
			if (type == TOK_NUMBER) {
				arraysize();
				cg_array();
			}
			expect(TOK_NUMBER);
		}
		cg_semicolon();
		while (type == TOK_COMMA) {
			expect(TOK_COMMA);
			if (type == TOK_IDENT) {
				addsymbol(TOK_VAR);
				cg_var();
			}
			expect(TOK_IDENT);
			if (type == TOK_SIZE) {
				expect(TOK_SIZE);
				if (type == TOK_NUMBER) {
					arraysize();
					cg_array();
					expect(TOK_NUMBER);
				}
			}
			cg_semicolon();
		}
		expect(TOK_SEMICOLON);
		cg_crlf();
	}

For array references, we'll need to modify a couple of places. First is the left-hand side of assignment expressions. That lives in statement() and we'll modify the TOK_IDENT part to handle the possibility of arrays:

	case TOK_IDENT:
		symcheck(CHECK_LHS);
		cg_symbol();
		expect(TOK_IDENT);
		if (type == TOK_LBRACK) {
			arraycheck();
			cg_symbol();
			expect(TOK_LBRACK);
			expression();
			if (type == TOK_RBRACK)
				cg_symbol();
			expect(TOK_RBRACK);
		}
		if (type == TOK_ASSIGN)
			cg_symbol();
		expect(TOK_ASSIGN);
		expression();
		break;

For the right-hand side of assignment expressions, that lives in factor() and is exactly the same as we did in statement():

	case TOK_IDENT:
		symcheck(CHECK_RHS);
		cg_symbol();
		expect(TOK_IDENT);
		if (type == TOK_LBRACK) {
			arraycheck();
			cg_symbol();
			expect(TOK_LBRACK);
			expression();
			if (type == TOK_RBRACK)
				cg_symbol();
			expect(TOK_RBRACK);
		}
		break;

All we need to do for both is check if there is a left bracket after an identifier. If there is, we have an array. If not, it's a regular variable.

That's it for syntax modifications. Next we should turn to semantic checking of arrays, since they have some interesting challenges.

Semantic checking for arrays

You may have noticed some new non-code generator functions pop up in our new code. These functions handle semantic checking for arrays. There are two things we need to do: during declaration we need to record the size of the array. During reference, we need to make sure an array makes sense where it appears.

Let's handle them in that order. First then will be arraysize() which records the size of the array you've declared.

We need to do the following: go to the very end of the symbol table, since we have already added the symbol table entry at this point. If somehow this isn't a TOK_VAR, error out. Then, sanity check the number we've just read in from the source code. If it's valid, set the size member of the symbol table entry to that value.

We'll need to expand what a symbol table entry is a little:

struct symtab {
	int depth;
	int type;
	long size;
	char *name;
	struct symtab *next;
};

We've added the size member. Everything that isn't an array must have a size of 0.

Now then, we can see the new arraysize() function:

static void
arraysize(void)
{
	struct symtab *curr;
	const char *errstr;

	curr = head;
	while (curr->next != NULL)
		curr = curr->next;

	if (curr->type != TOK_VAR)
		error("arrays must be declared with \"var\"");

	curr->size = strtonum(token, 1, LONG_MAX, &errstr);
	if (errstr != NULL)
		error("invalid array size");
}

Now we can move on to the arraycheck() function, which makes sure that the identifier referenced is actually an array (as opposed to a regular variable). Iterate through the symbol table, that the bottom-most entry with the symbol name you have, and check to see if the size for that entry isn't 0:

static void
arraycheck(void)
{
	struct symtab *curr, *ret = NULL;

	curr = head;
	while (curr != NULL) {
		if (!strcmp(token, curr->name))
			ret = curr;
		curr = curr->next;
	}

	if (ret == NULL)
		error("undefined symbol: %s", token);

	if (ret->size == 0)
		error("symbol %s is not an array", token);
}

Amazingly, that's all we need for semantic checking. All that's left is code generation.

Teaching the code generator about arrays

Like everything else so far, we have two new things to teach the code generator: how to generate code for an array declaration and how to generate code for an array reference. Array declaration is easy: wrap the number that comes after the size reserved word in square brackets. This is done immediately after the arraysize() check passes:

static void
cg_array(void)
{

	aout("[%s]", token);
}

Array references were already completed when we taught the parser about left-hand side and right-hand side array references. All we need to do is write out a left bracket, write out an expression which is already handled by expression() and finally write out a right bracket.

Writing arrays with writeStr

We can also add a new statement, a writeStr statement, to write out arrays without needing the tedious manual iterating over the array. At least not in PL/0 code, anyway. We'll have the code generator (and, therefore, the C compiler) deal with the tedium. Eventually we will implement string literals as well. Add TOK_WRITESTR and TOK_STRING lexer token types, then we can tack on this new case statement in statement():

	case TOK_WRITESTR:
		expect(TOK_WRITESTR);
		if (type == TOK_IDENT || type == TOK_STRING) {
			if (type == TOK_IDENT)
				symcheck(CHECK_LHS);
			cg_writestr();

			if (type == TOK_IDENT)
				expect(TOK_IDENT);
			else
				expect(TOK_STRING);
		} else {
			error("writeStr takes an array or a string");
		}

		break;

That also requires a new cg_writestr() code generator function:

static void
cg_writestr(void)
{
	struct symtab *curr, *ret;

	if (type == TOK_IDENT) {
		curr = head;
		while (curr != NULL) {
			if (!strcmp(curr->name, token))
				ret = curr;
			curr = curr->next;
		}

		if (ret == NULL)
			error("undefined symbol: %s", token);

		if (ret->size == 0)
			error("writeStr requires an array");

		aout("__writestridx = 0;\n");
		aout("while(%s[__writestridx]!='\\0'&&__writestridx<%ld)\n",
		    token, ret->size);
		aout("(void)fputc((unsigned char)%s[__writestridx++],stdout);\n",
		    token);
	} else {
		aout("(void)fprintf(stdout, %s);\n", token);
	}
}

This one's a little tricky because our PL/0 arrays are not C strings. Every cell in every array is a long, since everything in our PL/0 dialect is a long. After checking that we have an array, we iterate over the array until we hit a NUL-byte and print everything we find along the way. We can also leverage the fact that we know how big the array is because we store that information in the symbol table. We will prevent out-of-bounds writes by adding a new global variable, __writestridx and using it as a size counter against the array size. If the index variable goes above the size of the array, we also stop writing. You'll need to add this __writestridx variable in cg_init() alongside __stdin[24] and *__errstr.

Efficiency and "wasted space"

Because everything in our PL/0 dialect is a long, that means that storing strings uses a whole lot more memory than is strictly necessary. I'm not bothered by that, and you shouldn't be either. Like we've been saying, we are valuing correctness over everything else. What we have here will perform the correct behavior even it takes more memory to do so. That's our goal in this series: writing a compiler that does the correct thing. If afterwards you decide to implement a type system, enjoy. The space isn't wasted if we're learning and improving our skills, and thinking about what the next steps towards improvement are.

On the next episode: Strings, forward references, and conclusion

The final features I want to add to our language and compiler is string handling and the ability to forward-reference procedures, to allow for mutually recursive procedures. After that, we'll have a featureful enough language and compiler to be able to write a self-hosted PL/0 compiler using our current compiler as the bootstrap compiler.

And for those of you not coming with us on the self-hosted compiler adventure, I'd like to leave you with a rich enough compiler for you to write complex programs.

By the way, for those of you following with the GitHub repository, you'll notice that a fully self-hosting compiler is already there. Once this series is finished, we'll start the writing a self-hosting compiler series.

To see where all the code we wrote today fits into our current compiler, I invite you to check out original.c in the repo as well. It's getting a bit too unwieldy to keep copying and pasting the full source code in the blog.

Looking forward to one last episode in this first series, and looking forward to having a complete PL/0 compiler!

Top

RSS