Brian Robert Callahan

academic, developer, with an eye towards a brighter techno-social life



[prev]
[next]

2021-08-16
Let's write a compiler, part 3: A parser

All source code for this blog post can be found here.

Tokenizing the source code input is a great start for our compiler. Now let's turn our attention to parsing the PL/0 grammar. Once we have finished with the parser, the front end of our compiler will be finished and we will have a PL/0 validator. This is a program that can read in PL/0 source code and inform the user whether or not it is valid PL/0 code.

There are many strategies available for writing a parser. And lexers too, for that matter. While we wrote a lexer by hand because it's a useful thing to do at least once, you could have used the flex tool instead to generate a lexer for you. Much the same, for a simple grammar like that found in PL/0, you could use a tool such as yacc to take the PL/0 grammar and output a parser for us.

But since the point of this series is to learn, we will write a hand-crafted recursive-descent parser. Among other features, a recursive-descent parser is a top-down parser. That means we will start with the very first token in the PL/0 source code and continue in order until we do something meaningful with the last token in the PL/0 source code.

Some helper functions

Before we get to writing the parser itself, we are going to need two helper functions, which I am going to name next() and expect(). The next() function is how the parser will control the lexer, as that is how the parser can tell the lexer to fetch the next token from the PL/0 source code and record all of the useful information for that token: the token type and, if needed, the token string. This is exactly what our lexer is already doing, so we don't need to make any changes to the lexer at all. We just need to hook it up to the parser.

The expect() function is how we will enforce proper syntax. The function will take one parameter, a token type, and it will check that the currently recorded token type matches the token type we expect to have at this point in the grammar. If it does not, we will report a syntax error through our already existing error mechanism. If it does match, have the lexer fetch the next token.

The code for these two helper functions might look like this:

/*
 * Parser.
 */

static void
next(void)
{

	type = lex();
	++raw;
}

static void
expect(int match)
{

	if (match != type)
		error("syntax error");

	next();
}

Reading the PL/0 grammar

We will need to be able to read the Extended Backus-Naur Form (EBNF) style grammar that is that long grammar comment in our compiler source code in order to be able to write our parser. Without going into lots of formal theory or detail, let's keep in mind the following rules for reading the grammar: