Brian Robert Callahan
academic, developer, with an eye towards a brighter techno-social life
All source code for this blog post can be found here.
Tokenizing the source code input is a great start for our compiler. Now let's turn our attention to parsing the PL/0 grammar. Once we have finished with the parser, the front end of our compiler will be finished and we will have a PL/0 validator. This is a program that can read in PL/0 source code and inform the user whether or not it is valid PL/0 code.
There are many strategies available for writing a parser, and lexers too, for that matter. While we wrote a lexer by hand because it's a useful thing to do at least once, you could have used the flex tool instead to generate a lexer for you. Much the same, for a simple grammar like the one found in PL/0, you could use a tool such as yacc to take the PL/0 grammar and output a parser for you.
But since the point of this series is to learn, we will write a hand-crafted recursive-descent parser. Among other features, a recursive-descent parser is a top-down parser. That means we will start with the very first token in the PL/0 source code and continue in order until we do something meaningful with the last token in the PL/0 source code.
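To make that concrete before we tackle PL/0 itself, here is a tiny, self-contained sketch of a recursive-descent parser for a made-up one-rule grammar, expr = "a" | "(" expr ")". It is not part of our compiler; it only illustrates the shape of the technique: the rule becomes a function that starts with the first token, consumes tokens from left to right, and calls itself whenever the grammar refers back to the rule.

#include <stdio.h>
#include <stdlib.h>

/*
 * Toy grammar (not PL/0): expr = "a" | "(" expr ")" .
 * One character is one token, so "((a))" is valid input.
 */

static const char *input;

static void
fail(void)
{
    (void) fprintf(stderr, "syntax error\n");
    exit(1);
}

static void
expr(void)
{
    if (*input == 'a') {
        ++input;            /* consume the "a" */
    } else if (*input == '(') {
        ++input;            /* consume the "(" */
        expr();             /* descend into the nested expr */
        if (*input != ')')
            fail();
        ++input;            /* consume the ")" */
    } else {
        fail();
    }
}

int
main(void)
{
    input = "((a))";
    expr();                 /* start with the top rule and descend */
    if (*input != '\0')     /* nothing may follow a complete expr */
        fail();
    puts("valid");
    return 0;
}

Our parser will work the same way, only with more rules and with a real lexer handing it tokens instead of a string of single-character tokens.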
Before we get to writing the parser itself, we are going to need two helper functions, which I am going to name next() and expect(). The next() function is how the parser controls the lexer: it tells the lexer to fetch the next token from the PL/0 source code and record all of the useful information for that token, namely the token type and, if needed, the token string. This is exactly what our lexer is already doing, so we don't need to make any changes to the lexer at all. We just need to hook it up to the parser.
The expect() function is how we will enforce proper syntax. The function takes one parameter, a token type, and checks that the currently recorded token type matches the token type we expect to have at this point in the grammar. If it does not, we report a syntax error through our already existing error mechanism. If it does match, we have the lexer fetch the next token.
The code for these two helper functions might look like this:
/*
 * Parser.
 */

static void
next(void)
{
    type = lex();
    ++raw;
}

static void
expect(int match)
{
    if (match != type)
        error("syntax error");

    next();
}
We will need to be able to read the Extended Backus-Naur Form (EBNF)-style grammar found in the long grammar comment in our compiler source code in order to write our parser. Without going into lots of formal theory or detail, let's keep in mind the following rules for reading the grammar:
- Each rule name on the left side of an = sign will become a function in our parser.
- A | means "or." For example, a factor can be an identifier or a number or an expression wrapped in parentheses.
- Anything in double quotes is a literal; a reserved word such as "const" becomes a TOK_CONST token.
- An identifier begins with an A-Za-z_ character and is followed by zero or more A-Za-z0-9_ characters. A number begins with a 0-9 character and contains one or more 0-9 characters, with optional _ characters for human readability. The reserved words are all lowercase.

Let's start at the top. The very first rule in our grammar is program. The program rule is defined to be a block rule followed by a literal dot. Let's write that out in code:
static void
parse(void)
{
    next();
    block();
    expect(TOK_DOT);

    if (type != 0)
        error("extra tokens at end of file");
}
OK, so I slightly violated my rule of naming the function after the rule. This will be the only time I do this. This is because parse() was the name of the function that held the lexer printing logic from the previous entry, and I've simply replaced that logic with this new logic that begins the compilation process.
Let's see what we have here. The first thing we do is call next(), because if we do not, the lexer will not fetch a token for us and we won't have anything to work with. Next we have what looks extremely similar to the rule itself: a block function followed by us enforcing a literal dot. Because the expect() function fetches the next token automatically, we can also check that no other tokens exist after that literal dot. If there are, that's an error, because you have code after the point at which the grammar says you cannot have any more code. You can have whitespace and comments after the final literal dot, because our lexer does not return on comments, as comments are not tokens. The lexer will instead loop again, reach the end of the input, and return 0, signifying the end of the input. And so we are OK for this special case of having comments after the final dot.
Now all we have to do is descend into the functions that are as yet undefined. At the moment there is only one: the block() function, which just so happens to correspond to the next rule in the PL/0 grammar, the block rule.
The block rule has several optional sections. The first is the const section. As it is wrapped in square brackets, you can have either exactly zero or exactly one const section. We can turn this into code by following the same rules we did for writing the previous function, except we will wrap everything inside an if statement that checks whether the first token is a TOK_CONST token. If it is, we execute the body of the if for exactly one const section. If it is not, we skip it entirely for exactly zero const sections.
This opening part of the block() function might look like this:
static void
block(void)
{
    if (type == TOK_CONST) {
        expect(TOK_CONST);
        expect(TOK_IDENT);
        expect(TOK_EQUAL);
        expect(TOK_NUMBER);
        while (type == TOK_COMMA) {
            expect(TOK_COMMA);
            expect(TOK_IDENT);
            expect(TOK_EQUAL);
            expect(TOK_NUMBER);
        }
        expect(TOK_SEMICOLON);
    }
You might have already figured out how to deal with the parts of the grammar wrapped in curly braces just from reading the code above. Curly braces mean zero or more, with no upper limit. Handling curly braces is therefore exactly the same as handling square brackets, except the if for square brackets becomes a while for curly braces, which allows looping and therefore zero or any number of iterations.
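If it helps to see the translation scheme in one place, here is a small self-contained sketch of both mappings, using made-up TOK_X and TOK_Y tokens and a stub stand-in for the lexer rather than anything from pl0c:

#include <stdio.h>
#include <stdlib.h>

/* Made-up tokens for this sketch only. */
#define TOK_X 'x'
#define TOK_Y 'y'

static const char *stream = "xyxyxy";   /* pretend token stream */
static int type;

static void
next(void)
{
    if (*stream == '\0')
        type = 0;       /* end of input */
    else
        type = *stream++;
}

static void
expect(int match)
{
    if (match != type) {
        (void) fprintf(stderr, "syntax error\n");
        exit(1);
    }
    next();
}

int
main(void)
{
    next();

    /* [ "x" "y" ] : zero or exactly one occurrence becomes an if. */
    if (type == TOK_X) {
        expect(TOK_X);
        expect(TOK_Y);
    }

    /* { "x" "y" } : zero or more occurrences becomes a while. */
    while (type == TOK_X) {
        expect(TOK_X);
        expect(TOK_Y);
    }

    return 0;
}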
We can follow these rules to finish the block() function:
static void
block(void)
{
    if (type == TOK_CONST) {
        expect(TOK_CONST);
        expect(TOK_IDENT);
        expect(TOK_EQUAL);
        expect(TOK_NUMBER);
        while (type == TOK_COMMA) {
            expect(TOK_COMMA);
            expect(TOK_IDENT);
            expect(TOK_EQUAL);
            expect(TOK_NUMBER);
        }
        expect(TOK_SEMICOLON);
    }

    if (type == TOK_VAR) {
        expect(TOK_VAR);
        expect(TOK_IDENT);
        while (type == TOK_COMMA) {
            expect(TOK_COMMA);
            expect(TOK_IDENT);
        }
        expect(TOK_SEMICOLON);
    }

    while (type == TOK_PROCEDURE) {
        expect(TOK_PROCEDURE);
        expect(TOK_IDENT);
        expect(TOK_SEMICOLON);

        block();

        expect(TOK_SEMICOLON);
    }

    statement();
}
Both the const and var sections are in square brackets, while the procedure section is in curly braces. The statement section is not inside anything, so it is not optional. You might also notice that a procedure is little more than a named block, and it recursively calls block() to do its work. This means that, as far as the grammar is concerned, you can have infinitely nested procedures. That's a common feature of Pascal-like languages. But I am going to do a little bit of future planning here and forbid nested procedures. You may have as many procedures as you want; they just can't be nested inside each other. If I remember correctly, Pascal compilers enforced a limit on how deep nested procedures could go (though I'm not sure if any banned nested procedures outright; someone on Hacker News will let us know, I'm sure).
We will not alter the grammar to forbid nested procedures, however. I want us to implement a parser for the PL/0 grammar exactly as the grammar is written. We will instead use some metadata to enforce the rule. Let's create a new global variable named depth; we will increment the depth level at the beginning of each block and decrement it at the end of each block. We can use this to check that we don't have nested procedures.
That makes our finished block() function look like this:
static void
block(void)
{
    if (depth++ > 1)
        error("nesting depth exceeded");

    if (type == TOK_CONST) {
        expect(TOK_CONST);
        expect(TOK_IDENT);
        expect(TOK_EQUAL);
        expect(TOK_NUMBER);
        while (type == TOK_COMMA) {
            expect(TOK_COMMA);
            expect(TOK_IDENT);
            expect(TOK_EQUAL);
            expect(TOK_NUMBER);
        }
        expect(TOK_SEMICOLON);
    }

    if (type == TOK_VAR) {
        expect(TOK_VAR);
        expect(TOK_IDENT);
        while (type == TOK_COMMA) {
            expect(TOK_COMMA);
            expect(TOK_IDENT);
        }
        expect(TOK_SEMICOLON);
    }

    while (type == TOK_PROCEDURE) {
        expect(TOK_PROCEDURE);
        expect(TOK_IDENT);
        expect(TOK_SEMICOLON);

        block();

        expect(TOK_SEMICOLON);
    }

    statement();

    if (--depth < 0)
        error("nesting depth fell below 0");
}
Since depth starts at 0, the outermost block raises it to 1 and a procedure's block raises it to 2; a procedure nested inside another procedure would try to push it to 3, tripping the depth++ > 1 check and reporting an error. Additionally, you might notice that while procedures have names, they do not accept any parameters nor do they return any values. They are equivalent to a void function in C that accepts no parameters, i.e., void procedure(void). Keep that in mind when you are writing PL/0 code to give to our PL/0 compiler.
Are there any functions in the block() function that we have not yet implemented? We are going to keep asking this question until every grammar rule has its own function. The answer is yes, just one: statement(). Lucky us, the statement rule is the next rule in the grammar.
The statement rule has a number of possibilities. Each possibility is separated by a |, which means "or." To figure out what to do, we must look at the first token in each possibility. What we are looking for is that every possibility begins with a different token. If that were not the case, we might have to rewrite the affected possibilities until we reached a point where each possibility begins with a different token. We are lucky, or possibly Dr. Wirth did it by design (it was done by design), because each possibility begins with a different token. That lends itself well to a switch statement. Handling | here means a switch on the first token, with each possibility being its own case.
That might look like this:
static void
statement(void)
{
    switch (type) {
    case TOK_IDENT:
        expect(TOK_IDENT);
        expect(TOK_ASSIGN);
        expression();
        break;
    case TOK_CALL:
        expect(TOK_CALL);
        expect(TOK_IDENT);
        break;
    case TOK_BEGIN:
        expect(TOK_BEGIN);
        statement();
        while (type == TOK_SEMICOLON) {
            expect(TOK_SEMICOLON);
            statement();
        }
        expect(TOK_END);
        break;
    case TOK_IF:
        expect(TOK_IF);
        condition();
        expect(TOK_THEN);
        statement();
        break;
    case TOK_WHILE:
        expect(TOK_WHILE);
        condition();
        expect(TOK_DO);
        statement();
    }
}
All we need to do is follow the logic we already learned implementing the program and block rules to implement the statement rule, with each possibility living in its own case.
Note that there is no default case. This is to leave open the possibility of the empty statement: if none of the cases match, statement() simply returns without consuming a token, which is what allows, for example, begin end. to be a valid (if useless) PL/0 program.
Let's ask the question again: are there any undefined functions in our grammar? Yes, there are two. Both of them were first used in the statement() function: expression() and condition(). And sure enough, there is a grammar rule named expression and a grammar rule named condition. The condition rule is next in the grammar, so let's do that one first.
The condition rule has two possibilities and, just like the statement rule, each possibility begins with a different token. Since there are only two possibilities, a switch statement is unnecessary. Also, because the expression rule has many different possible starting tokens, though none of them is a TOK_ODD token, we can simply check for a TOK_ODD token; if we don't see one, the condition must begin with an expression.
That might look like this:
static void
condition(void)
{
    if (type == TOK_ODD) {
        expect(TOK_ODD);
        expression();
    } else {
        expression();

        switch (type) {
        case TOK_EQUAL:
        case TOK_HASH:
        case TOK_LESSTHAN:
        case TOK_GREATERTHAN:
            next();
            break;
        default:
            error("invalid conditional");
        }

        expression();
    }
}
No new functions here, just the expression() function that we have yet to implement. So let's implement it. Implementing the expression rule will require writing the term and factor rules as well, which are all the rules we have left. Note that this nesting of rules, with expression calling term and term calling factor, is what gives multiplication and division higher precedence than addition and subtraction. Let's finish the job:
static void
factor(void)
{
    switch (type) {
    case TOK_IDENT:
    case TOK_NUMBER:
        next();
        break;
    case TOK_LPAREN:
        expect(TOK_LPAREN);
        expression();
        expect(TOK_RPAREN);
    }
}

static void
term(void)
{
    factor();

    while (type == TOK_MULTIPLY || type == TOK_DIVIDE) {
        next();
        factor();
    }
}

static void
expression(void)
{
    if (type == TOK_PLUS || type == TOK_MINUS)
        next();

    term();

    while (type == TOK_PLUS || type == TOK_MINUS) {
        next();
        term();
    }
}
There are no undefined functions to implement in our code, and all the grammar rules have a corresponding function in our compiler, so our parser is finished. And now with both our lexer and our parser finished, we have a completed front end for our PL/0 compiler. We should be able to use our front end as a validator, as it should accept all valid PL/0 code and reject all invalid PL/0 code. Here is the complete front end so far:
#include <sys/stat.h>

#include <ctype.h>
#include <fcntl.h>
#include <limits.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define TOK_IDENT 'I'
#define TOK_NUMBER 'N'
#define TOK_CONST 'C'
#define TOK_VAR 'V'
#define TOK_PROCEDURE 'P'
#define TOK_CALL 'c'
#define TOK_BEGIN 'B'
#define TOK_END 'E'
#define TOK_IF 'i'
#define TOK_THEN 'T'
#define TOK_WHILE 'W'
#define TOK_DO 'D'
#define TOK_ODD 'O'
#define TOK_DOT '.'
#define TOK_EQUAL '='
#define TOK_COMMA ','
#define TOK_SEMICOLON ';'
#define TOK_ASSIGN ':'
#define TOK_HASH '#'
#define TOK_LESSTHAN '<'
#define TOK_GREATERTHAN '>'
#define TOK_PLUS '+'
#define TOK_MINUS '-'
#define TOK_MULTIPLY '*'
#define TOK_DIVIDE '/'
#define TOK_LPAREN '('
#define TOK_RPAREN ')'

/*
 * pl0c -- PL/0 compiler.
 *
 * program = block "." .
 * block = [ "const" ident "=" number { "," ident "=" number } ";" ]
 *         [ "var" ident { "," ident } ";" ]
 *         { "procedure" ident ";" block ";" } statement .
 * statement = [ ident ":=" expression
 *     | "call" ident
 *     | "begin" statement { ";" statement } "end"
 *     | "if" condition "then" statement
 *     | "while" condition "do" statement ] .
 * condition = "odd" expression
 *     | expression ( "=" | "#" | "<" | ">" ) expression .
 * expression = [ "+" | "-" ] term { ( "+" | "-" ) term } .
 * term = factor { ( "*" | "/" ) factor } .
 * factor = ident
 *     | number
 *     | "(" expression ")" .
 */

static char *raw, *token;
static int depth, type;
static size_t line = 1;

static void expression(void);

/*
 * Misc. functions.
 */

static void
error(const char *fmt, ...)
{
    va_list ap;

    (void) fprintf(stderr, "pl0c: error: %lu: ", line);

    va_start(ap, fmt);
    (void) vfprintf(stderr, fmt, ap);
    va_end(ap);

    (void) fputc('\n', stderr);

    exit(1);
}

static void
readin(char *file)
{
    int fd;
    struct stat st;

    if (strrchr(file, '.') == NULL)
        error("file must end in '.pl0'");

    if (!!strcmp(strrchr(file, '.'), ".pl0"))
        error("file must end in '.pl0'");

    if ((fd = open(file, O_RDONLY)) == -1)
        error("couldn't open %s", file);

    if (fstat(fd, &st) == -1)
        error("couldn't get file size");

    if ((raw = malloc(st.st_size + 1)) == NULL)
        error("malloc failed");

    if (read(fd, raw, st.st_size) != st.st_size)
        error("couldn't read %s", file);
    raw[st.st_size] = '\0';

    (void) close(fd);
}

/*
 * Lexer.
 */

static void
comment(void)
{
    int ch;

    while ((ch = *raw++) != '}') {
        if (ch == '\0')
            error("unterminated comment");
        if (ch == '\n')
            ++line;
    }
}

static int
ident(void)
{
    char *p;
    size_t i, len;

    p = raw;
    while (isalnum(*raw) || *raw == '_')
        ++raw;

    len = raw - p;

    --raw;

    free(token);

    if ((token = malloc(len + 1)) == NULL)
        error("malloc failed");

    for (i = 0; i < len; i++)
        token[i] = *p++;
    token[i] = '\0';

    if (!strcmp(token, "const"))
        return TOK_CONST;
    else if (!strcmp(token, "var"))
        return TOK_VAR;
    else if (!strcmp(token, "procedure"))
        return TOK_PROCEDURE;
    else if (!strcmp(token, "call"))
        return TOK_CALL;
    else if (!strcmp(token, "begin"))
        return TOK_BEGIN;
    else if (!strcmp(token, "end"))
        return TOK_END;
    else if (!strcmp(token, "if"))
        return TOK_IF;
    else if (!strcmp(token, "then"))
        return TOK_THEN;
    else if (!strcmp(token, "while"))
        return TOK_WHILE;
    else if (!strcmp(token, "do"))
        return TOK_DO;
    else if (!strcmp(token, "odd"))
        return TOK_ODD;

    return TOK_IDENT;
}

static int
number(void)
{
    const char *errstr;
    char *p;
    size_t i, j = 0, len;

    p = raw;
    while (isdigit(*raw) || *raw == '_')
        ++raw;

    len = raw - p;

    --raw;

    free(token);

    if ((token = malloc(len + 1)) == NULL)
        error("malloc failed");

    for (i = 0; i < len; i++) {
        if (isdigit(*p))
            token[j++] = *p;
        ++p;
    }
    token[j] = '\0';

    (void) strtonum(token, 0, LONG_MAX, &errstr);
    if (errstr != NULL)
        error("invalid number: %s", token);

    return TOK_NUMBER;
}

static int
lex(void)
{

again:
    /* Skip whitespace. */
    while (*raw == ' ' || *raw == '\t' || *raw == '\n') {
        if (*raw++ == '\n')
            ++line;
    }

    if (isalpha(*raw) || *raw == '_')
        return ident();

    if (isdigit(*raw))
        return number();

    switch (*raw) {
    case '{':
        comment();
        goto again;
    case '.':
    case '=':
    case ',':
    case ';':
    case '#':
    case '<':
    case '>':
    case '+':
    case '-':
    case '*':
    case '/':
    case '(':
    case ')':
        return (*raw);
    case ':':
        if (*++raw != '=')
            error("unknown token: ':%c'", *raw);
        return TOK_ASSIGN;
    case '\0':
        return 0;
    default:
        error("unknown token: '%c'", *raw);
    }

    return 0;
}

/*
 * Parser.
 */

static void
next(void)
{
    type = lex();
    ++raw;
}

static void
expect(int match)
{
    if (match != type)
        error("syntax error");

    next();
}

static void
factor(void)
{
    switch (type) {
    case TOK_IDENT:
    case TOK_NUMBER:
        next();
        break;
    case TOK_LPAREN:
        expect(TOK_LPAREN);
        expression();
        expect(TOK_RPAREN);
    }
}

static void
term(void)
{
    factor();

    while (type == TOK_MULTIPLY || type == TOK_DIVIDE) {
        next();
        factor();
    }
}

static void
expression(void)
{
    if (type == TOK_PLUS || type == TOK_MINUS)
        next();

    term();

    while (type == TOK_PLUS || type == TOK_MINUS) {
        next();
        term();
    }
}

static void
condition(void)
{
    if (type == TOK_ODD) {
        expect(TOK_ODD);
        expression();
    } else {
        expression();

        switch (type) {
        case TOK_EQUAL:
        case TOK_HASH:
        case TOK_LESSTHAN:
        case TOK_GREATERTHAN:
            next();
            break;
        default:
            error("invalid conditional");
        }

        expression();
    }
}

static void
statement(void)
{
    switch (type) {
    case TOK_IDENT:
        expect(TOK_IDENT);
        expect(TOK_ASSIGN);
        expression();
        break;
    case TOK_CALL:
        expect(TOK_CALL);
        expect(TOK_IDENT);
        break;
    case TOK_BEGIN:
        expect(TOK_BEGIN);
        statement();
        while (type == TOK_SEMICOLON) {
            expect(TOK_SEMICOLON);
            statement();
        }
        expect(TOK_END);
        break;
    case TOK_IF:
        expect(TOK_IF);
        condition();
        expect(TOK_THEN);
        statement();
        break;
    case TOK_WHILE:
        expect(TOK_WHILE);
        condition();
        expect(TOK_DO);
        statement();
    }
}

static void
block(void)
{
    if (depth++ > 1)
        error("nesting depth exceeded");

    if (type == TOK_CONST) {
        expect(TOK_CONST);
        expect(TOK_IDENT);
        expect(TOK_EQUAL);
        expect(TOK_NUMBER);
        while (type == TOK_COMMA) {
            expect(TOK_COMMA);
            expect(TOK_IDENT);
            expect(TOK_EQUAL);
            expect(TOK_NUMBER);
        }
        expect(TOK_SEMICOLON);
    }

    if (type == TOK_VAR) {
        expect(TOK_VAR);
        expect(TOK_IDENT);
        while (type == TOK_COMMA) {
            expect(TOK_COMMA);
            expect(TOK_IDENT);
        }
        expect(TOK_SEMICOLON);
    }

    while (type == TOK_PROCEDURE) {
        expect(TOK_PROCEDURE);
        expect(TOK_IDENT);
        expect(TOK_SEMICOLON);

        block();

        expect(TOK_SEMICOLON);
    }

    statement();

    if (--depth < 0)
        error("nesting depth fell below 0");
}

static void
parse(void)
{
    next();
    block();
    expect(TOK_DOT);

    if (type != 0)
        error("extra tokens at end of file");
}

/*
 * Main.
 */

int
main(int argc, char *argv[])
{
    char *startp;

    if (argc != 2) {
        (void) fputs("usage: pl0c file.pl0\n", stderr);
        exit(1);
    }

    readin(argv[1]);
    startp = raw;

    parse();

    free(startp);

    return 0;
}
How do we know our front end is correct, anyway? Yes, you can take my word for it. But don't take my word for it. Have test programs that you can use to verify the correctness of the compiler. As we turn our attention to the code generator, we will have to make tweaks to the parser to store metadata about what we are parsing and pass that along to the code generator. It would be good to ensure, at each step along the way, that we don't let any errors creep into the parser. And what if we wanted to add new reserved words or fix the unary minus deficiency in the PL/0 grammar? It would be good to know we didn't break anything in those cases.
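One way to do that, sketched below, is a small driver that runs the compiler over a collection of known-good and known-bad programs and checks the exit status: 0 for programs that should validate, nonzero for programs that should be rejected. This is only a sketch: the ./pl0c path and the .pl0 file names are assumptions, and it simply shells out with system() rather than doing anything clever.

#include <stdio.h>
#include <stdlib.h>

struct test {
    const char *file;   /* PL/0 source file to feed the compiler */
    int should_pass;    /* 1 = expect it to validate, 0 = expect an error */
};

/* Hypothetical test files; substitute whatever programs you collect. */
static const struct test tests[] = {
    { "good/nested-begin.pl0", 1 },
    { "bad/missing-dot.pl0", 0 }
};

int
main(void)
{
    char cmd[256];
    size_t i;
    int failures = 0, passed;

    for (i = 0; i < sizeof(tests) / sizeof(tests[0]); i++) {
        /* Run the validator, discard its output, keep its exit status. */
        (void) snprintf(cmd, sizeof(cmd),
            "./pl0c %s > /dev/null 2>&1", tests[i].file);
        passed = (system(cmd) == 0);

        if (passed != tests[i].should_pass) {
            (void) fprintf(stderr, "FAIL: %s\n", tests[i].file);
            ++failures;
        }
    }

    (void) printf("%d failure(s)\n", failures);

    return failures ? 1 : 0;
}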