Brian Robert Callahan
academic, developer, with an eye towards a brighter techno-social life
All source code for this blog post can be found here.
Now armed with a complete PL/0 compiler, we have a few last bits to clean up. The one I want to cover today is basic input and output. For us, that will mean the ability to print number and input numbers. Shorter than yesterday for sure, but very useful.
We won't be printing strings today. That will come later. You can fake it for now with repeated character printing.
Writing out is much more simple than reading in, so we'll start with that. Let's start by devising two new reserved words for writing.
Eventually, we will want to write numbers and characters (and strings). But we don't have a way to know if a particular identifier should be printed as a character or as a number based on the information we are storing about the identifier. Think about C: there is the putc
family which assumes characters always and there is the printf
family where you tell it how to interpret identifiers. To make our lives easy, we will have a function specifically for writing out numbers and another for writing out characters. I like simple to remember things, so I am going to use writeInt
for numbers and writeChar
for characters.
I added #define TOK_WRITEINT 'w'
and #define TOK_WRITECHAR 'H'
to our token list. Next was to teach the lexer what token strings correspond to these new token types:
else if (!strcmp(token, "writeInt")) return TOK_WRITEINT; else if (!strcmp(token, "writeChar")) return TOK_WRITECHAR;
I placed the above at the end of the ident()
function.
Next is to think about the grammar. These will be new types of statements, so their parsing logic will go in the statement()
function:
case TOK_WRITEINT: expect(TOK_WRITEINT); if (type == TOK_IDENT || type == TOK_NUMBER) { if (type == TOK_IDENT) symcheck(CHECK_RHS); cg_writeint(); } if (type == TOK_IDENT) expect(TOK_IDENT); else if (type == TOK_NUMBER) expect(TOK_NUMBER); else error("writeInt takes an identifier or a number"); break; case TOK_WRITECHAR: expect(TOK_WRITECHAR); if (type == TOK_IDENT || type == TOK_NUMBER) { if (type == TOK_IDENT) symcheck(CHECK_RHS); cg_writechar(); } if (type == TOK_IDENT) expect(TOK_IDENT); else if (type == TOK_NUMBER) expect(TOK_NUMBER); else error("writeChar takes an identifier or a number"); break;
As we can write out any right-hand side identifier or number, that is what we semantically check for. We'll also include a helpful error message if you tried writing out a semantically incorrect item. Yes, it's a good bit of duplicated logic but as always: correct first.
All that's left to do is write the code generator parts. I am going to use fprintf
for this. The cg_writechar()
function will print and unsigned character and the cg_writeint
function will print a long:
static void cg_writechar(void) { aout("(void) fprintf(stdout, \"%%c\", (unsigned char) %s);", token); } static void cg_writeint(void) { aout("(void) fprintf(stdout, \"%%ld\", (long) %s);", token); }
We are also going to need a header in our output C code. The fprintf()
function lives in stdio.h
. Let's create a new code generator function, cg_init()
, that will be run immediately when we enter the parser. Its job is to write out all the C headers and other boilerplate code we will need.:
static void cg_init(void) { aout("#include <stdio.h>\n"); }
And that's it! We can now immediately begin using writeInt
and writeChar
in our PL/0 programs. One thing to notice is that you don't get a newline after either so you'll need to put a writeChar 10
if you want a newline.
Also, we'll handle characters in the next episode. Characters and strings have a bit of a convoluted history in Pascal and are worth taking the time to discuss them completely separately.
This is a good bit more difficult. Not all of it, mind you. Reading in characters is no more difficult than writing out characters. It's reading in numbers that is difficult, since you don't know in advance how many characters it will be. We know that there is a maximum, it's the number of characters in LONG_MIN
. We can hold that with a 24-byte buffer, with some space left over, for ang LONG_MIN
that is 64-bit or smaller.
Additionally, we don't know in advance if a program will want to read in a number or not. And by the time we do figure it out, it's too late to deal with that information successfully. We will need to add to cg_init()
to always provide a buffer. The C compiler will remove it for us if it goes unused:
static void cg_init(void) { aout("#include <stdio.h>\n\n"); aout("static char __stdin[24];\n\n"); }
Hopefully no one ever needs an identifier named __stdin
. If someone does, we can mangle this identifier further.
Like the writes, we'll have to different read statements: readInt
for reading in numbers and readChar
for reading in characters. Let's add a #define TOK_READINT 'R'
and #define TOK_READCHAR 'h'
to our token list. Just the same as writing out, let's go ahead and teach the lexer to recognize these token strings:
else if (!strcmp(token, "readInt")) return TOK_READINT; else if (!strcmp(token, "readChar")) return TOK_READCHAR;
Additionally, I am going to add a little bit of syntactic sugar to read statements. You may optionally place the word "into" in between the read reserved word and the identifier. This may well make the intention of the code more clear. So let's add a #define TOK_INTO 'n'
to the token list and teach the lexer about it:
else if (!strcmp(token, "into")) return TOK_INTO;
And much the same, let's teach the parser about these two new statements:
case TOK_READINT: expect(TOK_READINT); if (type == TOK_INTO) expect(TOK_INTO); if (type == TOK_IDENT) { symcheck(CHECK_LHS); cg_readint(); } expect(TOK_IDENT); break; case TOK_READCHAR: expect(TOK_READCHAR); if (type == TOK_INTO) expect(TOK_INTO); if (type == TOK_IDENT) { symcheck(CHECK_LHS); cg_readchar(); } expect(TOK_IDENT); }
Noticed something interesting? These statements check that the identifier is a left-hand side identifier! Even though it might look funny, it makes sense: a read statement is a special-case of an assignment statement.
Now, the code generator for reading in characters:
static void cg_readchar(void) { aout("%s=(unsigned char) fgetc(stdin);", token); }
Here, we use fgetc()
to read in a character from stdin
.
And now for the difficult part, reading in numbers. We need to read the whole number into a buffer, convert it, make sure the conversion is valid, and finally assign the value of the conversion:
static void cg_readint(void) { aout("(void) fgets(__stdin, sizeof(__stdin), stdin);\n"); aout("if(__stdin[strlen(__stdin) - 1] == '\\n')"); aout("__stdin[strlen(__stdin) - 1] = '\\0';"); aout("%s=(long) strtonum(__stdin, LONG_MIN, LONG_MAX, &__errstr);\n", token); aout("if(__errstr!=NULL){"); aout("(void) fprintf(stderr, \"invalid number: %%s\\n\", __stdin);"); aout("exit(1);"); aout("}"); }
Looks like we're going to need more C headers in our output. We will also need another global, __errstr
.
Additionally, I apologize to the non-BSD people. The strtonum(3)
is so very useful and easy to use, and it's available in the OpenBSD libc. If you're on a platform that doesn't have strtonum
, you can add an #include "strtonum.c"
line in cg_init()
to make sure it always gets compiled in.
Once we've done that though, we've finished. We can now do basic input and output!
Here is the compiler so far with everything we did today:
#include <sys/stat.h> #include <ctype.h> #include <fcntl.h> #include <limits.h> #include <stdarg.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <unistd.h> #define PL0C_VERSION "1.0.0" #define CHECK_LHS 0 #define CHECK_RHS 1 #define CHECK_CALL 2 #define TOK_IDENT 'I' #define TOK_NUMBER 'N' #define TOK_CONST 'C' #define TOK_VAR 'V' #define TOK_PROCEDURE 'P' #define TOK_CALL 'c' #define TOK_BEGIN 'B' #define TOK_END 'E' #define TOK_IF 'i' #define TOK_THEN 'T' #define TOK_WHILE 'W' #define TOK_DO 'D' #define TOK_ODD 'O' #define TOK_WRITEINT 'w' #define TOK_WRITECHAR 'H' #define TOK_READINT 'R' #define TOK_READCHAR 'h' #define TOK_INTO 'n' #define TOK_DOT '.' #define TOK_EQUAL '=' #define TOK_COMMA ',' #define TOK_SEMICOLON ';' #define TOK_ASSIGN ':' #define TOK_HASH '#' #define TOK_LESSTHAN '<' #define TOK_GREATERTHAN '>' #define TOK_PLUS '+' #define TOK_MINUS '-' #define TOK_MULTIPLY '*' #define TOK_DIVIDE '/' #define TOK_LPAREN '(' #define TOK_RPAREN ')' /* * pl0c -- PL/0 compiler. * * program = block "." . * block = [ "const" ident "=" number { "," ident "=" number } ";" ] * [ "var" ident { "," ident } ";" ] * { "procedure" ident ";" block ";" } statement . * statement = [ ident ":=" expression * | "call" ident * | "begin" statement { ";" statement } "end" * | "if" condition "then" statement * | "while" condition "do" statement * | "readInt" [ "into" ] ident * | "writeInt" ( ident | number ) * | "readChar" [ "into" ] ident * | "writeChar" ( ident | number) ] . * condition = "odd" expression * | expression ( "=" | "#" | "<" | ">" ) expression . * expression = [ "+" | "-" ] term { ( "+" | "-" ) term } . * term = factor { ( "*" | "/" ) factor } . * factor = ident * | number * | "(" expression ")" . */ static char *raw, *token; static int depth, proc, type; static size_t line = 1; struct symtab { int depth; int type; char *name; struct symtab *next; }; static struct symtab *head; static void expression(void); /* * Misc. functions. */ static void error(const char *fmt, ...) { va_list ap; (void) fprintf(stderr, "pl0c: error: %lu: ", line); va_start(ap, fmt); (void) vfprintf(stderr, fmt, ap); va_end(ap); (void) fputc('\n', stderr); exit(1); } static void readin(char *file) { int fd; struct stat st; if (strrchr(file, '.') == NULL) error("file must end in '.pl0'"); if (!!strcmp(strrchr(file, '.'), ".pl0")) error("file must end in '.pl0'"); if ((fd = open(file, O_RDONLY)) == -1) error("couldn't open %s", file); if (fstat(fd, &st) == -1) error("couldn't get size"); if ((raw = malloc(st.st_size + 1)) == NULL) error("malloc failed"); if (read(fd, raw, st.st_size) != st.st_size) error("couldn't read %s", file); raw[st.st_size] = '\0'; (void) close(fd); } /* * Lexer. */ static void comment(void) { int ch; while ((ch = *raw++) != '}') { if (ch == '\0') error("unterminated comment"); if (ch == '\n') ++line; } } static int ident(void) { char *p; size_t i, len; p = raw; while (isalnum(*raw) || *raw == '_') ++raw; len = raw - p; --raw; free(token); if ((token = malloc(len + 1)) == NULL) error("malloc failed"); for (i = 0; i < len; i++) token[i] = *p++; token[i] = '\0'; if (!strcmp(token, "const")) return TOK_CONST; else if (!strcmp(token, "var")) return TOK_VAR; else if (!strcmp(token, "procedure")) return TOK_PROCEDURE; else if (!strcmp(token, "call")) return TOK_CALL; else if (!strcmp(token, "begin")) return TOK_BEGIN; else if (!strcmp(token, "end")) return TOK_END; else if (!strcmp(token, "if")) return TOK_IF; else if (!strcmp(token, "then")) return TOK_THEN; else if (!strcmp(token, "while")) return TOK_WHILE; else if (!strcmp(token, "do")) return TOK_DO; else if (!strcmp(token, "odd")) return TOK_ODD; else if (!strcmp(token, "writeInt")) return TOK_WRITEINT; else if (!strcmp(token, "writeChar")) return TOK_WRITECHAR; else if (!strcmp(token, "readInt")) return TOK_READINT; else if (!strcmp(token, "readChar")) return TOK_READCHAR; else if (!strcmp(token, "into")) return TOK_INTO; return TOK_IDENT; } static int number(void) { const char *errstr; char *p; size_t i, j = 0, len; p = raw; while (isdigit(*raw) || *raw == '_') ++raw; len = raw - p; --raw; free(token); if ((token = malloc(len + 1)) == NULL) error("malloc failed"); for (i = 0; i < len; i++) { if (isdigit(*p)) token[j++] = *p; ++p; } token[j] = '\0'; (void) strtonum(token, 0, LONG_MAX, &errstr); if (errstr != NULL) error("invalid number: %s", token); return TOK_NUMBER; } static int lex(void) { again: /* Skip whitespace. */ while (*raw == ' ' || *raw == '\t' || *raw == '\n') { if (*raw++ == '\n') ++line; } if (isalpha(*raw) || *raw == '_') return ident(); if (isdigit(*raw)) return number(); switch (*raw) { case '{': comment(); goto again; case '.': case '=': case ',': case ';': case '#': case '<': case '>': case '+': case '-': case '*': case '/': case '(': case ')': return (*raw); case ':': if (*++raw != '=') error("unknown token: ':%c'", *raw); return TOK_ASSIGN; case '\0': return 0; default: error("unknown token: '%c'", *raw); } return 0; } /* * Code generator. */ static void aout(const char *fmt, ...) { va_list ap; va_start(ap, fmt); (void) vfprintf(stdout, fmt, ap); va_end(ap); } static void cg_call(void) { aout("%s();\n", token); } static void cg_const(void) { aout("const long %s=", token); } static void cg_crlf(void) { aout("\n"); } static void cg_end(void) { aout("/* PL/0 compiler %s */\n", PL0C_VERSION); } static void cg_epilogue(void) { aout(";"); if (proc == 0) aout("return 0;"); aout("\n}\n\n"); } static void cg_init(void) { aout("#include <limits.h>\n"); aout("#include <stdio.h>\n"); aout("#include <stdlib.h>\n"); aout("#include <string.h>\n\n"); aout("static char __stdin[24];\n"); aout("static const char *__errstr;\n\n"); } static void cg_odd(void) { aout(")&1"); } static void cg_procedure(void) { if (proc == 0) { aout("int\n"); aout("main(int argc, char *argv[])\n"); } else { aout("void\n"); aout("%s(void)\n", token); } aout("{\n"); } static void cg_readchar(void) { aout("%s=(unsigned char) fgetc(stdin);", token); } static void cg_readint(void) { aout("(void) fgets(__stdin, sizeof(__stdin), stdin);\n"); aout("if(__stdin[strlen(__stdin) - 1] == '\\n')"); aout("__stdin[strlen(__stdin) - 1] = '\\0';"); aout("%s=(long) strtonum(__stdin, LONG_MIN, LONG_MAX, &__errstr);\n", token); aout("if(__errstr!=NULL){"); aout("(void) fprintf(stderr, \"invalid number: %%s\\n\", __stdin);"); aout("exit(1);"); aout("}"); } static void cg_semicolon(void) { aout(";\n"); } static void cg_symbol(void) { switch (type) { case TOK_IDENT: case TOK_NUMBER: aout("%s", token); break; case TOK_BEGIN: aout("{\n"); break; case TOK_END: aout(";\n}\n"); break; case TOK_IF: aout("if("); break; case TOK_THEN: case TOK_DO: aout(")"); break; case TOK_ODD: aout("("); break; case TOK_WHILE: aout("while("); break; case TOK_EQUAL: aout("=="); break; case TOK_COMMA: aout(","); break; case TOK_ASSIGN: aout("="); break; case TOK_HASH: aout("!="); break; case TOK_LESSTHAN: aout("<"); break; case TOK_GREATERTHAN: aout(">"); break; case TOK_PLUS: aout("+"); break; case TOK_MINUS: aout("-"); break; case TOK_MULTIPLY: aout("*"); break; case TOK_DIVIDE: aout("/"); break; case TOK_LPAREN: aout("("); break; case TOK_RPAREN: aout(")"); } } static void cg_var(void) { aout("long %s;\n", token); } static void cg_writechar(void) { aout("(void) fprintf(stdout, \"%%c\", (unsigned char) %s);", token); } static void cg_writeint(void) { aout("(void) fprintf(stdout, \"%%ld\", (long) %s);", token); } /* * Semantics. */ static void symcheck(int check) { struct symtab *curr, *ret = NULL; curr = head; while (curr != NULL) { if (!strcmp(token, curr->name)) ret = curr; curr = curr->next; } if (ret == NULL) error("undefined symbol: %s", token); switch (check) { case CHECK_LHS: if (ret->type != TOK_VAR) error("must be a variable: %s", token); break; case CHECK_RHS: if (ret->type == TOK_PROCEDURE) error("must not be a procedure: %s", token); break; case CHECK_CALL: if (ret->type != TOK_PROCEDURE) error("must be a procedure: %s", token); } } /* * Parser. */ static void next(void) { type = lex(); ++raw; } static void expect(int match) { if (match != type) error("syntax error"); next(); } static void addsymbol(int type) { struct symtab *curr, *new; curr = head; while (1) { if (!strcmp(curr->name, token)) { if (curr->depth == (depth - 1)) error("duplicate symbol: %s", token); } if (curr->next == NULL) break; curr = curr->next; } if ((new = malloc(sizeof(struct symtab))) == NULL) error("malloc failed"); new->depth = depth - 1; new->type = type; if ((new->name = strdup(token)) == NULL) error("malloc failed"); new->next = NULL; curr->next = new; } static void destroysymbols(void) { struct symtab *curr, *prev; again: curr = head; while (curr->next != NULL) { prev = curr; curr = curr->next; } if (curr->type != TOK_PROCEDURE) { free(curr->name); free(curr); prev->next = NULL; goto again; } } static void initsymtab(void) { struct symtab *new; if ((new = malloc(sizeof(struct symtab))) == NULL) error("malloc failed"); new->depth = 0; new->type = TOK_PROCEDURE; new->name = "main"; new->next = NULL; head = new; } static void factor(void) { switch (type) { case TOK_IDENT: symcheck(CHECK_RHS); /* Fallthru */ case TOK_NUMBER: cg_symbol(); next(); break; case TOK_LPAREN: cg_symbol(); expect(TOK_LPAREN); expression(); if (type == TOK_RPAREN) cg_symbol(); expect(TOK_RPAREN); } } static void term(void) { factor(); while (type == TOK_MULTIPLY || type == TOK_DIVIDE) { cg_symbol(); next(); factor(); } } static void expression(void) { if (type == TOK_PLUS || type == TOK_MINUS) { cg_symbol(); next(); } term(); while (type == TOK_PLUS || type == TOK_MINUS) { cg_symbol(); next(); term(); } } static void condition(void) { if (type == TOK_ODD) { cg_symbol(); expect(TOK_ODD); expression(); cg_odd(); } else { expression(); switch (type) { case TOK_EQUAL: case TOK_HASH: case TOK_LESSTHAN: case TOK_GREATERTHAN: cg_symbol(); next(); break; default: error("invalid conditional"); } expression(); } } static void statement(void) { switch (type) { case TOK_IDENT: symcheck(CHECK_LHS); cg_symbol(); expect(TOK_IDENT); if (type == TOK_ASSIGN) cg_symbol(); expect(TOK_ASSIGN); expression(); break; case TOK_CALL: expect(TOK_CALL); if (type == TOK_IDENT) { symcheck(CHECK_CALL); cg_call(); } expect(TOK_IDENT); break; case TOK_BEGIN: cg_symbol(); expect(TOK_BEGIN); statement(); while (type == TOK_SEMICOLON) { cg_semicolon(); expect(TOK_SEMICOLON); statement(); } if (type == TOK_END) cg_symbol(); expect(TOK_END); break; case TOK_IF: cg_symbol(); expect(TOK_IF); condition(); if (type == TOK_THEN) cg_symbol(); expect(TOK_THEN); statement(); break; case TOK_WHILE: cg_symbol(); expect(TOK_WHILE); condition(); if (type == TOK_DO) cg_symbol(); expect(TOK_DO); statement(); break; case TOK_WRITEINT: expect(TOK_WRITEINT); if (type == TOK_IDENT || type == TOK_NUMBER) { if (type == TOK_IDENT) symcheck(CHECK_RHS); cg_writeint(); } if (type == TOK_IDENT) expect(TOK_IDENT); else if (type == TOK_NUMBER) expect(TOK_NUMBER); else error("writeInt takes an identifier or a number"); break; case TOK_WRITECHAR: expect(TOK_WRITECHAR); if (type == TOK_IDENT || type == TOK_NUMBER) { if (type == TOK_IDENT) symcheck(CHECK_RHS); cg_writechar(); } if (type == TOK_IDENT) expect(TOK_IDENT); else if (type == TOK_NUMBER) expect(TOK_NUMBER); else error("writeChar takes an identifier or a number"); break; case TOK_READINT: expect(TOK_READINT); if (type == TOK_INTO) expect(TOK_INTO); if (type == TOK_IDENT) { symcheck(CHECK_LHS); cg_readint(); } expect(TOK_IDENT); break; case TOK_READCHAR: expect(TOK_READCHAR); if (type == TOK_INTO) expect(TOK_INTO); if (type == TOK_IDENT) { symcheck(CHECK_LHS); cg_readchar(); } expect(TOK_IDENT); } } static void block(void) { if (depth++ > 1) error("nesting depth exceeded"); if (type == TOK_CONST) { expect(TOK_CONST); if (type == TOK_IDENT) { addsymbol(TOK_CONST); cg_const(); } expect(TOK_IDENT); expect(TOK_EQUAL); if (type == TOK_NUMBER) { cg_symbol(); cg_semicolon(); } expect(TOK_NUMBER); while (type == TOK_COMMA) { expect(TOK_COMMA); if (type == TOK_IDENT) { addsymbol(TOK_CONST); cg_const(); } expect(TOK_IDENT); expect(TOK_EQUAL); if (type == TOK_NUMBER) { cg_symbol(); cg_semicolon(); } expect(TOK_NUMBER); } expect(TOK_SEMICOLON); } if (type == TOK_VAR) { expect(TOK_VAR); if (type == TOK_IDENT) { addsymbol(TOK_VAR); cg_var(); } expect(TOK_IDENT); while (type == TOK_COMMA) { expect(TOK_COMMA); if (type == TOK_IDENT) { addsymbol(TOK_VAR); cg_var(); } expect(TOK_IDENT); } expect(TOK_SEMICOLON); cg_crlf(); } while (type == TOK_PROCEDURE) { proc = 1; expect(TOK_PROCEDURE); if (type == TOK_IDENT) { addsymbol(TOK_PROCEDURE); cg_procedure(); } expect(TOK_IDENT); expect(TOK_SEMICOLON); block(); expect(TOK_SEMICOLON); proc = 0; destroysymbols(); } if (proc == 0) cg_procedure(); statement(); cg_epilogue(); if (--depth < 0) error("nesting depth fell below 0"); } static void parse(void) { cg_init(); next(); block(); expect(TOK_DOT); if (type != 0) error("extra tokens at end of file"); cg_end(); } /* * Main. */ int main(int argc, char *argv[]) { if (argc != 2) { (void) fputs("usage: pl0c file.pl0\n", stderr); exit(1); } readin(argv[1]); initsymtab(); parse(); return 0; }
We have implemented enough to be able to write the traditional Hello world program. It's a bit convoluted, but it showcases many of the constructs our compiler can handle:
procedure hello; var i, x; begin i := 0; x := 72; writeChar x; x := x + 29; writeChar x; x := x + 7; while i < 2 do begin writeChar x; i := i + 1 end; x := x + 3; writeChar x; writeChar 44; writeChar 32; x := x + 8; writeChar x; x := x - 8; writeChar x; x := x + 3; writeChar x; x := x - (3 * 2); writeChar x; x := x - 8; writeChar x; x := x / 3; writeChar x; writeChar 10 end; call hello .
We'll deal with arrays next. This will allow us to carve out arbitrary-sized blocks of memory for whatever purpose we'd like. After that, we'll have implemented enough to rewrite the compiler in its own language, though we may still want to add extra facilities to the C implementation to make life easier for us.
The new semester is starting up soon. So if there is a day or two before the next episode, forgive me. These blog posts take a while to write up.