Brian Robert Callahan

academic, developer, with an eye towards a brighter techno-social life


Let's write a compiler, part 1: Introduction, selecting a language, and doing some planning

All source code for this blog post can be found here.

I think we all knew this might be coming once we finished up with our assembler. While we can in theory solve any computable problem by writing the appropriate assembly for our assembler, writing assembly code can be a bit more cumbersome than a high-level language. There is a reason why we did not write our assembler in assembly but instead chose D for the implementation language.

Today we will begin writing a compiler for a real high-level language. It will be a simple language for sure. But let's not let simplicity stand in our way. We will be able to write real programs with our compiler. How real, you ask? Let's plan on writing a compiler for our selected language and then being able to write a compiler in that same language the compiler compiles using our compiler to compile it! From zero to compiler to self-hosted compiler. It will take a good bit of work for sure, but I think we are up to the task. We'll spend this series writing our initial, non-self-hosting compiler. We'll take some time to enjoy our work, then we'll come back and embark on a second series that develops the self-hosted compiler using the compiler we will write in this series.

For today, let's select a language for our compiler. Two languages, actually. One language will be the language that the compiler understands. The other will be the language that the compiler is written in. Once we've done that, let's do some organizational work, discuss some of the different ways we could implement our compiler, and choose one of those implementation methods.

What our compiler will be, and, importantly, what it will not be

Before we get started, I want to make it clear that will we not be producing an incredibly complex compiler using state-of-the-art techniques. Nor will we be diving headfirst into compiler theory. Instead, much like our assembler, we will produce a "good enough" compiler using basic techniques. This blog series is a hands-on practical guide to creating your first compiler. Once you've mastered this compiler you can expand your horizons and begin to incorporate further enhancements to it, to all the difference pieces of the compiler. We will be writing a single-pass compiler for a simple language and immediately output our final output code as soon as our compiler has enough information to do so. That strategy is good enough for Turbo Pascal, so it's good enough for me.

A language to compile

There are a lot of programming languages out there. Rosetta Code, a programming chrestomathy site, is aware of 838 languages as of me writing this blog post, and that almost certainly represents only a fraction of all the programming languages that have been implemented, let alone devised.

We could devise our own language, but I think it might be easier for us to choose a language that already exists. To make our lives a bit easier, let's select a language that can be implemented in what is known as a single-pass compiler; that is, we will choose a language whose syntax requires our compiler to look at the input source code only one time. If you remember from our assembler, that was a two-pass assembler because we had to solve the problem of what happens if you reference a label before declaring it and so the assembler had to read the source code twice: first to figure out all the label addresses and second to use that information to generate the binary code. Whichever language we choose for our compiler to understand, it won't have this problem. Better still, let's select a language that has a history of usage in teaching compiler construction.

That might lead us to something like Pascal, which was developed specifically for teaching programming and was designed to be parsed with a single-pass compiler. But Pascal has a lot of extra complexity which, while very useful for a richer programming experience, probably is not necessary for us just learning. However, there is a Pascal-like language, also developed by Niklaus Wirth, that was developed specifically for teaching compiler construction. That language is PL/0, and humorously it is not known by Rosetta Code despite being first described in the mid-1970s and used to teach compiler construction ever since. Though it will be outside the scope of this series, if you wanted you could use our finished PL/0 compiler as a starting point to build up to a complete Pascal compiler.

A language to implement our PL/0 compiler in

I am thinking that we should choose C to implement our PL/0 compiler. Eventually we will want to have this compiler run on many different platforms, including Unix, MS-DOS, and potentially even CP/M. C is a language that will run on every platform I would want to run our PL/0 compiler on, so C works for me. C also requires no special setup so our usual development platform of vim on OpenBSD will work just fine. Finally, it will help with bootstrapping the self-hosting version of the PL/0 compiler, since we in theory could natively compile the C implementation on our target platform then use that to compile the PL/0 implementation, also on the target platform.

Some organizational considerations and a basic walkthrough of a compiler

Let's plan out our PL/0 compiler a little bit before we start writing any code. Let's start by reminding ourselves what a compiler should do by comparing it to our assembler.

The workflow of our assembler looked roughly something like this:

| Start |
+----------------+    +------------+      -------      +--------+
| Read in a line |    | Parse into |     /       \  No | Write  |
|                |--->|            |--->| Pass 1? |--->|        |
| of assembly    |    | tokens     |     \       /     | binary |
+----------------+    +------------+      -------      +--------+
        ^                                    |             |
        |                                    | Yes         |
        |                                    |             |
        |                                    V             |
        |                               +---------+        |
        |                               | Record  |        |
        |                               | address |        |
        |                               +---------+        |
        |                                    |             V
        |                                    |       -----------
        |                                    |  No  / End of    \
        +------------------------------------+-----|             |
                                                    \ assembly? /
                                                           | Yes
                                                        | End |

The assembler as we wrote it is surprisingly straightforward. We were able to treat each line as its own independent universe and generate the binary code for that line without needing to possess any additional context other than potentially the address of a symbol, which we collected during the first pass. We might be able to use a similar strategy with our compiler. However, it won't be as neat and tidy as with our assembler. With the assembler, the newline character was the instruction terminator. That won't quite be the same for PL/0.

By the way, here is the syntax for PL/0:

program		= block "." .
block		= [ "const" ident "=" number { "," ident "=" number } ";" ]
		  [ "var" ident { "," ident } ";" ]
		  { "procedure" ident ";" block ";" } statement .
statement	= [ ident ":=" expression
		  | "call" ident
		  | "begin" statement { ";" statement } "end"
		  | "if" condition "then" statement
		  | "while" condition "do" statement ] .
condition	= "odd" expression
		| expression ( "=" | "#" | "<" | ">" ) expression .
expression	= [ "+" | "-" ] term { ( "+" | "-" ) term } .
term		= factor { ( "*" | "/" ) factor } .
factor		= ident
		| number
		| "(" expression ")" .
ident		= "A-Za-z_" { "A-Za-z0-9_" } .
number		= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" .

Don't worry if this doesn't make sense to you right now. But still, try to read through it and see if you can mentally parse what the syntax is trying to tell you. If you still can't, don't worry we will see this again when we write our parser and talk through it then.

Now, let's consider the rough workflow of our compiler:

| Start |
    |                                      No
    +--+                   +------------------------------+
       |                   |                              |
       V                   |                              |
+-------------+            V                        ------------
| Read in     |   +-------------+    +-------+     / Ready to   \
|             |-->| Lex a token |--->| Parse |--->|              |
| source code |   +-------------+    +-------+     \ emit code? /
+-------------+            ^                        ------------
                           |                              |
                           |          +-----------+  Yes  |
                           |          | Emit code |<------+
                           |          +-----------+
                           |No              |
                           |                |
                           |                V
                           |          --------------
                           |         / End of       \  Yes +-----+
                           +--------|                |---->| End |
                                     \ source code? /      +-----+

We would be wise to break up these different parts of the compiler organizationally. There are three main parts for this compiler: the lexer, the parser, and the code generator. The lexer's job is to take the free-form source code and turn it into a series of tokens, the individual units of the language. The parser's job is to take that series of tokens and ensure that their ordering follows the rules of the language syntax. The code generator's job is to produce equivalent... well, equivalent what, exactly?

Well, it has to be some other language but which one to pick? It depends on how much effort we want to spend on our backend. And it doesn't even have to be just one language, either. The GCC compiler uses multiple intermediate languages in order to enable powerful optimization techniques.

We could choose a CPU and have our compiler output the equivalent assembly language for that CPU. But since we are already writing our initial compiler in C, I think we should output C. Outputting to C is a well-recognized compilation strategy. And it will allow us to have the PL/0 code we will write and feed to our compiler ultimately produce native code for whatever platform we happen to be using. It turns out that it will also save us a lot of work in our code generator.

Breaking it down into steps

There is more than one approach to converting source code into the final output code. Based on the workflow graph above, we will have all three main parts—the lexer, the parser, and the code generator—all be intertwined with each other. There will be a clear demarcation in the code which functions belong to which part, but they will still be intertwined in action. We will have a situation where the parser drives the action. The parser will call on the lexer to hand over tokens one at a time as the parser needs tokens. And once the parser has enough information to make a decision on what C code to output next, it will call the code generator to produce that C code. The code generator never gives any information back to the parser; the parser assumes that the code generator will do what it is told. The parser will repeat this cycle until it reaches the end of the source code.

In graphic form, this is how the different parts of the compiler will communicate with each other:

+-------+     +--------+     +----------------+
| Lexer |<--->| Parser |---->| Code generator |
+-------+     +--------+     +----------------+

But this is not the only way the work can be managed and the different parts of a compiler can communicate with each other. We could instead take an approach where a standalone lexer reads in source code and writes out a temporary file that contains a list of tokens with some additional contextual information. We could then have a standalone parser that reads in the temporary file produced by the lexer and parses that, producing some sort of temporary data structure like an abstract syntax tree. We could then have a standalone code generator that reads in that temporary data structure produced by the parser and writes out the final output code. This is somewhat similar to the strategy employed by the Cowgol compiler; each major task is its own separate binary and intermediate code is written out to files to be used as input for the next binary in the chain.

We could go even further and break things down into the tiniest units of work and write a separate pass, or separate standalone utility, to handle each of these individual units of work. This would be a nanopass compiler.

Our first language extension

Let's also add in this planning stage our first extension to the PL/0 language our compiler will understand. One thing that is missing from the syntax is any comment characters. It is always good to be able to put comments into our source code. In Pascal, there are two comment character sets, { ... } and (* ... *) which permit nested comments and the ability to mix and match opening and closing comment characters. I don't think I want to implement quite that much. I will choose { ... } for our PL/0 comment characters and not have nested comments. I think that is good enough for us.

On the next episode: A lexer

Even though we didn't write any code in this post, we are now ready to begin our coding adventure. Coming up, we will turn our attention to writing our first bits of code. We will write a lexer for the PL/0 language, which will allow us to feed the compiler any free-form PL/0 source code and have the compiler return to us the first important structure of a programming language: all the tokens in order from that free-form source code.