Brian Robert Callahan

academic, developer, with an eye towards a brighter techno-social life



[prev]
[next]

2021-04-10
Demystifying programs that create programs, part 4: Parsing

All source code for this blog post can be found here.

On this episode of building our Z80 assembler, we are going to learn how to parse a line of assembly into tokens. We'll need to do this so that we can figure out what that line of assembly is trying to tell us. Then we will be able to generate the correct object code for that line.

You may have heard about concepts such as abstract syntax trees, or top-down and bottom-up parsers, or LL and LALR parsers. We are going to discuss none of that. We are instead going to take a much more direct approach: we will assume every line of assembly is its own independent universe. We can therefore parse a line and generate object code in one step. Once we have done that, we will discard the line we are currently working with, read the next line, and repeat the process.

Creating tokens

Recall the structure of a line of 8080 assembly language:

[label:] [op [arg1[, arg2]]] [; comment]

This means we can have up to five tokens per line. It is really only four, but I am going to have our assembler remember the comment if there is one. Maybe you will want to print the comments during assembly.

It is quite important that we have our assembler know the label, op, and arguments. Remember too that we will be making two passes through our source code: first to record the addresses of all the label declarations; second to generate object code. We will completely parse each line each pass, meaning we will perform two complete parses of the source code. It will perhaps take a little more time to do this but it is easy and it means that once we complete the first pass, we know that we are good during the code generation pass. Assembly time is still incredibly fast: on my OpenBSD laptop, a 2500 line program took 0.01 seconds to assemble. Assuming an effectively impossible assembly program of 65536 lines, that works out to be a quarter of a second to assemble such a huge program. I think we will be fine in terms of speed.

We'll need some more variables. Strings, in particular. Let's make them globals.

/**
 * Intel 8080 assembler instruction.
 */
static string lab;         /// Label
static string op;          /// Instruction mnemonic
static string a1;          /// First argument
static string a2;          /// Second argument
static string comm;        /// Comment

And now, our initial attempt at parsing logic. Let's not forget too that in our assemble function, parse currently doesn't take any arguments. That's going to change so let's change both of those to parse(lines[lineno]); as that means we should parse the current line.

/**
 * Parse each line into (up to) five tokens.
 */
static void parse(string line) {
    /* Reset all our variables.  */
    lab = null;
    op = null;
    a1 = null;
    a2 = null;
    comm = null;

    /* Remove any whitespace at the beginning of the line.  */
    auto preprocess = stripLeft(line);

    /* Split comment from the rest of the line.  */
    auto splitcomm = preprocess.findSplit(";");
    if (!splitcomm[2].empty)
        comm = strip(splitcomm[2]);

    /* Split second argument from the remainder.  */
    auto splita2 = splitcomm[0].findSplit(",");
    if (!splita2[2].empty)
        a2 = strip(splita2[2]);

    /* Split first argument from the remainder.  */
    auto splita1 = splita2[0].findSplit("\t");
    if (!splita1[2].empty) {
        a1 = strip(splita1[2]);
    } else {
        splita1 = splita2[0].findSplit(" ");
        if (!splita1[2].empty) {
            a1 = strip(splita1[2]);
        }
    }

    /* Split opcode from label.  */
    auto splitop = splita1[0].findSplit(":");
    if (!splitop[1].empty) {
        op = strip(splitop[2]);
        lab = strip(splitop[0]);
    } else {
        op = strip(splitop[0]);
    }
}

Perhaps a bit messy, but let's talk it out. We are also making some design decisions that will prevent us from being an exact replica of the CP/M assembler. But that's fine. We don't need to be an exact replica. We will be good enough to produce real software and the differences won't be all that significant by the end.

First thing we do is make sure all the variables are null. Since any and all tokens are optional, we want to reset all the variables to ensure we don't accidentally use stale information from a previous line of assembly.

Next, we remove any whitespace from the beginning of the line. There may or may not be any, but it doesn't hurt to make sure. The stripLeft function from the D standard library will do this for us. By the way, I am going to be using the auto type for most of the variables inside all our functions. This tells the D compiler to figure out what type this variable should be based on context. One less thing we need to worry about. We see the same .findSplit method that we used to do the fancy output file renaming from the first blog post on our assembler. We are using it now to parse our line from right to left. That might seem a little unusual at first, and perhaps parsing from left to right would allow us to exactly replicate the CP/M assembler syntax. But since I don't care about making an exact replica, and I can live with the limitations this will create, I think this will allow us to more easily see how the tokens are extracted. We can start a list of things to improve upon once we have a fully working assembler, and perhaps our parsing system will be one of those things.

First we are looking for the first instance of a semicolon. The semicolon is the comment character, so everything from that point onwards must be our comment. Remember too that .findSplit returns three strings: the left of the split point, the split point itself, and the right of the split point. If the split point doesn't exist, meaning there is no comment on this line, only the left of the split point string will exist (and it will be unchanged from the original string). In order for us to know if we should assign the comment to our comm variable, we check to see if the string right of the split point has content (is not empty) using the .empty method. If there is something there it's our comment, and in that case we should assign it to our comm variable. I also use strip to remove all whitespace both at the beginning and at the end of the token, since we don't want any leading or trailing whitespace in our tokens.

Next would be our arguments. If there are two arguments, we know they are separated by a comma. Therefore, we next take the left of the comment string split and split that string on a comma. We then do the same check on right of the newly split string to see if there is a second argument and if there is assign that to the a2 variable.

Next is the first argument. That is on whitespace, so we need to check for a tab first and if there is no tab then check for a space. Perform the same check and assign logic.

Finally we check for a colon, since that separates our opcode from our label declaration. Perform the same logic except this time don't forget to assign the values to both op and label if both exist. In the original CP/M assembler, a label declaration did not have to end with a colon. But it makes it easier for us to parse if we enforce a colon, and I happen to think it looks a little nicer and makes 8080 assembly code a little bit easier to read. So we will require the colon at the end of a label declaration.

After we do this, we will have all our tokens. And we will have the correct tokens for each line. And tokens that do not exist on that line will have their corresponding variable be empty. Neat!

Or, well, almost...

There is actually one small flaw in this parsing code. It turns out that if you have a label on the same line as an opcode, the opcode may accidentally end up as the first argument. So I will introduce a new idea that is sure to annoy purists but I believe is not only fine for us as beginners, but really happens in the real world: we will declare that misparse a special case and have code that checks and adjusts our variables if we detect the misparse. You are of course welcome to rewrite any and all code, but let's wait until we have a finished assembler before we do any rewriting. Better to have something totally working that we can then improve than not have something working.

Here is the logic for our special case. It goes at the end of our parsing code:

/**
 * Fixup for the label: op case.
 */
auto opFix = a1.findSplit("\t");
if (!opFix[1].empty) {
    op = strip(opFix[0]);
    a1 = strip(opFix[2]);
} else {
    opFix = a1.findSplit(" ");
    if (!opFix[1].empty) {
        op = strip(opFix[0]);
        a1 = strip(opFix[2]);
    } else {
        if (op.empty && !a1.empty && a2.empty) {
            op = a1;
            a1 = null;
        }
    }
}

What is really happening is that both the opcode and the first argument end up together in a1 in this special case. We need to check both the tab and the space, since we need to divide the opcode and the argument and they are separated by whitespace. I check tab first and I specifically check to see if the split point string exists and then take the appropriate action. As a last sanity check, if there is no whitespace, I check to see if op is empty but a1 is not, and if that's the case, I shift a1 to op, since that's clearly what it is. There can never be a case where there are arguments but no opcode.

We should be good now. That doesn't mean we won't have more special cases in the future. But for now, this is good. And it is very likely to take us most of the way through our assembler. So let's stick with it.

A good place to pause

While this is not the longest blog post, it is sufficiently important that we should stop here for today and make sure that we truly understand exactly what our parser is doing and how it works. Then we can move on.

Current state of the assembler

Here is the complete code for our assembler so far:

import std.stdio;
import std.file;
import std.algorithm;
import std.string;
import std.conv;
import std.exception;

/**
 * Line number.
 */
static size_t lineno;

/**
 * Pass.
 */
static int pass;

/**
 * Output stored in memory until we're finished.
 */
static ubyte[] output;

/**
 * Address for labels.
 */
static ushort addr;

/**
 * Intel 8080 assembler instruction.
 */
static string lab;         /// Label
static string op;          /// Instruction mnemonic
static string a1;          /// First argument
static string a2;          /// Second argument
static string comm;        /// Comment

/**
 * Top-level assembly function.
 * Everything cascades downward from here.
 * Repeat the parsing twice.
 * Pass 1 gathers symbols and their addresses/values.
 * Pass 2 emits code.
 */
static void assemble(string[] lines, string outfile)
{
    pass = 1;
    for (lineno = 0; lineno < lines.length; lineno++)
        parse(lines[lineno]);

    pass = 2;
    for (lineno = 0; lineno < lines.length; lineno++)
        parse(lines[lineno]);

    fileWrite(outfile);
}

/**
 * After all code is emitted, write it out to a file.
 */
static void fileWrite(string outfile) {
    import std.file : write;

    write(outfile, output);
}

/**
 * Parse each line into (up to) five tokens.
 */
static void parse(string line) {
    /* Reset all our variables.  */
    lab = null;
    op = null;
    a1 = null;
    a2 = null;
    comm = null;

    /* Remove any whitespace at the beginning of the line.  */
    auto preprocess = stripLeft(line);

    /* Split comment from the rest of the line.  */
    auto splitcomm = preprocess.findSplit(";");
    if (!splitcomm[2].empty)
        comm = strip(splitcomm[2]);

    /* Split second argument from the remainder.  */
    auto splita2 = splitcomm[0].findSplit(",");
    if (!splita2[2].empty)
        a2 = strip(splita2[2]);

    /* Split first argument from the remainder.  */
    auto splita1 = splita2[0].findSplit("\t");
    if (!splita1[2].empty) {
        a1 = strip(splita1[2]);
    } else {
        splita1 = splita2[0].findSplit(" ");
        if (!splita1[2].empty) {
            a1 = strip(splita1[2]);
        }
    }

    /* Split opcode from label.  */
    auto splitop = splita1[0].findSplit(":");
    if (!splitop[1].empty) {
        op = strip(splitop[2]);
        lab = strip(splitop[0]);
    } else {
        op = strip(splitop[0]);
    }

    /**
     * Fixup for the label: op case.
     */
    auto opFix = a1.findSplit("\t");
    if (!opFix[1].empty) {
        op = strip(opFix[0]);
        a1 = strip(opFix[2]);
    } else {
        opFix = a1.findSplit(" ");
        if (!opFix[1].empty) {
            op = strip(opFix[0]);
            a1 = strip(opFix[2]);
        } else {
            if (op.empty && !a1.empty && a2.empty) {
                op = a1;
                a1 = null;
            }
        }
    }
}

/**
 * Nice error messages.
 */
static void err(string msg)
{
    stderr.writeln("a80: " ~ to!string(lineno + 1) ~ ": " ~ msg);
    enforce(0);
}

/**
 * All good things start with a single function.
 */
void main(string[] args)
{
    /**
     * Make sure the user provides only one input file.
     */
    if (args.length != 2) {
        stderr.writeln("usage: a80 file.asm");
        return;
    }

    /**
     * Create an array of lines from the input file.
     */
    string[] lines = splitLines(cast(string)read(args[1]));

    /**
     * Name output file the same as the input but with .com ending.
     */
    auto split = args[1].findSplit(".asm");
    auto outfile = split[0] ~ ".com";

    /**
     * Do the work.
     */
    assemble(lines, outfile);
}

Next time

Now that we can parse lines of assembly into tokens, we can make decisions based on those tokens. These decisions will inform us what object code to generate.

You might want to add some writeln statements at the end of the parse function so that you can see the tokens. We'll remove them later but for now they might be a good debugging/verification tool. Something like this would work:

writeln(lab);
writeln(op);
writeln(a1);
writeln(a2);
writeln(comm);

And here is an assembly program that we can parse right now:

; Fibonacci in Intel 8080 assembler.
; Results in b
start:
	xra	a	; zero out a
	mov	b, a	; b = a
	mov	c, a	; c = a
	adi	01h	; a = a + 1
	mov	c, a	; c = a
	xra	a	; zero out a
loop:	add	c	; a = a + c
	cmp	c
	jc	start	; jump if carry
	mov	b, a
	mov	a, c
	mov	c, b
	jmp	loop	; jump to loop

Top

RSS