Brian Robert Callahan

academic, developer, with an eye towards a brighter techno-social life




2021-04-11
Demystifying programs that create programs, part 5: Processing our first opcode

All source code for this blog post can be found here.

Now that our Intel 8080 assembler is able to parse lines of assembly code, we turn our attention to processing the results of our parsing into object code. We have two different processes to undertake: one for each pass.

Preparing for a process function

Let's create a new function called process that will make decisions based on which mnemonic we have. Let's call it from our assemble function, once our parse function returns, as that means we have a set of actionable tokens:

static void assemble(string[] lines, string outfile)
{
    pass = 1;
    for (lineno = 0; lineno < lines.length; lineno++) {
        parse(lines[lineno]);
        process();
    }

    pass = 2;
    for (lineno = 0; lineno < lines.length; lineno++) {
        parse(lines[lineno]);
        process();
    }

    fileWrite(outfile);
}

All we did here is add a call to process after each call to parse.
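As a reminder of why the loop runs twice, here is a minimal sketch of the two-pass idea in isolation. It is not part of our assembler: the Line struct and the gatherLabels and resolveTargets functions are hypothetical names invented for this illustration, and the sizes are made up. The point is only that a label used before it is defined can be resolved once the first pass has recorded every address.

```d
import std.stdio;

/// Hypothetical, simplified line: an optional label, an instruction size,
/// and an optional label that the instruction refers to.
struct Line
{
    string label;
    ushort size;
    string target;
}

/// Pass 1: walk the lines and record the address of every label.
ushort[string] gatherLabels(Line[] lines)
{
    ushort[string] table;
    ushort addr = 0;
    foreach (line; lines) {
        if (line.label.length > 0)
            table[line.label] = addr;
        addr += line.size;
    }
    return table;
}

/// Pass 2: every target can now be looked up, even forward references.
/// (An unknown target would throw here; the sketch does not handle that.)
ushort[] resolveTargets(Line[] lines, ushort[string] table)
{
    ushort[] resolved;
    foreach (line; lines) {
        if (line.target.length > 0)
            resolved ~= table[line.target];
    }
    return resolved;
}

void main()
{
    // A 3-byte instruction that refers to "end", defined two lines later.
    Line[] prog = [
        Line("start", 3, "end"),
        Line("", 1, ""),
        Line("end", 1, ""),
    ];
    auto table = gatherLabels(prog);
    writeln(resolveTargets(prog, table)); // prints [4]
}
```

With a single pass, the address of end would not yet be known when the first line is processed.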

Writing the process function

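What exactly should this process function do? It should figure out which mnemonic we have, check that the arguments are valid for that mnemonic, and then act according to the pass: during the first pass, record any label in the symbol table and advance the address counter by the size of the instruction; during the second pass, output the object code.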

That's a good bit of work. Let's start simple: let's perform this work for the nop instruction.

We are going to take an extremely easy approach to things: we will have a gigantic if ... else if ... else statement that contains every legal mnemonic. That does mean that in theory you could use the name of a mnemonic or a register as a label. But if you do that, you're on your own. We won't worry about such things.

Here is the initial logic for our process function:

/**
 * Figure out which mnemonic we have.
 */
static void process()
{
    /**
     * Special case for if you have a label by itself on a line.
     * Or have a totally blank line.
     */
    if (op.empty && a1.empty && a2.empty)
        return;

    /**
     * List of all valid mnemonics.
     */
    if (op == "nop")
        nop();
    else
        err("unknown mnemonic: " ~ op);
}

We include a special case for when we do not have a mnemonic, since that is legal. It could be a label and a comment, just a comment, just a label, or a totally blank line. In all those cases, we are finished processing, because we have no opcode to process. Only if we have an opcode to process should we descend into what will become quite a long if ... else if ... else statement. For now though, we only have one valid mnemonic, nop. If we reach the end of that statement without finding a valid mnemonic, we should alert the user that their mnemonic is invalid and quit.

For organizational purposes, we will make the individual logic for each opcode its own function. Hence the nop function.

Different actions for different passes

Here is where we take different actions for different passes. Let's create a new function, which I am going to call passAct (short for pass action), that checks which pass we are in and then performs the correct action for that pass:

/**
 * Take action depending on which pass this is.
 */
static void passAct(ushort size, int outbyte)
{
    if (pass == 1) {
        /* Add new symbol if we have a label.  */
        if (!lab.empty)
            addsym();

        /* Increment address counter by size of instruction.  */
        addr += size;
    } else {
        /**
         * Output the byte representing the opcode.
         * If the opcode carries additional information
         *   (e.g., immediate or address), we will output that
         *   in a separate helper function.
         */
        if (outbyte >= 0)
            output ~= cast(ubyte)outbyte;
    }
}

Let's try out our new passAct in the case where we have no mnemonic. We might add a line to that section of the process function that results in this:

/**
 * Special case for if you put a label by itself on a line.
 * Or have a totally blank line.
 */
if (op.empty && a1.empty && a2.empty) {
    passAct(0, -1);
    return;
}

What does the addition of this one line of code do? It says that if we have no mnemonic, we may still have a label, so let's descend into passAct. Label declarations are not Intel 8080 instructions in and of themselves, so they produce no output and have a size of 0. Our first argument to passAct in this case is therefore 0, and because we have nothing to output, our second argument is -1. Any time the second argument to passAct is less than 0, no output is produced, which is exactly what we want here. Importantly, if there is a label, we add it to our symbol table during the first pass with the addsym function. Let's tackle that now.
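Before moving on, the effect of those two arguments can be sketched in isolation. This is only an illustration under assumptions: the State struct, passActSketch, and runSketch are hypothetical names, the state is passed explicitly instead of living in globals, and labels are stored as bare names. The decision logic mirrors passAct.

```d
import std.stdio;

/// Hypothetical container for the state the real assembler keeps in globals.
struct State
{
    int pass;
    ushort addr;
    ubyte[] output;
    string[] labels;     // names only, to keep the sketch short
}

/// Same decision logic as passAct, but operating on an explicit State.
void passActSketch(ref State st, string lab, ushort size, int outbyte)
{
    if (st.pass == 1) {
        if (lab.length > 0)
            st.labels ~= lab;
        st.addr += size;
    } else if (outbyte >= 0) {
        st.output ~= cast(ubyte)outbyte;
    }
}

/// Run both passes over two lines: a label by itself, then a nop.
State runSketch()
{
    State st;
    foreach (pass; 1 .. 3) {
        st.pass = pass;
        passActSketch(st, "start", 0, -1);  // label alone: size 0, no output
        passActSketch(st, "", 1, 0x00);     // nop: size 1, emits 0x00
    }
    return st;
}

void main()
{
    auto st = runSketch();
    writeln(st.labels, " ", st.addr, " ", st.output);
}
```

After both passes the sketch holds one recorded label, an address counter of 1, and a single 0x00 byte of output. The label-only line contributed nothing to the address counter or the output.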

Symbol tables

We need a data structure to hold onto all the labels and their addresses. Each individual label and address pair will live in a struct containing a string and a ushort. We will have many labels in our programs, though, so a single struct alone will not be sufficient. There are a couple of different data structures we could use to keep all the structs together. If we were going to be writing the next production quality assembler, we might choose a hash table since that allows really fast lookup. Turbo Pascal and indeed even the original Pascal compiler from Niklaus Wirth and his team eventually settled on the simpler singly linked list.

We are going to settle on something even simpler. In D, you are allowed to make arrays of structs and you can append to such an array with the same ~ operator we know about from our strings.
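If appending to an array of structs is new to you, here is a minimal sketch of the idea next to the string append we already know. The Pair struct and the buildPairs function are hypothetical and exist only for this illustration.

```d
import std.stdio;

/// Hypothetical pair type, used only for this illustration.
struct Pair
{
    string name;
    ushort value;
}

/// Build a small array of structs by appending one entry at a time.
Pair[] buildPairs()
{
    Pair[] pairs;                 // starts out empty
    pairs ~= Pair("start", 0);    // append one struct
    pairs ~= Pair("loop", 3);     // and another
    return pairs;
}

void main()
{
    string greeting = "hello";
    greeting ~= ", world";        // the string append we already know

    auto pairs = buildPairs();
    writeln(greeting, ": ", pairs.length, " entries");
}
```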

Let's create the data structure and new array. I am putting these in the global scope:

/**
 * Individual symbol table entry.
 */
struct symtab
{
    string lab;         /// Symbol name
    ushort value;       /// Symbol value
};

/**
 * Symbol table is an array of entries.
 */
static symtab[] stab;

Finally, let's create a new function called addsym:

/**
 * Add a symbol to the symbol table.
 */
static void addsym()
{
    for (size_t i = 0; i < stab.length; i++) {
        if (lab == stab[i].lab)
            err("duplicate label: " ~ lab);
    }

    symtab newsym = { lab, addr };
    stab ~= newsym;
}

We iterate over the symbol table, one entry at a time, and compare the current label name with the name in the symbol table for that entry. If there is a match, we have a problem. You cannot redefine labels once they are defined. If we have a new label, we create a new symtab struct with the current label and the current address. Then we append this new symtab entry to the symbol table itself.

You might notice that we have no way of removing symbol table entries from the array. That is correct. It is a tried and true strategy for these kinds of short-lived programs to allocate memory as they need it and then let the hosted environment clean up the mess once it is finished. We are going to do the same. It makes our lives so much easier.
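For comparison, the hash table option mentioned earlier could be sketched with D's built-in associative arrays. This is not what our assembler does; addSymbolSketch is a hypothetical name, and the sketch returns false on a duplicate rather than calling err so that it stands on its own.

```d
import std.stdio;

/// Hypothetical alternative to addsym: the table maps label name -> address.
/// Returns false if the label already exists, true if it was added.
bool addSymbolSketch(ref ushort[string] table, string name, ushort address)
{
    if (name in table)
        return false;             // duplicate label
    table[name] = address;
    return true;
}

void main()
{
    ushort[string] table;
    writeln(addSymbolSketch(table, "start", 0));   // first definition
    writeln(addSymbolSketch(table, "start", 5));   // redefinition is rejected
    writeln(table["start"]);                       // original address is kept
}
```

The in operator replaces the explicit loop over every entry. For the handful of labels in our test programs the plain array is perfectly adequate, so we will stay with it.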

Processing the nop instruction

We are finally ready to process our nop instructions. Let's make a new function named nop. We should also take the time to ensure that we have the correct arguments for a nop, which is no arguments at all. And then if we do, we know we have a valid nop instruction and we can move to having passAct perform the correct action depending on which pass we are in.

It would be helpful to have our Intel 8080 opcode table open, since we need to know the size of the nop instruction, which is 1:

/**
 * nop (0x00)
 */
static void nop()
{
    argcheck(a1.empty && a2.empty);
    passAct(1, 0x00);
}

It looks like we need one last new function, argcheck. We will use this at the beginning of every opcode function to ensure the arguments are correct for that mnemonic. If they are correct, we tell passAct that nop is one byte in size and encodes to 0x00.

Argument checking

Argument checking is quite straightforward. We make sure the expression given as an argument evaluates to true and provide an error if it does not:

/**
 * Check arguments.
 */
static void argcheck(bool passed)
{
    if (passed == false)
        err("arguments not correct for mnemonic: " ~ op);
}

Trying it out

Here is the complete code of our assembler so far:

import std.stdio;
import std.file;
import std.algorithm;
import std.string;
import std.conv;
import std.exception;

/**
 * Line number.
 */
static size_t lineno;

/**
 * Pass.
 */
static int pass;

/**
 * Output stored in memory until we're finished.
 */
static ubyte[] output;

/**
 * Address for labels.
 */
static ushort addr;

/**
 * Intel 8080 assembler instruction.
 */
static string lab;      /// Label
static string op;       /// Instruction mnemonic
static string a1;       /// First argument
static string a2;       /// Second argument
static string comm;     /// Comment

/**
 * Individual symbol table entry.
 */
struct symtab
{
    string lab;         /// Symbol name
    ushort value;       /// Symbol value
};

/**
 * Symbol table is an array of entries.
 */
static symtab[] stab;

/**
 * Top-level assembly function.
 * Everything cascades downward from here.
 * Repeat the parsing twice.
 * Pass 1 gathers symbols and their addresses/values.
 * Pass 2 emits code.
 */
static void assemble(string[] lines, string outfile)
{
    pass = 1;
    for (lineno = 0; lineno < lines.length; lineno++) {
        parse(lines[lineno]);
        process();
    }

    pass = 2;
    for (lineno = 0; lineno < lines.length; lineno++) {
        parse(lines[lineno]);
        process();
    }

    fileWrite(outfile);
}

/**
 * After all code is emitted, write it out to a file.
 */
static void fileWrite(string outfile) {
    import std.file : write;

    write(outfile, output);
}

/**
 * Parse each line into (up to) five tokens.
 */
static void parse(string line) {
    /* Reset all our variables.  */
    lab = null;
    op = null;
    a1 = null;
    a2 = null;
    comm = null;

    /* Remove any whitespace at the beginning of the line.  */
    auto preprocess = stripLeft(line);

    /* Split comment from the rest of the line.  */
    auto splitcomm = preprocess.findSplit(";");
    if (!splitcomm[2].empty)
        comm = strip(splitcomm[2]);

    /* Split second argument from the remainder.  */
    auto splita2 = splitcomm[0].findSplit(",");
    if (!splita2[2].empty)
        a2 = strip(splita2[2]);

    /* Split first argument from the remainder.  */
    auto splita1 = splita2[0].findSplit("\t");
    if (!splita1[2].empty) {
        a1 = strip(splita1[2]);
    } else {
        splita1 = splita2[0].findSplit(" ");
        if (!splita1[2].empty) {
            a1 = strip(splita1[2]);
        }
    }

    /* Split op from label.  */
    auto splitop = splita1[0].findSplit(":");
    if (!splitop[1].empty) {
        op = strip(splitop[2]);
        lab = strip(splitop[0]);
    } else {
        op = strip(splitop[0]);
    }

    /**
     * Fixup for the label: op case.
     */
    auto opFix = a1.findSplit("\t");
    if (!opFix[1].empty) {
        op = strip(opFix[0]);
        a1 = strip(opFix[2]);
    } else {
        opFix = a1.findSplit(" ");
        if (!opFix[1].empty) {
            op = strip(opFix[0]);
            a1 = strip(opFix[2]);
        } else {
            if (op.empty && !a1.empty && a2.empty) {
                op = a1;
                a1 = null;
            }
        }
    }
}

/**
 * Figure out which op we have.
 */
static void process()
{
    /**
     * Special case for if you put a label by itself on a line.
     * Or have a totally blank line.
     */
    if (op.empty && a1.empty && a2.empty) {
        passAct(0, -1);
        return;
    }

    /**
     * List of all valid mnemonics.
     */
    if (op == "nop")
        nop();
    else
        err("unknown mnemonic: " ~ op);
}

/**
 * Take action depending on which pass this is.
 */
static void passAct(ushort size, int outbyte)
{
    if (pass == 1) {
        /* Add new symbol if we have a label.  */
        if (!lab.empty)
            addsym();

        /* Increment address counter by size of instruction.  */
        addr += size;
    } else {
        /**
         * Output the byte representing the opcode.
         * If the opcode carries additional information
         *   (e.g., immediate or address), we will output that
         *   in a separate helper function.
         */
        if (outbyte >= 0)
            output ~= cast(ubyte)outbyte;
    }
}

/**
 * Add a symbol to the symbol table.
 */
static void addsym()
{
    for (size_t i = 0; i < stab.length; i++) {
        if (lab == stab[i].lab)
            err("duplicate label: " ~ lab);
    }

    symtab newsym = { lab, addr };
    stab ~= newsym;
}

/**
 * nop (0x00)
 */
static void nop()
{
    argcheck(a1.empty && a2.empty);
    passAct(1, 0x00);
}

/**
 * Check arguments.
 */
static void argcheck(bool passed)
{
    if (passed == false)
        err("arguments not correct for mnemonic: " ~ op);
}

/**
 * Nice error messages.
 */
static void err(string msg)
{
    stderr.writeln("a80: " ~ to!string(lineno + 1) ~ ": " ~ msg);
    enforce(0);
}

/**
 * All good things start with a single function.
 */
void main(string[] args)
{
    /**
     * Make sure the user provides only one input file.
     */
    if (args.length != 2) {
        stderr.writeln("usage: a80 file.asm");
        return;
    }

    /**
     * Create an array of lines from the input file.
     */
    string[] lines = splitLines(cast(string)read(args[1]));

    /**
     * Name output file the same as the input but with .com ending.
     */
    auto split = args[1].findSplit(".asm");
    auto outfile = split[0] ~ ".com";

    /**
     * Do the work.
     */
    assemble(lines, outfile);
}

Compile the assembler, then create a new file named nop.asm containing a single nop. Run it through the newly compiled assembler and it should produce a file named nop.com. If you open that file with our disassembler from part 1 of this series, it should display 0000 nop, confirming that we correctly assembled our program. Congratulations! You have created a working assembler. You can even add a label and the assembler will still produce the same output! Excellent.

Of course, we still have a ways to go before we can say we have a finished assembler. But let's not overlook how far we've come: we are parsing 8080 assembly language and we can always output the correct object code for a subset of that language.

Next time

Next, we will continue to fill out our process if ... else if ... else statement. We need to follow the same process with all the remaining opcodes.
