Brian Robert Callahan

academic, developer, with an eye towards a brighter techno-social life




2021-04-12
Demystifying programs that create programs, part 6: Processing more opcodes

All source code for this blog post can be found here.

Our 8080 assembler processing nop statements is great. But I suspect we will want to process other opcodes as well so that we can do real work. Today is the day that we really need our Intel 8080 opcode table, so let's open it up and get ready to teach our assembler all about the 8080 opcodes.

Working down the table

This will take a good bit of time. But we can take some shortcuts. First, where there are duplicate opcodes for an instruction, we only need to implement the primary opcode. Take nop for example. It can encode as 0x00, 0x08, 0x10, 0x18, 0x20, 0x28, 0x30, or 0x38. But we don't need to encode all of those. Just having the 0x00 encoding is enough. That means we can ignore any mnemonic in the opcode table that begins with a *, since that's how the opcode table tells us it's a duplicate encoding.

I am going to divide the opcode table into my own arbitrary quarters: 0x00-0x3f, 0x40-0x7f, 0x80-0xbf, and 0xc0-0xff. The first quarter is (roughly) incrementers, decrementers, and rotates. The second quarter is entirely devoted to mov. The third quarter is entirely devoted to arithmetic. The fourth quarter is (mostly) control operators like jumps, calls, and returns. I am going to start with the second quarter, since we can tackle a whole quarter of the opcodes with just one function.

Do we really need 64 versions of mov?

In short, yes. And one of them is the hlt instruction, but we will deal with that when we get there. Having all 64 is actually really smart: it guarantees that a mov between any two registers always encodes as a one-byte instruction. We will also now learn about the register pattern in the opcode table so that we can greatly reduce the logic we need.

Here is the base logic for encoding a mov:

/**
 * mov (0x40 + (8-bit register offset << 3) + 8-bit register offset)
 * We allow mov m, m (0x76)
 * But that will result in HLT.
 */
static void mov()
{
    argcheck(!a1.empty && !a2.empty);
    passAct(1, 0x40 + (regMod8(a1) << 3) + regMod8(a2));
}

Every mov must have two arguments, and they must both be registers. You'll also notice in my comment I mention something about an 8-bit register offset and allowing an impossible mov m, m, which encodes into a hlt instruction. Let's break it down one at a time.

The register pattern

In Intel 8080 assembly, there are seven registers: b, c, d, e, h, l, and a. There is an additional pseudo-register m which refers to the memory address pointed to by the register pair hl but can be used as any other register. That is really an implementation detail that we do not need to care about, so let's go ahead and say that m is the eighth register.

Registers are understood in the following order, with the following values:

Register  Value
b         0x00
c         0x01
d         0x02
e         0x03
h         0x04
l         0x05
m         0x06
a         0x07

If we look at the opcode table, where we have instructions that can take arbitrary registers (e.g., mov, add, sub), the registers always appear in that order. Therefore, add b comes before add c which comes before add d and so on. This makes our lives a whole lot easier in calculating the correct encoding for any particular mov instruction. A mov encoding follows this formula:

0x40 + (first register * 8) + (second register)

For added speed, any time you have a multiplication by a power of two, you can replace it with a bitshift to the left. Eight is two to the power of three, therefore we can replace (first register * 8) with (first register << 3).

Because of this, our passAct call from the mov function will have a 1 in the first argument, since all mov instructions are one byte in size, and the second argument is the byte to encode, which is the result of the formula. Since we follow the formula, we know we will always get the correct encoding.
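To check the formula by hand, here is a minimal sketch that works through a few mov encodings. The helper name movEncoding is my own invention for illustration and is not part of the assembler. It takes the two register values directly as numbers rather than looking them up from strings.

```d
import std.stdio;

/// Hypothetical helper (not in the assembler): applies the mov formula
/// to two register values that are already numbers in the range 0-7.
ubyte movEncoding(int dst, int src)
{
    return cast(ubyte)(0x40 + (dst << 3) + src);
}

void main()
{
    // Register values in opcode-table order: b = 0 ... m = 6, a = 7.
    enum b = 0;
    enum m = 6;
    enum a = 7;

    // Multiplying by 8 and shifting left by 3 give the same value.
    assert((a * 8) == (a << 3));

    writefln("mov a, b = 0x%02x", movEncoding(a, b)); // 0x40 + 0x38 + 0 = 0x78
    writefln("mov b, a = 0x%02x", movEncoding(b, a)); // 0x40 + 0x00 + 7 = 0x47
    writefln("mov m, m = 0x%02x", movEncoding(m, m)); // 0x40 + 0x30 + 6 = 0x76
}
```

The last line shows why mov m, m lands on 0x76, the cell that the opcode table gives to hlt.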

Calculating 8-bit register offsets

I call the function to calculate the offset regMod8, short for 8-bit register modifier. I already know there will be a 16-bit version, but we will worry about that later. This is what it looks like:

/**
 * Return the 8-bit register offset.
 */
static int regMod8(string reg)
{
    if (reg == "b")
        return 0x00;
    else if (reg == "c")
        return 0x01;
    else if (reg == "d")
        return 0x02;
    else if (reg == "e")
        return 0x03;
    else if (reg == "h")
        return 0x04;
    else if (reg == "l")
        return 0x05;
    else if (reg == "m")
        return 0x06;
    else if (reg == "a")
        return 0x07;
    else
        err("invalid register " ~ reg);

    /* This will never be reached, but quiets gdc.  */
    return 0;
}

This function gets passed the argument we want to check (a1 or a2, depending on context) and then iterates through the register list. Once it finds the correct register, it returns that value. If the argument is not a register, then you get an error. My gdc compiler does not appear to understand that err always produces an error, and so I add an extra return statement to prevent a warning, but in reality we will never reach that final return statement. Very recently, D added the concept of noreturn to the language, but that has not trickled down to my version of gdc. Eventually that extra return statement will go away but it is fine for us to have right now. It doesn't harm anything.
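Since a register's value is just its position in the register order, the same lookup could also be written as a search through an array. The following is only a sketch of that alternative, not what the assembler uses. The names regOrder and regIndex are made up for this example, and unlike regMod8 it returns -1 for an unknown register instead of calling err.

```d
import std.algorithm.searching : countUntil;
import std.stdio;

/// Registers in opcode-table order; a register's index is its offset.
immutable string[] regOrder = ["b", "c", "d", "e", "h", "l", "m", "a"];

/// Hypothetical alternative to regMod8: returns the offset of the
/// register, or -1 if the name is not a register.
ptrdiff_t regIndex(string reg)
{
    return regOrder.countUntil(reg);
}

void main()
{
    writeln(regIndex("b")); // 0
    writeln(regIndex("m")); // 6
    writeln(regIndex("x")); // -1, not a register
}
```

Whether this is easier to read than the if/else chain is a matter of taste. The chain has the advantage of making every value visible at a glance.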

Allowing a fake instruction

The other part of the mov function comment is that I say we permit mov m, m even though that is not a real instruction. In our opcode table, the cell that would normally contain that instruction actually contains hlt. We could add some logic to the mov function to check if the resulting encoding byte is 0x76 and error out if it is. But I don't think that's worth it. We do still need to add a hlt function:

/**
 * hlt (0x76)
 */
static void hlt()
{
    argcheck(a1.empty && a2.empty);
    passAct(1, 0x76);
}

I hope it is getting easier to read these encoding functions. A hlt instruction has no arguments, is one byte in size, and encodes to 0x76.
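If you did want to reject mov m, m, the check would be small. Here is a sketch of what it might look like. The function movChecked is hypothetical, takes register values as numbers, and throws an exception through enforce rather than going through the assembler's own err function.

```d
import std.exception : enforce;
import std.stdio;

/// Hypothetical stricter mov encoder: refuses the one combination
/// (m, m) whose encoding collides with hlt at 0x76.
ubyte movChecked(int dst, int src)
{
    immutable enc = 0x40 + (dst << 3) + src;
    enforce(enc != 0x76, "mov m, m is not a real instruction");
    return cast(ubyte) enc;
}

void main()
{
    // mov a, m: a = 7, m = 6.
    writefln("mov a, m = 0x%02x", movChecked(7, 6)); // 0x7e

    try {
        movChecked(6, 6);
    } catch (Exception e) {
        writeln("rejected: ", e.msg);
    }
}
```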

Hooking up mov and hlt

Now that our mov and hlt functions are complete, let's hook them up to the mnemonic list. Add the following to the process function:

    if (op == "nop")
        nop();
    else if (op == "mov")
        mov();
    else if (op == "hlt")
        hlt();
    else
        err("unknown mnemonic: " ~ op);

And now our assembler knows how to process over a quarter of all possible instructions!

As we add more instructions, I am going to keep them sorted by base encoding number. You can sort in any order you want, but I think mine makes things easier since it will look like the opcode table.
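The if/else chain is what our assembler uses, and it is easy to follow. For comparison only, here is a sketch of another way to do the same dispatch, using an associative array that maps each mnemonic to its function. Everything here is an assumption for illustration: the handlers are stand-ins that just print a message, and dispatch is a made-up name.

```d
import std.stdio;

// Stand-in handlers; the real ones call argcheck and passAct.
void nop() { writeln("nop handler"); }
void mov() { writeln("mov handler"); }
void hlt() { writeln("hlt handler"); }

/// Hypothetical lookup-based dispatch.
/// Returns false if the mnemonic is unknown.
bool dispatch(string op)
{
    void function()[string] handlers = [
        "nop": &nop,
        "mov": &mov,
        "hlt": &hlt
    ];

    // The in operator yields a pointer to the value, or null.
    if (auto handler = op in handlers) {
        (*handler)();
        return true;
    }
    return false;
}

void main()
{
    foreach (op; ["nop", "mov", "hlt", "jmp"]) {
        if (!dispatch(op))
            writeln("unknown mnemonic: ", op);
    }
}
```

The trade-off is that an associative array has no ordering, so you lose the property that the list reads in the same order as the opcode table.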

Adding arithmetic instructions

Now let's add the arithmetic instructions from the third quarter of the opcode table. Then we will be halfway done encoding all possible instructions. We will tackle the first and fourth quarters of the opcode table in the coming days.

There are eight arithmetic mnemonics: add, adc, sub, sbb, ana, xra, ora, and cmp. All arithmetic opcodes implicitly have the a register as the first operand, and the a register is also the location where the result of all arithmetic operations is placed. That means arithmetic instructions take only one argument, and all of these are registers. There are versions that take an immediate (an 8-bit number), but those are located in the fourth quarter of the opcode table so we won't worry about those versions just yet.

See if you can write your own functions for each of the arithmetic opcodes. They are all very similar: each takes exactly one argument which is a register, each encodes to exactly one byte, and each (like mov) has a base encoding value to which you add the result of our regMod8 function. Spend some time with the opcode table and see if you can write those functions yourself.

I hope you spent some time doing that. Check against mine here and see if you came up with the same functions I did:

/**
 * add (0x80 + 8-bit register offset)
 */
static void add()
{
    argcheck(!a1.empty && a2.empty);
    passAct(1, 0x80 + regMod8(a1));
}

/**
 * adc (0x88 + 8-bit register offset)
 */
static void adc()
{
    argcheck(!a1.empty && a2.empty);
    passAct(1, 0x88 + regMod8(a1));
}

/**
 * sub (0x90 + 8-bit register offset)
 */
static void sub()
{
    argcheck(!a1.empty && a2.empty);
    passAct(1, 0x90 + regMod8(a1));
}

/**
 * sbb (0x98 + 8-bit register offset)
 */
static void sbb()
{
    argcheck(!a1.empty && a2.empty);
    passAct(1, 0x98 + regMod8(a1));
}

/**
 * ana (0xa0 + 8-bit register offset)
 */
static void ana()
{
    argcheck(!a1.empty && a2.empty);
    passAct(1, 0xa0 + regMod8(a1));
}

/**
 * xra (0xa8 + 8-bit register offset)
 */
static void xra()
{
    argcheck(!a1.empty && a2.empty);
    passAct(1, 0xa8 + regMod8(a1));
}

/**
 * ora (0xb0 + 8-bit register offset)
 */
static void ora()
{
    argcheck(!a1.empty && a2.empty);
    passAct(1, 0xb0 + regMod8(a1));
}

/**
 * cmp (0xb8 + 8-bit register offset)
 */
static void cmp()
{
    argcheck(!a1.empty && a2.empty);
    passAct(1, 0xb8 + regMod8(a1));
}
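You may have noticed that the eight base values are themselves evenly spaced: each one is 8 more than the one before it, starting at 0x80. That means the whole third quarter follows one formula, much like mov does. Here is a sketch of that observation. The names arithOps, regOrder, and arithEncoding are hypothetical and not part of the assembler, which keeps one function per mnemonic.

```d
import std.algorithm.searching : countUntil;
import std.stdio;

/// Arithmetic mnemonics in opcode-table order.
immutable string[] arithOps = ["add", "adc", "sub", "sbb", "ana", "xra", "ora", "cmp"];

/// Registers in opcode-table order.
immutable string[] regOrder = ["b", "c", "d", "e", "h", "l", "m", "a"];

/// Hypothetical helper: 0x80 + (mnemonic index << 3) + register offset.
/// Returns -1 if either name is unknown.
int arithEncoding(string op, string reg)
{
    immutable opIdx = arithOps.countUntil(op);
    immutable regIdx = regOrder.countUntil(reg);
    if (opIdx < 0 || regIdx < 0)
        return -1;
    return cast(int)(0x80 + (opIdx << 3) + regIdx);
}

void main()
{
    writefln("add b = 0x%02x", arithEncoding("add", "b")); // 0x80
    writefln("sub c = 0x%02x", arithEncoding("sub", "c")); // 0x90 + 1 = 0x91
    writefln("cmp a = 0x%02x", arithEncoding("cmp", "a")); // 0xb8 + 7 = 0xbf
}
```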

Let's hook these up to our mnemonic list in the process function and then our assembler will be able to correctly assemble over half of all the 8080 instructions!

Current state of the assembler

Here is the assembler as it stands now:

import std.stdio;
import std.file;
import std.algorithm;
import std.string;
import std.conv;
import std.exception;

/**
 * Line number.
 */
static size_t lineno;

/**
 * Pass.
 */
static int pass;

/**
 * Output stored in memory until we're finished.
 */
static ubyte[] output;

/**
 * Address for labels.
 */
static ushort addr;

/**
 * Intel 8080 assembler instruction.
 */
static string lab;      /// Label
static string op;       /// Instruction mnemonic
static string a1;       /// First argument
static string a2;       /// Second argument
static string comm;     /// Comment

/**
 * Individual symbol table entry.
 */
struct symtab
{
    string lab;         /// Symbol name
    ushort value;       /// Symbol value
};

/**
 * Symbol table is an array of entries.
 */
static symtab[] stab;

/**
 * Top-level assembly function.
 * Everything cascades downward from here.
 * Repeat the parsing twice.
 * Pass 1 gathers symbols and their addresses/values.
 * Pass 2 emits code.
 */
static void assemble(string[] lines, string outfile)
{
    pass = 1;
    for (lineno = 0; lineno < lines.length; lineno++) {
        parse(lines[lineno]);
        process();
    }

    pass = 2;
    for (lineno = 0; lineno < lines.length; lineno++) {
        parse(lines[lineno]);
        process();
    }

    fileWrite(outfile);
}

/**
 * After all code is emitted, write it out to a file.
 */
static void fileWrite(string outfile) {
    import std.file : write;

    write(outfile, output);
}

/**
 * Parse each line into (up to) five tokens.
 */
static void parse(string line) {
    /* Reset all our variables.  */
    lab = null;
    op = null;
    a1 = null;
    a2 = null;
    comm = null;

    /* Remove any whitespace at the beginning of the line.  */
    auto preprocess = stripLeft(line);

    /* Split comment from the rest of the line.  */
    auto splitcomm = preprocess.findSplit(";");
    if (!splitcomm[2].empty)
        comm = strip(splitcomm[2]);

    /* Split second argument from the remainder.  */
    auto splita2 = splitcomm[0].findSplit(",");
    if (!splita2[2].empty)
        a2 = strip(splita2[2]);

    /* Split first argument from the remainder.  */
    auto splita1 = splita2[0].findSplit("\t");
    if (!splita1[2].empty) {
        a1 = strip(splita1[2]);
    } else {
        splita1 = splita2[0].findSplit(" ");
        if (!splita1[2].empty) {
            a1 = strip(splita1[2]);
        }
    }

    /* Split op from label.  */
    auto splitop = splita1[0].findSplit(":");
    if (!splitop[1].empty) {
        op = strip(splitop[2]);
        lab = strip(splitop[0]);
    } else {
        op = strip(splitop[0]);
    }

    /**
     * Fixup for the label: op case.
     */
    auto opFix = a1.findSplit("\t");
    if (!opFix[1].empty) {
        op = strip(opFix[0]);
        a1 = strip(opFix[2]);
    } else {
        opFix = a1.findSplit(" ");
        if (!opFix[1].empty) {
            op = strip(opFix[0]);
            a1 = strip(opFix[2]);
        } else {
            if (op.empty && !a1.empty && a2.empty) {
                op = a1;
                a1 = null;
            }
        }
    }
}

/**
 * Figure out which op we have.
 */
static void process()
{
    /**
     * Special case for if you put a label by itself on a line.
     * Or have a totally blank line.
     */
    if (op.empty && a1.empty && a2.empty) {
        passAct(0, -1);
        return;
    }

    /**
     * List of all valid mnemonics.
     */
    if (op == "nop")
        nop();
    else if (op == "mov")
        mov();
    else if (op == "hlt")
        hlt();
    else if (op == "add")
        add();
    else if (op == "adc")
        adc();
    else if (op == "sub")
        sub();
    else if (op == "sbb")
        sbb();
    else if (op == "ana")
        ana();
    else if (op == "xra")
        xra();
    else if (op == "ora")
        ora();
    else if (op == "cmp")
        cmp();
    else
        err("unknown mnemonic: " ~ op);
}

/**
 * Take action depending on which pass this is.
 */
static void passAct(ushort size, int outbyte)
{
    if (pass == 1) {
        /* Add new symbol if we have a label.  */
        if (!lab.empty)
            addsym();

        /* Increment address counter by size of instruction.  */
        addr += size;
    } else {
        /**
         * Output the byte representing the opcode.
         * If the opcode carries additional information
         *   (e.g., immediate or address), we will output that
         *   in a separate helper function.
         */
        if (outbyte >= 0)
            output ~= cast(ubyte)outbyte;
    }
}

/**
 * Add a symbol to the symbol table.
 */
static void addsym()
{
    for (size_t i = 0; i < stab.length; i++) {
        if (lab == stab[i].lab)
            err("duplicate label: " ~ lab);
    }

    symtab newsym = { lab, addr };
    stab ~= newsym;
}

/**
 * nop (0x00)
 */
static void nop()
{
    argcheck(a1.empty && a2.empty);
    passAct(1, 0x00);
}

/**
 * mov (0x40 + (8-bit register offset << 3) + 8-bit register offset)
 * We allow mov m, m (0x76)
 * But that will result in HLT.
 */
static void mov()
{
    argcheck(!a1.empty && !a2.empty);
    passAct(1, 0x40 + (regMod8(a1) << 3) + regMod8(a2));
}

/**
 * hlt (0x76)
 */
static void hlt()
{
    argcheck(a1.empty && a2.empty);
    passAct(1, 0x76);
}

/**
 * add (0x80 + 8-bit register offset)
 */
static void add()
{
    argcheck(!a1.empty && a2.empty);
    passAct(1, 0x80 + regMod8(a1));
}

/**
 * adc (0x88 + 8-bit register offset)
 */
static void adc()
{
    argcheck(!a1.empty && a2.empty);
    passAct(1, 0x88 + regMod8(a1));
}

/**
 * sub (0x90 + 8-bit register offset)
 */
static void sub()
{
    argcheck(!a1.empty && a2.empty);
    passAct(1, 0x90 + regMod8(a1));
}

/**
 * sbb (0x98 + 8-bit register offset)
 */
static void sbb()
{
    argcheck(!a1.empty && a2.empty);
    passAct(1, 0x98 + regMod8(a1));
}

/**
 * ana (0xa0 + 8-bit register offset)
 */
static void ana()
{
    argcheck(!a1.empty && a2.empty);
    passAct(1, 0xa0 + regMod8(a1));
}

/**
 * xra (0xa8 + 8-bit register offset)
 */
static void xra()
{
    argcheck(!a1.empty && a2.empty);
    passAct(1, 0xa8 + regMod8(a1));
}

/**
 * ora (0xb0 + 8-bit register offset)
 */
static void ora()
{
    argcheck(!a1.empty && a2.empty);
    passAct(1, 0xb0 + regMod8(a1));
}

/**
 * cmp (0xb8 + 8-bit register offset)
 */
static void cmp()
{
    argcheck(!a1.empty && a2.empty);
    passAct(1, 0xb8 + regMod8(a1));
}

/**
 * Return the 8-bit register offset.
 */
static int regMod8(string reg)
{
    if (reg == "b")
        return 0x00;
    else if (reg == "c")
        return 0x01;
    else if (reg == "d")
        return 0x02;
    else if (reg == "e")
        return 0x03;
    else if (reg == "h")
        return 0x04;
    else if (reg == "l")
        return 0x05;
    else if (reg == "m")
        return 0x06;
    else if (reg == "a")
        return 0x07;
    else
        err("invalid register " ~ reg);

    /* This will never be reached, but quiets gdc.  */
    return 0;
}

/**
 * Check arguments.
 */
static void argcheck(bool passed)
{
    if (passed == false)
        err("arguments not correct for mnemonic: " ~ op);
}

/**
 * Nice error messages.
 */
static void err(string msg)
{
    stderr.writeln("a80: " ~ to!string(lineno + 1) ~ ": " ~ msg);
    enforce(0);
}

/**
 * All good things start with a single function.
 */
void main(string[] args)
{
    /**
     * Make sure the user provides only one input file.
     */
    if (args.length != 2) {
        stderr.writeln("usage: a80 file.asm");
        return;
    }

    /**
     * Create an array of lines from the input file.
     */
    string[] lines = splitLines(cast(string)read(args[1]));

    /**
     * Name output file the same as the input but with .com ending.
     */
    auto split = args[1].findSplit(".asm");
    auto outfile = split[0] ~ ".com";

    /**
     * Do the work.
     */
    assemble(lines, outfile);
}

You can compile this version of our assembler and write some 8080 assembly files using these instructions and watch your assembler produce a lot more interesting object code than just the nops we had before!
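To see what the assembler produced, it helps to look at the bytes in the output file. If you do not have a hex dump tool handy, a few lines of D will do. This is a small sketch of my own, separate from the assembler, and the hexDump name is made up for it.

```d
import std.file : read;
import std.format : format;
import std.stdio;

/// Hypothetical helper: formats bytes as two-digit hex values
/// separated by spaces.
string hexDump(const(ubyte)[] bytes)
{
    return format("%(%02x %)", bytes);
}

void main(string[] args)
{
    if (args.length != 2) {
        stderr.writeln("usage: hexdump file.com");
        return;
    }

    // Read the whole file and print its bytes on one line.
    writeln(hexDump(cast(const(ubyte)[]) read(args[1])));
}
```

For example, going by the formulas in this post, a three-line program of mov a, b followed by add c and hlt should come out as 78 81 76.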

Next time

Next up is teaching the assembler how to encode the fourth quarter of the opcode table. There is more variety of instructions in the first and fourth quarters compared to the second and third quarters. But after tomorrow, our assembler will be able to assemble interesting programs!
