Brian Robert Callahan

academic, developer, with an eye towards a brighter techno-social life



[prev]
[next]

2021-04-07
Demystifying programs that create programs, part 1: A disassembler

All source code for this blog post can be found here.

People appear to enjoy the blog posts about me porting different compilers to OpenBSD. What I would like to do for the next couple of posts beginning with this one is to take a step back and de-complexify these programs. Both the D compiler and the GNU Modula-2 compiler are highly complex pieces of software. But at their core they are the exact same thing: a program that can create programs. We need not explore something so complex in order to learn how to create a program of our own that creates programs. In this series of blog posts, we will create two programs that will help us demystify programs that create programs: first, in this blog post, we will create a disassembler, or a program that reads a program and produces a higher-level representation (assembly); second, in a couple of subsequent blog posts, we will create an assembler, a program that understands that higher-level assembly language and produces a program from it.

Enter the Zilog Z80

It is important to me that we choose a real CPU that you can really purchase today brand new. 8-bit CPUs are well-suited for this. Some people like the MOS Technology 6502, which powered a lot of computers from the 1980s, including the Apple II, Nintendo Famicom, and Commodore 64, among many others. I like the Zilog Z80, which was the CPU found in machines such as the Sinclair ZX Spectrum, Sega Master System, and (in a slightly modified form) Nintendo Game Boy. And like I mentioned, you can still purchase a Z80 of your own brand new. The Z80 also has the benefit of being binary compatible with the slightly earlier Intel 8080 CPU. This binary compatibility will make our lives much easier, as we will specifically target the Intel 8080 for our assembler and disassembler. And in doing so, we will know that our Z80 will understand what we mean. The Z80 is a superset of the Intel 8080 but we will not worry about the Z80 extensions, as they are not necessary for creating complex and interesting programs with our assembler.

Of course, you need not buy a Z80 in order to participate. There are many emulators for 8-bit CPUs out there. I like tnylpo for the Z80, as it gives you a complete CP/M environment as well. You can probably find a package of tnylpo in your favorite package manager.

The Intel 8080 instruction set

The Intel 8080 instruction set contains 256 instructions, though some are duplicates. We will avoid the duplicates in our assembler, since the Z80 uses those to implement its extensions. It turns out there are 245 non-duplicate instructions in the Intel 8080 instruction set. There is a handy website that provides a color-coded table of the entire instruction set. We will keep this website by our side as we code our assembler and disassembler.

An instruction can be one, two, or three bytes in size. It will always be the case when an instruction is greater than one byte, the additional bytes are a number written in little-endian format. Let's read an instruction cell from the website to learn how we can tell how many bytes each instruction is. Let's begin with the first instruction: nop. This instruction is encoded as 0x00, the y-axis gives us the number in the "10s" digit and the x-axis gives us the number in the "1s" digit. That cell gives us the following information:

   NOP
  1   4
- - - - -

NOP is the mnemonic that the Intel 8080 assembly language uses to represent the 0x00 instruction. The 1 represents the size of this instruction and the 4 represents the number of clock cycles the instruction takes to execute. We don't need to worry about clock cycles though it would be important to know if you were going to write an emulator. The five dashes on the bottom represent which of the bits of the CPU flags register get altered by this command, though again that would only be useful to us if we were writing an emulator.

What we need to do then to write our disassembler is to create a program that has all the mnemonics and instruction sizes, and maps them to the appropriate byte.

Writing our disassembler: Selecting an implementation language

As assembly language is only a high-level language in comparison to raw bytes, we definitely do not want to write our programs themselves in assembly, though that was once the reality for programmers. Let's choose a high-level language to implement our programs in. I am going to choose D. We could choose C, but I think D has some extra facilities on top of C that will make our lives easier, especially when we write the assembler. My goal isn't necessarily to create a D tutorial out of this, though I suppose some of that will be inevitable. That's fine. Any high-level language would do here and I had to pick one, so I'm choosing D.

Some boilerplate

Let's set up the basics for our disassembler. We can probably get away with a simple program with a single main function and a struct that represents the instruction set table. Let's start with some D main function boilerplate:

import std.stdio;
import std.file;

void main(string[] args)
{
    if (args.length != 2) {
        stderr.writeln("usage: d80 file.com");
        return;
    }

    ubyte[] program = cast(ubyte[])read(args[1]);
}

In D, main functions are void unlike in C, where the main function is an int. In practice, this does not matter for us. Also unlike C, the main function in D takes zero or one argument, and if it takes one argument, that argument is an array of strings. We want to begin our program by ensuring that the user has provided exactly one program to disassemble, in other words, we have an array with two items (the first item being the program name itself). If we do not, let's give the user a hint towards getting it right next time and then exit, since we may not be able to do any meaningful work.

If the user did invoke the disassembler correctly, we will move straight into reading in the program into memory. Programs for the Z80 are pretty small, with 64 KB being the maximum size. As we are most likely on a 64-bit system (or at worst, a 32-bit system), we have more than enough space in memory to hold even the largest Z80 program with ease.

We use the read function from the D standard library to do this. According to the documentation, the read function returns a void[] and gdc, the D compiler I am using, did not know how to automatically convert that into a ubyte[] so I added the cast to tell the compiler what to do. There may be a more idiomatic D way of doing this, but I like the cast solution. Perhaps my C coder is showing.

Why did I choose a ubyte[] to store our program in? Well, let's think about what a program really is. A program is just a sequence of bytes that when interpreted by the CPU performs the actions we want to undertake. We are going to do that very same work with our disassembler, except we will stop short of performing the actions requested and instead output the assembly language representation of those bytes.

Disassembler logic

Let's add the disassembler logic to our program now:

import std.stdio;
import std.file;

void main(string[] args)
{
    /* Make sure the user gave us exactly one file to read.  */
    if (args.length != 2) {
        stderr.writeln("usage: d80 file.com");
        return;
    }

    ubyte[] program = cast(ubyte[])read(args[1]);

    auto addr = 0;
    while (addr < program.length) {
        writef("%04x\t%s", addr, insn[program[addr]].mnemonic);
        if (insn[program[addr]].size > 1) {
            if (insn[program[addr]].size == 3)
                writef("%02x", program[addr + 2]);
            writef("%02xh", program[addr + 1]);
        }
        writeln();

        addr += insn[program[addr]].size;
    }
}

Let's take a look at our logic. We keep track of where we are in the program with the variable addr, which represents the current address of the instruction we are on. As long as we have more to disassemble, we do the following:

As the Z80 is a little endian CPU, it is important that if our instruction is three bytes long, we print out the third byte before the second byte, since that will "humanize" the number into the big-endian format that we humans "natively" understand. I also stick an h at the end of the number, since that reminds us that it is a hexadecimal number.

All we have left to do now is take the information from the instruction set website and teach it to our program.

Instruction table

I've gone ahead and written out the table. But you may want to try it out for yourself and see if you can translate only the needed information from the website to your disassembler. In any event, here is our completed program, now with the instruction table:

import std.stdio;
import std.file;

/**
 * Strings and instruction sizes for Intel 8080.
 */
struct i80 {
    string mnemonic;    /// Instruction name
    ubyte size;         /// Size of instruction in bytes
};

/**
 * Table of Intel 8080 instructions.
 */
immutable i80[] insn = [
    { "nop",       1 },
    { "lxi\tb, ",  3 },
    { "stax\tb",   1 },
    { "inx\tb",    1 },
    { "inr\tb",    1 },
    { "dcr\tb",    1 },
    { "mvi\tb, ",  2 },
    { "rlc",       1 },
    { "nop",       1 },
    { "dad",       1 },
    { "ldax\tb",   1 },
    { "dcx\tb",    1 },
    { "inr\tc",    1 },
    { "dcr\tc",    1 },
    { "mvi\tc, ",  2 },
    { "rrc",       1 },
    { "nop",       1 },
    { "lxi\td, ",  3 },
    { "stax\td",   1 },
    { "inx\td",    1 },
    { "inr\td",    1 },
    { "dcr\td",    1 },
    { "mvi\td, ",  2 },
    { "ral",       1 },
    { "nop",       1 },
    { "dad\td",    1 },
    { "ldax\td",   1 },
    { "dcx\td" ,   1 },
    { "inr\te",    1 },
    { "dcr\te",    1 },
    { "mvi\te, ",  2 },
    { "rar",       1 },
    { "nop",       1 },
    { "lxi\th, ",  3 },
    { "shld\t",    3 },
    { "inx\th",    1 },
    { "inr\th",    1 },
    { "dcr\th",    1 },
    { "mvi\th, ",  2 },
    { "daa",       1 },
    { "nop",       1 },
    { "dad\th",    1 },
    { "lhld\t",    3 },
    { "dcx\th",    1 },
    { "inr\tl",    1 },
    { "dcr\tl",    1 },
    { "mvi\tl, ",  2 },
    { "cma",       1 },
    { "nop",       1 },
    { "lxi\tsp, ", 3 },
    { "sta\t",     3 },
    { "inx\tsp",   1 },
    { "inr\tm",    1 },
    { "dcr\tm",    1 },
    { "mvi\tm, ",  2 },
    { "stc",       1 },
    { "nop",       1 },
    { "dad\tsp",   1 },
    { "lda\t",     3 },
    { "dcx\tsp",   1 },
    { "inr\ta",    1 },
    { "dcr\ta",    1 },
    { "mvi\ta, ",  2 },
    { "cmc",       1 },
    { "mov\tb, b", 1 },
    { "mov\tb, c", 1 },
    { "mov\tb, d", 1 },
    { "mov\tb, e", 1 },
    { "mov\tb, h", 1 },
    { "mov\tb, l", 1 },
    { "mov\tb, m", 1 },
    { "mov\tb, a", 1 },
    { "mov\tc, b", 1 },
    { "mov\tc, c", 1 },
    { "mov\tc, d", 1 },
    { "mov\tc, e", 1 },
    { "mov\tc, h", 1 },
    { "mov\tc, l", 1 },
    { "mov\tc, m", 1 },
    { "mov\tc, a", 1 },
    { "mov\td, b", 1 },
    { "mov\td, c", 1 },
    { "mov\td, d", 1 },
    { "mov\td, e", 1 },
    { "mov\td, h", 1 },
    { "mov\td, l", 1 },
    { "mov\td, m", 1 },
    { "mov\td, a", 1 },
    { "mov\te, b", 1 },
    { "mov\te, c", 1 },
    { "mov\te, d", 1 },
    { "mov\te, e", 1 },
    { "mov\te, h", 1 },
    { "mov\te, l", 1 },
    { "mov\te, m", 1 },
    { "mov\te, a", 1 },
    { "mov\th, b", 1 },
    { "mov\th, c", 1 },
    { "mov\th, d", 1 },
    { "mov\th, e", 1 },
    { "mov\th, h", 1 },
    { "mov\th, l", 1 },
    { "mov\th, m", 1 },
    { "mov\th, a", 1 },
    { "mov\tl, b", 1 },
    { "mov\tl, c", 1 },
    { "mov\tl, d", 1 },
    { "mov\tl, e", 1 },
    { "mov\tl, h", 1 },
    { "mov\tl, l", 1 },
    { "mov\tl, m", 1 },
    { "mov\tl, a", 1 },
    { "mov\tm, b", 1 },
    { "mov\tm, c", 1 },
    { "mov\tm, d", 1 },
    { "mov\tm, e", 1 },
    { "mov\tm, h", 1 },
    { "mov\tm, l", 1 },
    { "hlt",       1 },
    { "mov\tm, a", 1 },
    { "mov\ta, b", 1 },
    { "mov\ta, c", 1 },
    { "mov\ta, d", 1 },
    { "mov\ta, e", 1 },
    { "mov\ta, h", 1 },
    { "mov\ta, l", 1 },
    { "mov\ta, m", 1 },
    { "mov\ta, a", 1 },
    { "add\tb",    1 },
    { "add\tc",    1 },
    { "add\td",    1 },
    { "add\te",    1 },
    { "add\th",    1 },
    { "add\tl",    1 },
    { "add\tm",    1 },
    { "add\ta",    1 },
    { "adc\tb",    1 },
    { "adc\tc",    1 },
    { "adc\td",    1 },
    { "adc\te",    1 },
    { "adc\th",    1 },
    { "adc\tl",    1 },
    { "adc\tm",    1 },
    { "adc\ta",    1 },
    { "sub\tb",    1 },
    { "sub\tc",    1 },
    { "sub\td",    1 },
    { "sub\te",    1 },
    { "sub\th",    1 },
    { "sub\tl",    1 },
    { "sub\tm",    1 },
    { "sub\ta",    1 },
    { "sbb\tb",    1 },
    { "sbb\tc",    1 },
    { "sbb\td",    1 },
    { "sbb\te",    1 },
    { "sbb\th",    1 },
    { "sbb\tl",    1 },
    { "sbb\tm",    1 },
    { "sbb\ta",    1 },
    { "ana\tb",    1 },
    { "ana\tc",    1 },
    { "ana\td",    1 },
    { "ana\te",    1 },
    { "ana\th",    1 },
    { "ana\tl",    1 },
    { "ana\tm",    1 },
    { "ana\ta",    1 },
    { "xra\tb",    1 },
    { "xra\tc",    1 },
    { "xra\td",    1 },
    { "xra\te",    1 },
    { "xra\th",    1 },
    { "xra\tl",    1 },
    { "xra\tm",    1 },
    { "xra\ta",    1 },
    { "ora\tb",    1 },
    { "ora\tc",    1 },
    { "ora\td",    1 },
    { "ora\te",    1 },
    { "ora\th",    1 },
    { "ora\tl",    1 },
    { "ora\tm",    1 },
    { "ora\ta",    1 },
    { "cmp\tb",    1 },
    { "cmp\tc",    1 },
    { "cmp\td",    1 },
    { "cmp\te",    1 },
    { "cmp\th",    1 },
    { "cmp\tl",    1 },
    { "cmp\tm",    1 },
    { "cmp\ta",    1 },
    { "rnz",       1 },
    { "pop\tb",    1 },
    { "jnz\t",     3 },
    { "jmp\t",     3 },
    { "cnz\t",     3 },
    { "push\tb",   1 },
    { "adi\t",     2 },
    { "rst\t0",    1 },
    { "rz",        1 },
    { "ret",       1 },
    { "jz\t",      3 },
    { "jmp\t",     3 },
    { "cz\t",      3 },
    { "call\t",    3 },
    { "aci\t",     2 },
    { "rst\t1",    1 },
    { "rnc",       1 },
    { "pop\td",    1 },
    { "jnc\t",     3 },
    { "out\t",     2 },
    { "cnc\t",     3 },
    { "push\td",   1 },
    { "sui\t",     2 },
    { "rst\t2",    1 },
    { "rc",        1 },
    { "ret",       1 },
    { "jc\t",      3 },
    { "in\t",      2 },
    { "cc\t",      3 },
    { "call\t",    3 },
    { "sbi\t",     2 },
    { "rst\t3",    1 },
    { "rpo",       1 },
    { "pop\th",    1 },
    { "jpo\t",     3 },
    { "xthl",      1 },
    { "cpo\t",     3 },
    { "push\th",   1 },
    { "ani\t",     2 },
    { "rst\t4",    1 },
    { "rpe",       1 },
    { "pchl",      1 },
    { "jpe\t",     3 },
    { "xchg",      1 },
    { "cpe\t",     3 },
    { "call\t",    3 },
    { "xri\t",     2 },
    { "rst\t5",    1 },
    { "rp",        1 },
    { "pop\tpsw",  1 },
    { "jp\t",      3 },
    { "di",        1 },
    { "cp\t",      3 },
    { "push\tpsw", 1 },
    { "ori\t",     2 },
    { "rst\t6",    1 },
    { "rm",        1 },
    { "sphl",      1 },
    { "jm\t",      3 },
    { "ei",        1 },
    { "cm\t",      3 },
    { "call\t",    3 },
    { "cpi\t",     2 },
    { "rst\t7",    1 }
];

void main(string[] args)
{
    /* Make sure the user gave us exactly one file to read.  */
    if (args.length != 2) {
        stderr.writeln("usage: d80 file.com");
        return;
    }

    ubyte[] program = cast(ubyte[])read(args[1]);

    auto addr = 0;
    while (addr < program.length) {
        writef("%04x\t%s", addr, insn[program[addr]].mnemonic);
        if (insn[program[addr]].size > 1) {
            if (insn[program[addr]].size == 3)
                writef("%02x", program[addr + 2]);
            writef("%02xh", program[addr + 1]);
        }
        writeln();

        addr += insn[program[addr]].size;
    }
}

Remember that we only need the instruction mnemonic and size from the table. The other information, while useful if we were writing an emulator, is not useful for a disassembler, so I omitted it.

Our instruction table is an array of a simple struct that contains the string of the mnemonic to print and the instruction size so we can use that informaton in our disassembler logic. I decided to mark the array as immutable since it will never change during execution. If it does, I fear we may have bigger problems on our hands!

With each byte we read, our disassembler logic will take the current byte and go look up in the table which mnemonic corresponds to that byte and print it. And then it checks the size of that instruction and if it is greater than one, will print either the next byte or next two bytes (translated from little-endian order) depending on if the instruction size is two or three.

With that, our disassembler is complete. It can disassemble any valid Intel 8080 program, or any valid Z80 program that uses only the baseline Intel 8080 instructions. Here is a little program so that you can see the diassembler working.

Caveats

One thing our disassembler cannot do is understand the difference between instructions and data. That somewhat makes sense. The way our CP/M programs exist on disk strips them of that context. More complex file formats such as ELF have ways of retaining that contextual information. But not our CP/M programs. You'll notice in the example program there is a Hello World! string in there. You'll see it right away if you load the program into a hex editor. But our disassembler prints out what would be the corresponding instructions. We have to live with that limitation. However, this could lead to inaccurate disassembler output depending on how those data bytes are arranged.

Looking ahead

We could take this disassembler and extend it to understand all the Z80 instructions. I have in fact done that here if you want to see one potential solution to that problem. There is more logic and a lot more tables to fill in if you would like to try that. But the overall technique is exactly the same. There is just more of it.

In fact, this straightforward technique could be adapted to write a disassembler for (nearly) any other CPU in existence.

Unfortunately, going in the opposite direction, from assembly language to program, is not nearly as straightforward and simple as our disassembler. But that does not mean our assembler needs to be complex. There are strategies we can use to make our assembler effective and understandable for newcomers. We probably will not be able to cover the entire assembler in one blog post. Stay tuned as we continue to demystify programs that create programs.

Top

RSS