Brian Robert Callahan

academic, developer, with an eye towards a brighter techno-social life




2021-04-08
Demystifying programs that create programs, part 2: Starting an assembler

All source code for this blog post can be found here.

It's time to tackle the inverse of a disassembler. An assembler will take a lot more effort than our disassembler did, but I believe we are up for it. For today, let's sit down, plan our assembler, begin coding up some boilerplate, and see how far we get.

Recap

There are a couple of things to remember with our assembler.

First, we are going to produce an assembler for the Zilog Z80 CPU. But really, we are writing an assembler for the Intel 8080 CPU. We are benefitting from the fact that the Z80 is binary compatible with the 8080, so that means our assembler will produce machine code that the Z80 understands even though we will use the 8080 assembly language. While I am sure there will be those who disagree with me, I find the 8080 assembly language simpler to parse than the Z80 assembly language, and consequently it should be easier for us to use.

The Z80 has a different assembly language because Intel copyrighted the 8080 mnemonics.[1] But whether we use the 8080 or Z80 assembly language is more of an implementation quirk than a true concern for us. As we discussed while writing our disassembler, this does mean we will lose out on the Z80 extensions. I think that is fine. We can still write complex and interesting software with just the 8080 instruction set. Or, if you'd like to set an additional challenge for yourself, once we finish our assembler you can go back and devise your own syntax for the Z80 extensions and implement them in the assembler.

Second, we are not necessarily writing the assembler that someone who has decades of experience writing such tools would write. We probably will not even write the assembler you might write in undergrad if you took such a course. We are writing an assembler that someone who has little to no programming experience or knowledge can come to understand purely through engagement with the code itself and a series of blog posts. Clearly inspired by Jack Crenshaw's "Let's Build a Compiler" series, I also, as he so well put it, "intend to completely ignore the more theoretical aspects of the subject." Also like Crenshaw, we will learn by doing and we will gain some practical experience. What we will produce at the end is a correct, working assembler. That's good enough for me. Function over form.

We will write our assembler in D like we did with our disassembler. We will use a number of features of D to help make our task easier. While I am not trying to write a tutorial for D, it doesn't hurt to learn a new language if D is new to us.

Setting up our development environment

You can and should use whatever text editor or IDE you like. I am using vim, which is a vi implementation. I like that vim has a syntax highlighting mode for D so I am able to read my code a little easier. I am also using the GNU D Compiler as my compiler, but any D compiler would work. If you are on Windows, the Digital Mars D Compiler likely makes the most sense for you. Our assembler will run on any platform supported by D, so you should be able to follow along and code with us regardless of the platform you are on. I am, as always, coding on OpenBSD, but you don't have to.

Our assembler, like our disassembler, will be a single file. I am going to name the file with our code in it a80.d and our final executable a80, so all I have to do is run gdc -o a80 a80.d to compile. If you are using dmd, you will need to run dmd -of=a80 a80.d. If you choose a different name, remember to alter your compiler invocation appropriately. If you look at the code linked at the top of this page, you'll notice that it's in a form most easily usable by dub, the D package manager. That organization is not necessary for us. A single file will do. That file will eventually grow to about 1300 lines, so we are going to embark on a decently significant undertaking.

Planning our assembler

What exactly does an assembler do? It reads lines of assembly language and translates them into machine (sometimes called object) code. This is the sequence of bytes that the CPU understands. The assembler will also provide some helpful features that make writing assembly easier than writing machine code by hand. At the very least, those helpful features include: mnemonics so we do not have to remember the machine code, comments, and a label system to name constant values and important locations in our programs. Our assembler will also make it easy to write programs for the CP/M operating system, though it will also be able to write programs that do not rely on a hosted environment.

For example, if our assembler reads the line:

	nop		; do nothing

It should write out 0x00 to the output file. Likewise, if our assembler reads the line:

	jmp	0005h

It should write out 0xc3 0x05 0x00.
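To make those two examples concrete, here is a small sketch in D of how the bytes for jmp 0005h come together. The helper name encodeWithAddress is made up for this illustration and is not part of the assembler we will write; the one fact it relies on is that the 8080 stores 16-bit addresses low byte first.

```d
import std.stdio;

/// Hypothetical helper: a one-byte opcode followed by a 16-bit address,
/// low byte first (the 8080 is little-endian).
ubyte[] encodeWithAddress(ubyte opcode, ushort address)
{
    return [opcode, cast(ubyte)(address & 0xff), cast(ubyte)(address >> 8)];
}

void main()
{
    ubyte[] nop = [0x00];                          // nop
    ubyte[] jmp = encodeWithAddress(0xc3, 0x0005); // jmp 0005h

    writefln("%(%02x %)", nop); // prints "00"
    writefln("%(%02x %)", jmp); // prints "c3 05 00"
}
```

Notice that the address 0005h comes out as 05 00 and not 00 05. That byte swap is easy to forget when checking our output against a hex dump.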

Our assembler will write directly to finalized machine code. This means that we can go from assembly code to executable in one step. The CP/M systems originally broke it into two parts: an assembler that produced an intermediate code and then a linker that created the final binary. There are pros and cons to any setup, but our one-step process will have us seeing results we can run much faster so that's why I chose it.

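There are some minimum requirements I think would be good for us:

1. Produce correct machine code from correct assembly.
2. Reject incorrect assembly.
3. Support comments.
4. Provide the DB, EQU, and ORG facilities.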

The first two are truly important, as it would be difficult to argue we wrote an assembler if we cannot produce correct code from correct assembly and reject incorrect assembly. The third is effectively as important as the first two, since once you have even a moderately complex assembly program, you will want to have lots of comments to understand it. The fourth entry provides some good facilities to make our programs better: DB allows us to input any byte at any location so that we can create strings and EQU allows us to define named constants to use in place of numbers in our assembly. So if for example we need to use the same number a lot, say 0005h, we can use EQU to give it a memorable name and then use that name every time we would otherwise write 0005h. The ORG facility allows us to set the starting address for labels; with CP/M, it is nearly expected that the first line of an assembly program is ORG 100h as 0x100 is the entry point for standard CP/M executables, so we need that too.

Here is a list of things we should not worry about with our assembler:

- Whether it is perfect.
- Whether someone else likes it.
- Whether it is fast.

Who cares if it is perfect and if someone else likes it? It is going to fulfill its job of taking assembly code and producing the correct machine code translation. And let's also not worry so much if it is fast. In all likelihood, it will be "fast enough." What I mean by that is we already learned when writing our disassembler that the largest program we could possibly create for the Z80 is 64 KB. With our modern machines, it would be quite difficult to inadvertently make an assembler that wasn't really quick, probably nearly instantaneous, for the vast majority of assembly programs we could write. Let's worry about getting it right first. Then, if we want to make it faster, we can do that.

Let's also pull up the 8080 opcode table while we're here, since we will need it by our side while writing our assembler.

All good things start with a single function

We can borrow our main function from our disassembler with just a few additions:

import std.stdio;
import std.file;
import std.algorithm;
import std.string;

/**
 * All good things start with a single function.
 */
void main(string[] args)
{
    if (args.length != 2) {
        stderr.writeln("usage: a80 file.asm");
        return;
    }

    string[] lines = splitLines(cast(string)read(args[1]));

    auto split = args[1].findSplit(".asm");
    auto outfile = split[0] ~ ".com";

    assemble(lines, outfile);
}

We still want to error out if the user did not give us exactly one file to assemble. If you compare the line where we read in our input file with the analogous line from our disassembler, you'll notice that we add one trick: instead of reading the whole input file into one big array of bytes, we ask to split the input into lines and store those lines in an array of strings. To see why we might want this, let's take a moment to understand the structure of an assembly program.
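If you want to see what splitLines gives us without creating a file first, here is a minimal sketch. The source string is a stand-in I made up for the contents of an input file.

```d
import std.stdio;
import std.string : splitLines;

void main()
{
    // Hypothetical stand-in for the text we would read from a .asm file.
    string source = "\torg\t100h\nstart:\tnop\t; do nothing\n\tjmp\tstart\n";

    string[] lines = splitLines(source);

    // Each element is one line of assembly, without its line terminator.
    foreach (i, line; lines)
        writefln("%d: %s", i + 1, line);
}
```

This prints three numbered lines. The newline at the very end of the input does not produce an extra empty element.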

Assembly structure

In 8080 assembly language, each line holds at most one instruction. In languages like C, a statement is terminated by a semicolon and can therefore span multiple lines. That cannot happen in assembly, where the end of the line terminates the instruction. To be utmostly pedantic, a line of assembly has the following structure:

[label:] [op [arg1[, arg2]]] [; comment]

How should we read this? Let's start by saying that anything inside square brackets is optional. Hold on, I hear you saying, everything is in square brackets. You're right! A blank line is a valid line structure. Of course, it does nothing and outputs no machine code, but it is legal. There are three main blocks in the structure of a line of assembly, and the middle one seems a little complex. So let's start with the first and third block.

The first block, [label:] allows us to place a label on a line. When the assembler sees this, it will make a note of the name of the label and what address our program is currently at, and keep that information together. Somewhere else in the program, we could then reference that same label name and the assembler will know that what we really want is that address and perform a substitution for us.

When we declare a label, it must end with a :. When we reference the label, we do not use the colon.

This is really handy for example with loops: at the start of the loop we can have a label, let's say loop: and then at the end of the loop we can ask to go back to the beginning of the loop with something like jmp loop and the assembler will know to replace loop in the jmp statement with the address of the beginning of the loop. Neat!

As neat as that may be, it will soon present an interesting problem that we will have to overcome: what happens if we have the jmp loop reference before the loop: label is declared? We'll tackle just that problem in a bit, but let's keep it in the back of our minds for now.
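One plausible way to keep label names and addresses together is a D associative array. This is only a sketch to make the idea tangible: the names labels and address are assumptions for illustration, and the assembler we end up writing may well store this information differently.

```d
import std.stdio;

void main()
{
    // Hypothetical symbol table: label name -> address.
    ushort[string] labels;

    // Pretend the assembler is at address 0100h when it sees "loop:".
    ushort address = 0x0100;
    labels["loop"] = address;

    // Later, "jmp loop" can be resolved by looking the name up.
    if (auto p = "loop" in labels)
        writefln("loop is at %04xh", *p); // prints "loop is at 0100h"

    // A label we have not reached yet is simply not in the table.
    if ("done" !in labels)
        writeln("done is not known yet"); // prints "done is not known yet"
}
```

The second lookup is the forward reference problem in miniature: if jmp done appears before done: is declared, there is nothing in the table to substitute yet.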

The third block, [; comment] allows us to write comments. At any point in a line, we can write a semicolon and then everything from that point to the end of the line will be ignored by the assembler. I think it's effectively a requirement to have a way to write comments in assembly code and so for us the semicolon will be our comment character.

That leaves the middle block, [op [arg1[, arg2]]]. We can look at the opcode table as well as our disassembler to begin to figure out what we mean by this. If we have an opcode, it can take zero, one, or two arguments. If we have two arguments, they are separated by a comma. That matches what our disassembler outputs.

We can have any combination of those three blocks on a line, as long as they appear in this pre-defined order. That is to say, if there is a label it must always come first. If there is an op it must come before a comment.
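To check our understanding of the three blocks, here is a deliberately naive sketch that pulls one line apart in that order. It is not the parser we will write for the assembler; the Line struct and the splitLine function are names invented for this example, and it assumes a semicolon or colon never appears inside a quoted string.

```d
import std.stdio;
import std.algorithm.searching : findSplit;
import std.array : split;
import std.string : indexOfAny, strip;

/// Hypothetical container for the pieces of one line.
struct Line
{
    string label;
    string op;
    string[] args;
    string comment;
}

/// Naive splitter: does not handle ';' or ':' inside a quoted string.
Line splitLine(string text)
{
    Line result;

    // Third block: everything after the first semicolon is the comment.
    auto semi = text.findSplit(";");
    result.comment = semi[2].strip;
    string rest = semi[0].strip;

    // First block: a label is whatever comes before the first colon.
    auto colon = rest.findSplit(":");
    if (colon[1].length != 0) {
        result.label = colon[0].strip;
        rest = colon[2].strip;
    }

    // A blank line, or a line with only a label or comment, ends here.
    if (rest.length == 0)
        return result;

    // Middle block: the first word is the op, the rest are the arguments.
    auto ws = rest.indexOfAny(" \t");
    if (ws == -1) {
        result.op = rest;
        return result;
    }
    result.op = rest[0 .. ws];
    foreach (arg; rest[ws .. $].split(","))
        result.args ~= arg.strip;

    return result;
}

void main()
{
    auto l = splitLine("loop:\tmov\ta, b\t; copy b into a");
    writeln(l.label);   // prints "loop"
    writeln(l.op);      // prints "mov"
    writeln(l.args);    // prints ["a", "b"]
    writeln(l.comment); // prints "copy b into a"
}
```

A blank line comes back with every field empty, which matches the observation that a line with nothing on it is still a legal line.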

Finishing our main function

Now armed with knowledge of the structure of 8080 assembly language, let's take a look at the last three lines of the main function. We do a fancy renaming of our input file to an output file: we replace the .asm with .com. Be careful: this code keeps everything before the first instance of .asm and appends .com, so in theory it could produce an unexpected file name if you had multiple instances of .asm in the input file name, or none at all. In practice it's good enough for me. We make this happen by using findSplit to find the first occurrence of .asm in the string. This is an interesting facility in that it gives you back three strings: the first is everything before the split point, the second is the split point itself, and the third is everything after the split point. If there is only one .asm in our input file name, then the first string will be everything before .asm, which we then concatenate with .com.

By the way, that ~ operator is the string concatenation operator. We're going to use this facility of D quite a bit.
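Here is a short sketch you can run to watch findSplit and ~ work together, including the odd file names mentioned above. The outputName helper is a name I made up to wrap the same two lines from main.

```d
import std.stdio;
import std.algorithm.searching : findSplit;

/// Hypothetical helper wrapping the renaming done in main.
string outputName(string input)
{
    auto split = input.findSplit(".asm");
    return split[0] ~ ".com";
}

void main()
{
    auto split = "hello.asm".findSplit(".asm");
    writeln(split[0]); // prints "hello"
    writeln(split[1]); // prints ".asm"
    writeln(split[2]); // prints an empty line

    writeln(outputName("hello.asm"));      // prints "hello.com"
    writeln(outputName("my.asm.old.asm")); // prints "my.com"
    writeln(outputName("hello.s"));        // prints "hello.s.com"
}
```

The last case works because findSplit hands back the whole string as the first part when it cannot find .asm at all, so .com simply gets appended.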

Finally, we call a function named assemble that begins the real work of assembling our code into binary. We give that function our array of strings and our output file name.

Starting the real work... in the next episode!

This is a good stopping point for today. Let's make sure we understand our main function, accepting that we haven't written the assemble function yet so this won't compile as-is. Let's also make sure we understand the structure of the 8080 assembly language and the requirements we set for our finished assembler.

Next time, we will begin to write our assemble function, deal with error handling, create a bunch of variables that we will need, and finally discuss the notion of passes in an assembler.

[1] "8080A/ 8-Bit N-Channel Microprocessor". Intel Component Data Catalog 1978. Santa Clara, CA: Intel Corporation. 1978. pp. 11–17. "All mnemonics copyright Intel Corporation 1977"
