Brian Robert Callahan
academic, developer, with an eye towards a brighter techno-social life
All source code for this blog post can be found here.
When we wrote our assembler we made many design choices that helped us cut down on the complexity of the assembler itself. One of those decisions was to have the assembler directly output executables. That was a good choice in that it allowed us to quickly test as we were developing: once we implemented an opcode, we could write an assembly file with that opcode, run it through our assembler, and run the resulting program without needing to do any more processing.
One signficant downside to this approach however is that we are constrained to only writing single-file assembly programs. If we want to split our programs into multiple files, or incorporate already existing libraries into our code, our assembler cannot do that. If we want to do that we need to have our assembler produce relocatable object code: code that can live "on its own" in its own independent (though not directly executable) object file and later be combined with other such files to produce an executable.
In this blog post, we will write a linker for our own object code format that is capable of producing CP/M and MS-DOS COM
file format executables, for the Intel 8080 and Zilog Z80 on CP/M and the Intel 8086 on MS-DOS.
We have two potential approaches to creating this object code. One approach would be to learn the object file format of something that already exists and target that. In the CP/M days it was Microsoft who lead the way with object file formats, creating the REL file format for their M80 assembler and L80 linker. Digital Research, Inc., who in the early days of CP/M did not have a relocatable object format of their own, was ultimately forced to use Microsoft's REL file format for their late-to-the-party RMAC relocatable assembler.
If we chose the REL file format for our assembler then we would be able to tap into all the already extant libraries in that format.
The second approach would be to design our own relocatable object code format and write our own linker that understands this format. While we would not immediately have available all the libraries already in a pre-existing format, we would be able to control that format. Plus, it might be fun to design our own object file format and write our own linker. I'm sure we'd learn something useful.
Perhaps unsurprisingly from the title of this post, and the fact that I already told you we were going to be doing this, I chose the second option.
I am going to take at least one significant liberty here. For one, we are going to make the object format simple. For two, we are not going to worry about the size of object code files. Since I am going to write the linker in D, we only need to care that the final binary is the right size. If the object code files are inefficient and bloated, so be it.
What does our linker need to do? That will help us think about how to design our object file format.
Our linker needs to:
Thinking more about these requirements of linker leads me to believe that we have two things we are interested in: symbols and everything else. But that's not quite complete. We are really dividing our conception of symbols into two as well: symbol definitions and symbol references. That leaves us with three things we need to care about: symbol definitions, symbol references, and everything else. For now, let's label those three things. Let's call definitions type 1, references type 2, and everything else type 0. All our object code and libraries should consist of these three and only these three types.
This is what our object file format will look like:
\001
.\002
.\000.
Yes, the resulting object files will be very large in comparison to the final executables they produce, especially if you have long symbol names. But again, that's not at all meaningful to us since these tools will not themselves be running on any 16-bit machines.
For the future, let's also call the \000
, \001
, and \002
bytes as we've defined them in the list above our control codes. We will use these control codes when writing our linker.
The usual: the GNU D Compiler on OpenBSD. Using vim as the objectively correct editor.
l80
Since the object file format is so well-defined and simple, I ended up writing the linker before I tweaked the assembler. I chose the incredibly original name l80
for the linker. Obviously it is not the Microsoft linker. Our assembler is named a80
and our disassembler is named d80
, so I was just following the established theme.
The linker itself is exceedingly simple. Like our assembler, we require two passes. And it follows the same organizational reasons as the assembler: our first pass will do nothing but collect all the symbol definitions and store their names and their calculated addresses in a table, with the second pass writing out the final executable, replacing all the symbol references with the calculated address for that symbol that we can then look up in the table.
To keep the linker simple in its user interface, l80
is invoked as follows:
$ l80 binary file1.obj [file2.obj ...]
Where binary
is the name you want for the final executable, sans .com
extension (l80
adds the .com
suffix automatically for you) and file1.obj
and so on are the object files that comprise the executable. You may use .lib
libraries in the list of object files.
In the simplest, where you have only a single self-contained object file named hello.obj
and you'd like to create an executable file named hello.com
, invoking l80
would look like this:
$ l80 hello hello.obj
After checking to see that we have the appropriate number of arguments, we move straight into reading all the object files and libraries identified on the command line into one big buffer. Our foreach
loop begins at 2 because we don't want to read the linker itself or the output file (which doesn't exist yet) into the buffer. We only want to read the input object files into the buffer. As we might remember, the program itself is in the argument list as the first argument.
We might also remember the findSplit function from our assembler. Recall that it returns three strings: the left of the split point, the split point itself, and the right of the split point. As we are checking for the existence of the file extension, that is the split point so if it doesn't exist that means the file doesn't end with .obj
. We then check .lib
because we might have a library. If we have neither, then we error out.
Now we are ready to find all the symbol definitions in all of this object code. This is quite easy. All we need to do is check the control codes! For the first pass, if we see a control code of \000
, then we know this has nothing to do with symbols at all so we can skip this control code byte and the following byte. We do want to make sure to increment the address counter by one though, since we need to know how far into the final executable we are in order to keep track of what the addresses of our symbols should be. For good measure too, let's also check to see if our binary is too large for a COM
file.
If the control code is \002
, then we can skip ahead to the next occurance of \002
, since as we said in our object file format design, symbol references are encased in \002
. If somehow you never see another \002
, that's an error. We do want to make sure we increment the address counter by two though, since symbols reference addresses and all addresses are 16 bits. Again, if we end up with an address counter too large, that's an error.
However, if the control code is \001
, then we have some real work to do. We continue to read characters in until we see another \001
, since according to our object file format design, symbol definitions are encased in \001
. Like with the symbol references, if we never see a closing \001
, that's an error. Additionally, if there are two \001
control codes back to back, then we don't actually have a symbol definition here and that's an error.
For one final check, we need to make sure we have not seen this symbol name before. If we have, then we have multiple definitions of the same symbol and there is no way for the linker to know which definition the user wants. Unsurprisingly, that's an error.
Finally, if everything checks out, we use the same trick we used with our assembler to create our symbol table: we have an array of structs that each hold a single symbol name and its resolved address, which happens to be whatever the value of the address counter is at this point in time. We create a new struct for this symbol and append it to the array of structs.
If for some reason you read a control code other than \000
, \001
, or \002
, that's an error.
That takes care of the first pass.
The second pass is responsible for replacing symbol references with the address we calculated in the first pass and writing out the final binary. This time, we can ignore all \001
control codes, since we already handled them in the first pass.
If we see a \000
control code, we write the following byte to the output. That's all that needs to be done for each and every \000
control code.
Encountering a \002
control code is the one that requires some work during the second pass. We read in the symbol name being referenced, make sure it isn't an empty string, and then try to find the symbol in the table we created during the first pass. If we find the symbol in the table, we output its resolved address. If we do not find the symbol in the table, that's an error.
Finally, just as a sanity check that should never happen since we checked for it during the first pass, if you see a control code that isn't \000
, \001
, or \002
, that's an error.
Believe it or not, that's it. That's our entire linker. It truly is a linker everyone can understand.
Teaching our assembler to output our new object file format looks like this. In short, we need to detect when we are working with a symbol not set to a value using the equ
pseudo-op. For these symbols not set to a value, we wrap their name in \001
when the symbol is found acting as a label, since that's our symbol definition, and wrap their name in \002
when the symbol is found as an argument to an opcode, since that's our symbol references. Finally, we write a \000
in front of all remaining bytes, since those are our "everything else."
Honestly, with an object file format this simple, you could just cat
your object files together into a library. But that felt a little too awkward for me. So I wrote an archiver that... well, it concatenates its input files. It checks that each input file ends in .obj
and automatically appends the .lib
suffix to your library. Maybe that was worth the two minutes it took to write it.
Yes, it means that our assembler can no longer itself produce directly executable code. But it is also worth considering what we have gained now: any high-level compiler writer can target our a80
assembly language and be guaranteed that module compilation will just work, since our linker can handle multiple object files. They can also be guaranteed that libraries will just work, since l80
, along with ar80
, can create and link libraries created from object files.
Additionally, if someone wrote an 8086 assembler that produced object code in our format, then l80
will also just work to create MS-DOS COM
files.
One downside to our extreme simplicity is that future high-level language compiler writers will need to implement name mangling in the compiler for all non-global symbols. Otherwise, it might potentially be all too easy for the duplication of symbol names, which would cause linker errors. At least l80
can handle symbol names of arbitrary length.
Another downside is that you do have to arrange the object files in the command line in an order that makes sense for this linear approach. The first object file must be the entry point of the program. Different arrangements of object files may work for linking your executable but will produce an organizationally different executable. You may or may not care about this.
Besides an 8086 assembler targeting MS-DOS, there is one additional feature I think is necessary for our linker. Right now, the linker cannot determine if a symbol is unused in the final program and then remove that symbol from the executable. This can be done by tracking references during pass one and then skipping over all global symbols without references during pass two.
Linkers, like we saw with assemblers, do not need to be black boxes of magic that only the preordained can ever hope to understand. At least for simple executable formats such as CP/M and MS-DOS COM
files that only need symbol resolution, a linker of your own creation can be a nice afternoon project. While I didn't exactly time it, I am pretty sure it took me longer to write this blog post explaining the linker than it did for me to actually write the linker.
See if you can do better than me in creating your own object file format and creating your own linker for it.
It is certainly possible to implement our assembler and linker as native binaries as well, though we would have to devise new strategies for reading in and working with files, as storing everything in big buffers will not work on 16-bit systems. But porting our utilities to 16-bit systems can be done.
Finally, we're pretty close to having written a complete development environment. The last big tool we are missing, besides lots of high-level language compilers, is a debugger. A debugger presents some very interesting challenges, since we will need it to be a native binary and be a binary that can run and monitor other binaries. I suppose we will tackle it some day, though I don't know when.
If you're interested in linkers and are interested in reading about creating a production-quality ELF linker, check out Ian Lance Taylor's blog series on the GNU gold linker.