Brian Robert Callahan

academic, developer, with an eye towards a brighter techno-social life




2021-06-06
Introducing 8088ify: The CP/M to MS-DOS assembly translator

Today I cut the first release of 8088ify, a program that translates Intel 8080 CP/M assembly language to Intel 8086 (8088) MS-DOS assembly language.

While we already wrote an assembler together, that assembler targeted the 8080 CPU. We could run our 8080 programs on our machines, but only inside an emulator that implemented a Zilog Z80 and all the supporting hardware in software. Now, CP/M is not a particularly complex operating system and the 8080 is not a particularly complex CPU, and our modern CPUs are thousands of times faster, so any programs we wrote and assembled with our assembler would run effectively at full speed in an emulator. Another consideration is that because our assembler was written in D, it is a cross assembler. That is to say, the assembler runs on a platform other than the one it outputs programs for. In our case, our assembler runs on any platform that D supports, which includes our OpenBSD development platform. But not CP/M itself.

Unless you happen to own an NEC V20, an emulator is the best we can do to run the programs assembled with our assembler on an x86 machine. And that's fine if that's what we want. But what if we wanted to run our programs natively on our x86 machines? We would need another tool, likely an assembler for x86 and x86_64 machines like nasm. But let's say we've written a lot of fun 8080 programs with our assembler and while we want to run those programs natively on our machine, we do not want to invest the time to manually read through our assembly line by line and rewrite the program in 8086 mnemonics.

It turns out this problem really did exist during the transition from 8080 to 8086 machines in the early 1980s. And we do not need to undergo the tedious work of hand translating 8080 to 8086 assembly line by line. We can use the skills we have already developed when we wrote our assembler to write a new program: a transpiler. A transpiler, also called a source-to-source translator, is exactly what it sounds like: it reads in source code in one language and outputs equivalent source code in another language. Let's do that. Let's write an 8080 assembly to 8086 assembly transpiler.

An anticipated problem

Intel anticipated the need to easily translate from 8080 assembly to 8086 assembly and had that in mind during the 8086 design process. Intel also produced documentation including 8080 to 8086 opcode translation tables. There are even instructions such as lahf and sahf that seem purpose-built for just this translation.

Indeed, companies produced 8080 to 8086 translators. Digital Research, Inc. created a translator named XLT86 that translated CP/M-80 assembly to CP/M-86 assembly. Intel produced their own tool named CONV86 for translation, though anecdotal evidence suggests CONV86 was not so good.

Even Microsoft had their own translator. These days, it is open source and available in the MS-DOS repository on GitHub.

I am sure there were other 8080 to 8086 translator tools as well.

Designing and shortcuts

Now that we have a sense of history, let's plan out our translator. I want the following things for our translator:

I think that is a good set of requirements. One of the things I want to do is submit our translator to PCjam 2021, and its rules state that the program ought (but is not required) to run on a real IBM PC. That is why the third, fourth, and fifth points are on the list. The first two requirements focus on the user experience. We are expecting this tool to have been used by home computer users and companies that provided software to home computer users: they should be able to quickly and easily take all their CP/M software and produce MS-DOS software. The final two optional requirements would be nice to have.

To fulfill all these requirements, I chose C as the programming language to write our translator in. I chose nasm as the assembler to target with our translator. While nasm has not made any 16-bit DOS releases in quite some time, they were still hosting their 16-bit DOS binaries on Sourceforge. I downloaded the binaries for nasm-0.98.31. After testing to make sure they did in fact run on an 8086 machine, I placed the nasm binaries in the translator's repository so that users would have everything they need to produce binaries.

I also chose to take one very significant shortcut for our translator: our translator will assume all 8080 assembly presented to it is valid. Again, we are thinking about the homebrewer or company that already has working software and just needs to get up and running on their new CPU. Invalid 8080 assembly will translate to something and it will be the responsibility of the programmer to figure out what that invalid 8080 assembly should be. We will guarantee that all valid 8080 assembly will produce valid 8086 assembly. We are in effect writing a glorified pattern matcher since we do not perform any syntactic or semantic analysis of the 8080 assembly. I suppose we could have written the translator in another language that is designed for pattern matching but I don't know of any that run on CP/M.

Setting up the developer environment

We will use the usual suspects: OpenBSD and clang for Unix development and writing the program, Open Watcom v2 on DOSBox-X for MS-DOS development, and the Amsterdam Compiler Kit for cross compiling to CP/M from OpenBSD. I already had my development environment set up from the previous blog post.

Writing the translator

The flow of work for the translator looks like this:

+-------+
|       |
| Start |
|       |
+-------+
    |
    +--+
       |
       V
+--------------+       +----------------+       +--------------+
| Read in line |       |                |       | Translate to |
|              |------>| Parse 8080 asm |------>|              |
| of 8080 asm  |       |                |       | 8086 asm     |
+--------------+       +----------------+       +--------------+
       ^                                               |
       |                                               |
       |                                               |
       |                                               V
       |                                          -----------
       |                    No                   /           \
       +----------------------------------------| End of asm? |
                                                 \           /
                                                  -----------
                                                       |
                                                       | Yes
                                                       |
                                                       V
                                                    +-----+
                                                    |     |
                                                    | End |
                                                    |     |
                                                    +-----+

We can easily see that there are only three real steps in our translator: we read in a line of assembly, we parse that line into its constituent parts, and then we look up the translation and output that translation. No need to make life difficult for us. We should already know how to parse lines from our assembler. We can take that nearly wholesale, at least after manual translation from D to C. Reading in lines unfortunately is a bit more complex than the one-liner it is in D. But it is not too difficult.
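To make the parsing step concrete, here is a minimal sketch of splitting one line into its parts. It is not the code from 8088ify or a80: the struct asm_line type, the FIELD_MAX size, and the function names are all hypothetical, and it assumes labels end with a colon and comments start at the first semicolon.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define FIELD_MAX 256   /* hypothetical field size, large enough for one line */

/* The constituent parts of one line of assembly. */
struct asm_line {
    char label[FIELD_MAX];
    char op[FIELD_MAX];
    char args[FIELD_MAX];
    char comment[FIELD_MAX];
};

/* Copy n characters from src into dst, truncating to fit, and terminate. */
void copy_field(char *dst, const char *src, size_t n)
{
    if (n >= FIELD_MAX)
        n = FIELD_MAX - 1;
    memcpy(dst, src, n);
    dst[n] = '\0';
}

/*
 * Split "label: op args ; comment" into its parts. Every part is optional.
 * Simplification: a ';' inside a quoted string is treated as a comment start.
 */
void parse_line(const char *s, struct asm_line *out)
{
    const char *p, *end;

    out->label[0] = out->op[0] = out->args[0] = out->comment[0] = '\0';

    /* The comment runs from the first ';' to the end of the line. */
    p = strchr(s, ';');
    end = p ? p : s + strlen(s);
    if (p)
        copy_field(out->comment, p, strlen(p));

    while (s < end && isspace((unsigned char)*s))
        s++;

    /* Scan the first word; if it ends in ':' it is a label. */
    p = s;
    while (p < end && !isspace((unsigned char)*p) && *p != ':')
        p++;
    if (p < end && *p == ':') {
        copy_field(out->label, s, (size_t)(p - s));
        s = p + 1;
        while (s < end && isspace((unsigned char)*s))
            s++;
        p = s;
        while (p < end && !isspace((unsigned char)*p))
            p++;
    }
    copy_field(out->op, s, (size_t)(p - s));

    /* Whatever is left, with surrounding whitespace trimmed, is the operands. */
    s = p;
    while (s < end && isspace((unsigned char)*s))
        s++;
    while (end > s && isspace((unsigned char)end[-1]))
        end--;
    copy_field(out->args, s, (size_t)(end - s));
}
```

Calling parse_line on a line such as "loop:  mvi a, 5 ; load" fills in the label loop, the opcode mvi, the operands a, 5, and the comment ; load. Because the translator assumes valid input, the sketch does no error checking at all.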

Because I did not know how well or poorly MS-DOS and CP/M would deal with malloc'ing and free'ing memory, I went forward with all hardcoded buffers for holding strings. Theoretically, one could write a line too large for our translator to read in. But if that is true, then it will also be too big for CONV86, as I picked buffer sizes significantly larger than Intel's. Our translator will read up to 255 characters per line and convert whatever is there in those 255 or fewer characters. If there is a line longer than 255 characters, it is the responsibility of the programmer to fix it themselves.
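A line reader built on a fixed buffer might look something like the sketch below. The names read_line and LINE_MAX_CHARS are mine, not 8088ify's, and dropping the characters past the limit is an assumption about how an overlong line gets handled.

```c
#include <stdio.h>

#define LINE_MAX_CHARS 255  /* per the text: at most 255 characters per line */

/*
 * Read one line from fp into buf, which must hold LINE_MAX_CHARS + 1 bytes.
 * Characters past the limit are read and dropped (an assumption of this
 * sketch). The line ending is not stored. Returns 1 if a line was read and
 * 0 at end of input.
 */
int read_line(FILE *fp, char buf[LINE_MAX_CHARS + 1])
{
    int c, n = 0, got = 0;

    while ((c = fgetc(fp)) != EOF) {
        got = 1;
        if (c == '\n')
            break;
        if (c == '\r')          /* skip the CR of a CR/LF line ending */
            continue;
        if (n < LINE_MAX_CHARS)
            buf[n++] = (char)c;
    }
    buf[n] = '\0';

    return got;
}
```

The caller declares the buffer once, for example char line[LINE_MAX_CHARS + 1];, and reuses it for every line, so no call to malloc or free is ever needed.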

Sure, I am technically "wasting" space with these hardcoded buffers. But on CP/M and MS-DOS we do not have shared libraries so it could very easily be the case that the code for malloc and free is larger than the size of our buffers. In any case, our translator is just under 19 KB in size when compiled with Open Watcom v2 so I am not very worried about running out of memory.

Once parsed, we compare the opcode we read on this line with a list of 8080 opcodes. Once we find our string match, we execute a function of the same name that mechanically translates this 8080 opcode to its 8086 equivalent. That's it. That's all we need to do. It's surprisingly simple. But I like surprisingly simple. I used both the Intel and DRI translation tables to write the translation functions. For added niceties, we will also print all the comments in the 8080 source code in with their 8086 translations. While comments might not make complete sense for an 8086, it helps the programmer know where they are in the new assembly code.
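The lookup can be sketched as a table of opcode names paired with function pointers. This is an illustration rather than the code in 8088ify: the tr_ function names, the optab table, and the translate function are hypothetical, and only four operand-free opcodes are shown. The translations use the usual Intel register mapping, where A becomes AL, HL becomes BX, and DE becomes DX.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Each translation function writes the 8086 text for one 8080 opcode. */
typedef void (*translate_fn)(char *out, size_t outsz);

void tr_nop(char *out, size_t n)  { snprintf(out, n, "nop"); }
void tr_xchg(char *out, size_t n) { snprintf(out, n, "xchg bx, dx"); } /* HL <-> DE */
void tr_cma(char *out, size_t n)  { snprintf(out, n, "not al"); }      /* complement A */
void tr_pchl(char *out, size_t n) { snprintf(out, n, "jmp bx"); }      /* jump to HL */

/* The list of 8080 opcodes we know, with the function for each. */
const struct {
    const char *name;
    translate_fn fn;
} optab[] = {
    { "nop",  tr_nop  },
    { "xchg", tr_xchg },
    { "cma",  tr_cma  },
    { "pchl", tr_pchl },
};

/* Compare two words, ignoring case, since 8080 source is often uppercase. */
int same_word(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;
}

/*
 * Find op in the table and run its translation function.
 * Returns 1 on a match and 0 if the opcode is not in the table.
 */
int translate(const char *op, char *out, size_t outsz)
{
    size_t i;

    for (i = 0; i < sizeof optab / sizeof optab[0]; i++) {
        if (same_word(op, optab[i].name)) {
            optab[i].fn(out, outsz);
            return 1;
        }
    }

    return 0;
}
```

A real table covers every 8080 opcode, and the functions for opcodes such as mov and mvi also need the parsed operands. Adding an opcode means writing one small function and adding one table row, which keeps the whole thing mechanical.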

Special cases

There are a few special cases we need to look out for though. In order to properly translate from CP/M to MS-DOS we need to look out for CP/M calls to 0005h, which is the BDOS entry point. We can think of these as system calls. Fortunately, MS-DOS has direct compatibility with CP/M here: many of the BDOS calls have the same number as the MS-DOS system calls. It may not always work so our translator will insert a comment immediately after every call and jmp instruction alerting the user to double check that the call or jmp is correct. We do the same for calls to 0000h, which, along with rst 0, which we also special case, is how a CP/M program is terminated.
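Spotting these special targets could be sketched as below. The function names and the enum are hypothetical, and the sketch only classifies the operand of a call. It makes no claim about the instructions 8088ify emits for each case.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

enum call_kind { CALL_OTHER, CALL_BDOS, CALL_EXIT };

/*
 * Parse a plain numeric literal such as "5", "0005h" or "0005H".
 * Returns 1 and stores the value on success, or 0 if the text is not a
 * plain decimal or h-suffixed hexadecimal number (a label, for example).
 */
int parse_literal(const char *s, unsigned long *value)
{
    char tmp[32];
    char *endp;
    size_t len = strlen(s);
    int base = 10;

    if (len == 0 || len >= sizeof tmp || !isdigit((unsigned char)s[0]))
        return 0;
    memcpy(tmp, s, len + 1);
    if (tmp[len - 1] == 'h' || tmp[len - 1] == 'H') {
        tmp[--len] = '\0';      /* drop the suffix and read as hexadecimal */
        base = 16;
    }
    *value = strtoul(tmp, &endp, base);

    return *endp == '\0';
}

/* Decide whether a call target is the BDOS entry, the exit vector, or other. */
enum call_kind classify_call(const char *target)
{
    unsigned long addr;

    if (parse_literal(target, &addr)) {
        if (addr == 0x0005)
            return CALL_BDOS;
        if (addr == 0x0000)
            return CALL_EXIT;
    }

    return CALL_OTHER;
}
```

A pattern matcher like this cannot see through a label: if the source says bdos equ 0005h and then call bdos, the target is classified as CALL_OTHER. That is one reason a reminder comment after every call and jmp is a sensible default.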

With that, our translator is complete.

Testing the translator

Now that 8088ify was complete, I needed a test program to make sure that it was not just complete but correct. When writing a80, I found a test program that was designed to check that all of the 8080 opcodes worked correctly. That sounds perfect so I added it to the 8088ify repository. The test program is copyrighted but it is mentioned that the program was donated to a CP/M user group. If the copyright holder wants the program removed from my repository, please contact me and I will do so. I am hoping that this is an acceptable use of the test program.

Now, since the CP/M assembler assembly language and NASM assembly language differ slightly, we have to be on the lookout for any language features that cannot translate over (at least not using our pattern matching method). Just to check, I ran nasm on the 8086 translation unmodified. Sure enough, nasm did issue a few errors.

There were only two incompatibilities in the test program. First, the test program has a label named CPU but nasm appears to have a CPU reserved word. That's an easy fix: I changed the CPU label to eCPU, which is not a reserved word. Second, there is some arithmetic and logic performed on a label address in the second argument of some mvi instructions, which nasm does not support. Fortunately, this is also a very easy manual fix: the majority of these instructions are splitting a 16-bit address into its two 8-bit halves and then placing each half in the two halves of the 16-bit H register. That left two instructions where half of a 16-bit address was being placed in the A register. But I was able to solve that by turning them into mov al, b_ (one is bh and the other is bl) instructions.

Here is the 8086 assembly program immediately after translation and here is the finalized 8086 assembly program. The diff between the two is here and as we can see is quite minimal. That bodes well for the actual usability of 8088ify.

Now was the moment of truth: we have a binary. Time to run the program on MS-DOS and see what happens. I was truly amazed when the message CPU IS OPERATIONAL appeared on screen, suggesting that the test program ran successfully. Getting things right the first time never happens to me. To make sure it wasn't a fluke, I intentionally changed a value in the finished translation assembly code and reassembled it. This should cause a test to fail. Sure enough, a failure was reported with this intentionally broken program.

While I suppose I cannot say with 100% certainty that we are always doing the right thing all the time, it is close enough and good enough for me. We have a working translator.

Conclusion

I feel pretty confident in saying that if you write a program that targets our a80 assembler then it has a good chance of just working on 8086 after being run through 8088ify. Our translator is not perfect but as we saw, manual finishing can also be quick and easy.

The neat thing about our translator is that, if you are running an x86 or x86_64 machine, you can run your translated programs directly on your machine sans emulator. You would have to boot into MS-DOS or FreeDOS, but you could do it. Of course, emulation works fine for the translated programs as well.

If you use 8088ify and find any bugs, feel free to open an Issue (or Pull Request, if you have a diff) on GitHub.

Addendum

Thank you to whoever posted 8088ify to Hacker News. I don't have an account there and am always surprised when people think the things I do are worth sharing.
