Brian Robert Callahan
academic, developer, with an eye towards a brighter techno-social life
All source code for this blog post can be found here.
We wrote a Brainfuck compiler back in April. That one developed a simple architecture to support multiple backends. Today, I want to write a new Brainfuck compiler. One that can only run on a single system, OpenBSD/amd64. And one that is as small as I can possibly make it. I broke into the sub-256 bytes club with my compiler, so I thought it was worthwhile sharing and talking about how I was able to pull it off.
While this works on my machine, there may be subtle differences about how your machine works that cause this compiler not to work. Make sure to check your machine's behavior if there are any differences!
We are optimizing exclusively for size. That means that we may make decisions that are less performant than other ways of writing a Brainfuck compiler. That's OK.
The assembly code will be written using AT&T syntax. Last time I had a blog post with assembly code, someone complained that it was not Intel syntax. I am not much bothered by either and I teach my undergraduates both syntaxes because I think it is important to be able to read both. If you are an Intel syntax person, take this post as an opportunity to improve your AT&T syntax skills.
Let's quickly review the syntax of Brainfuck so we know what we need to implement. The table below presumes that we are translating Brainfuck to C, have a global array acting as the Brainfuck tape, and have a pointer, *p
, that is intialized to the beginning of the array.
Symbol | C equivalent |
---|---|
< |
--p; |
> |
++p; |
- |
--*p; |
+ |
++*p; |
, |
*p = getchar(); |
. |
putchar(*p); |
[ |
while (*p) { |
] |
} |
I think the only way we stand a chance at creating a compiler as small as possible is to use hand-optimized assembly. Those precious bytes aren't going to save themselves!
We will be using the GNU binutils for our assembler and the OpenBSD system linker, which is LLVM lld, for our linker. I am using the GNU assembler from ports, which is a much newer version of the assembler. This newer assembler is a little bit smarter when it comes to encoding and that ends up saving a byte for us down the line. I'll point it out when we get there.
Our compiler will compile Brainfuck to C. I chose C because it will allow us to have some of the smallest output code, and by extension the shortest strings needed to have built-in to our compiler. We will use the table above for our translation.
Our compiler therefore begins as follows:
.text .globl main main: .LSprologue: .ascii "a[65536],*p=a;main(){" # 21 .LS: .ascii "--p;" # 4 .ascii "++p;" # 4 .ascii "--*p;" # 5 .ascii "++*p;" # 5 .ascii "*p=getchar();" # 13 .ascii "putchar(*p); " # 13 .ascii "while(*p){" # 10 .ascii "return 0;" # 9 .ascii "}" # 1
The top is some boilerplate to inform the toolchain that we have a global symbol named main and it (and everything else, for that matter) lives in the text section of our binary.
Beginning with .LSprototype
is our strings. Labels beginning with .L
are considered local labels for ELF targets like my machine. After each string I put a comment with the number of characters in the string. We will use the OpenBSD kernel facilities for reading and writing, and the write(2)
system call has a parameter for number of bytes to write. Yes, we could write a strlen(3)
function, and in fact we did that for our echo program. However, writing such a function would take too many bytes compared to simply shoving the correct value into the register.
The .LSprologue
string is separate from the rest of the .LS
strings because the way I organized the compiler the prologue is used directly and for all other strings we index into the string starting at .LS
.
You'll also notice I am using the .ascii
pseudo-op rather than the .asciz
pseudo-op. This means that there is no NUL-termination of any of these strings, so they are not C strings. Since we are providing the exact number of bytes to write with each write(2)
call, we can get away with this and save a number of bytes in the process.
We are also leaving them in the .text
section. That technically means they are code. We could place them in the .rodata
section but since these string are never written to, I think it is fine.
We can also see some more byte-shaving at work. While we are compiling to C, we didn't specify which standard we were targeting. We are going very old by having implicit int
s everywhere. We are also going to be implicitly declaring getchar(3)
and putchar(3)
which is invalid in C99 and later, according to clang. So we will target C89 and/or K&R C.
As a result of the implicit int
s the cells on the Brainfuck tape are 32-bits in size. There is no requirement for any particular cell size, so this is acceptable from a Brainfuck perspective.
I also made the getchar and putchar strings equal length, even though the putchar string is actually one character shorter. We will spend a byte now to save several bytes later.
Our parser needs to read in the program one character at a time and process each character. If it is one of the Brainfuck characers, we output the appropriate translation, otherwise we move on to the next character. We are going to take advantage of the fact that we can read from stdin
and write to stdout
without having to set anything up outselves; these facilities are set up for us by our environment.
We will also need to look up the syscall numbers for read(2)
and write(2)
, which are 3 and 4.
We must read in a character before we can parse it and write anything. Our read can look like this:
.Lparse: movb $3, %al # Load read(2) system call movb $1, %dl # Read one character xorl %edi, %edi # Read from stdin leaq (%rsp), %rsi # Read into top of stack syscall # read(0, (%rsp), 1); incl %edi # Set %edi to 1, for write
We can take advantage of some assumptions on our system for this to work. On our system all general-purpose registers are initialized to 0 by the environment before program start. The movb
instructions require only 2 bytes but do not 0-out the remainder of the register. If there was something already in %rax
or %rdx
above the lowest byte, it would remain as-is. But since our environment set these registers to 0, we are OK. These movb
instructions are 2 bytes each.
We also chose to use xorl
instead of movb
for the %rdi
register. This is because you cannot use movb
on %rdi
since it is at its smallest a 16-bit register. Since we are making it 0, xor'ing any 32-bit register with itself is the smallest way to do this at 2 bytes.
The leaq
instruction is expensive at 4 bytes. It's actually one of the smallest versions of this instruction and we will see a 7-byte version later. Finally, we need to call the syscall for 2 bytes. That's 12 bytes total to get a character from stdin
.
We are also assuming that our environment set up some stack for us and that our stack pointer points to something reasonable. That happens to be true. As we learned before, the environment does contain our argc
and argv
variables and that is the case here too. So we have a reasonable stack pointer.
While it is not directly related to reading in a character, it turns out that %rdi
stays the same across read(2)
calls, so we can increment it to prepare for an upcoming write(2)
call. Incrementing is 2 bytes, which is the smallest way to turn a 0 into a 1 for us. Even in the case where we do not actually end up writing anything, for example because we read in a comment character, it does not much matter since we xor the register before the read(2)
call every time.
EOF
Now would be a good time to check if we are at the end of the file, since if we are we have finished our work and should exit the program. Fortunately, read(2)
returns the number of bytes read in %rax
. If that number is 1, then we are not yet finished. Otherwise, if it is 0 or -1 then we are finished. A 0 means end of file and a -1 means an error occured. We don't actually care if we reached end of file or encountered an error. Both of those mean it is time to exit for us.
Our machine uses the System V AMD64 ABI calling convention. This dictates how our registers are used and how registers are saved or not across function calls, including syscalls. For us, we need to know that %rax
will contain the return value of a call, so we should presume that it will change after each call. Additionally, %rax
contains the syscall number prior to making the syscall. The calling convention also says that the %rbx
, %rsp
, %rbp
, and %r12
-%r15
registers must be restored to their original value by the callee if they are used. That means as far as we are concerned, we can use those registers across calls and presume that their value will stay the same. Any other registers may be clobbered by calls. We must also respect the registers used for call parameters. Since read(2)
and write(2)
use three parameters, we must make sure to use %rdi
, %rsi
, and %rdx
for those parameters, like we did with our read(2)
call above.
It happens to be the case that that %rdi
and %rdx
do not get clobbered by read(2)
and write(2)
calls on our machine, which we will use to our advantage.
EOF
, part 2Now we can check for EOF
after each read:
cmpl %edx, %eax # EOF ? (%eax < 1) jl .Leof
That's all we need. Since %rdx
does not get clobbered on our system, we know it is 1. If %rax
is less than that, we know we either reached the end of the file or an error occured. In that case, we should jump to the code that exits the program.
Something that we need to know is that the smallest cmp
is when you compare a byte against %al
. But as of now, %rax
contains the return code of the read(2)
call. We stand to save eight bytes using the smallest cmp
. We will end up spending three bytes to save eight bytes, which is great.
We can make this happen by slightly tweaking our end of file check and then parsing our input after the end of file check:
xchg %ebp, %eax # Store return value in %ebp movb (%rsi), %al # cmpb imm, %al is the smallest cmp leal .LS, %esi # Preload string cmpl %edx, %ebp # EOF ? (%ebp < 1) jl .Leof cmpb $60, %al # '<' ? je .Lleft cmpb $62, %al # '>' ? je .Lright cmpb $45, %al # '-' ? je .Ldec cmpb $43, %al # '+' ? je .Linc cmpb $44, %al # ',' ? je .Lgetchar cmpb $46, %al # '.' ? je .Lputchar cmpb $91, %al # '[' ? je .Lopenloop cmpb $93, %al # ']' ? je .Lcloseloop jmp .Lparse # Comment character, skip
We can xchg
to get our smallest cmp
. This is the reason for the newer GNU assembler. The older assembler in base always encodes this to a 2-byte sequence but the newer assembler is smart enough to know this can be encoded as a 1-byte instruction and does so. We then put the character we read in into %rax
and preload the address of all our strings via a 7-byte leal
. That's very expensive, in fact it is the most expensive instruction we will encounter. But it is a necessary evil for us. Now we perform our end of file check and if we are not at the end of the file we continue on figuring out which character we have and what to do with it.
Let's leave the end of file processing for later and now tackle the Brainfuck symbol processing. There are eight symbols we need to handle:
.Lright: addl $4, %esi .Lleft: movb $4, %dl jmp .Lwrite
Let's take a look at just this for right now. This code handles both the <
and >
symbols. Recall that .LS
contains all our strings, prologue notwithstanding. The first character of .LS
corresponds to the <
symbol. We are calling that the .Lleft
label. That has an offset of 0 so we immediately move to telling write(2)
we want to write 4 characters and then head to the syscall. If we instead see a >
that's the .Lright
label and we need to offset the starting point of the string by 4. We then fall through because both outputs are 4 characters in length so we can combine the code to save bytes.
We will use these tricks a few more times. Let's continue writing the symbol processing code:
.Ldec: subl $5, %esi # 13 - 5 = 8 .Linc: addl $13, %esi movb $5, %dl jmp .Lwrite
Let's stop here too and take a look at one more trick. Both .Ldec
and .Linc
output strings that are the same length. We want to be able to combine the code like we did in the .Lleft
and .Lright
case. However, we don't have an index of 0 for either. We get around this by creatively subtracting the length of the string from the offset of the string farther from the beginning of .LS
to get the offset of the string closer to the beginning of .LS
and then fall through from there. These subtractions and additions, because they are within the range of -128 and 127 all encode to 2 bytes.
Let's continue on:
.Lgetchar: subl $13, %esi # 31 - 13 = 18 .Lputchar: addl $31, %esi movb $13, %dl jmp .Lwrite
Exactly the same tricks as .Linc
and .Ldec
. Note though that if I did not extend the putchar string by one character at the beginning, we could not use any of these tricks and it would cost us a whole lot more than the one byte it cost to elongate the putchar string.
Let's finish up our processing code:
.Lopenloop: incl %ebx # Increment loop counter addl $44, %esi movb $10, %dl jmp .Lwrite .Lcloseloop: decl %ebx # Decrement loop counter js .Lexit # %ebx < 0 ? (%rdi == 1 from the write(2) call) addl $63, %esi movb $1, %dl jmp .Lwrite
Unfortunately, we cannot use the fall through trick here but there are some other tricks we can use. We use the fact that all our general-purpose registers are initialized to 0 and the fact that %rbx
is saved across calls to just start using it effectively out of nowhere to act as our loop counter. The .Lopenloop
label is otherwise straightforward.
The .Lcloseloop
label has an additional trick. Because we want to have a good compiler that has proper return values, 0 for success and 1 for failure, we decrement the loop counter and check to see if it has fallen below 0. If it has, there is an imbalance of brackets and we should stop processing and error out. Unfortunately, dec
preserves the value of the carry flag, so we cannot use it to determine if we have gone below 0. However, it alters all other flags and we can use the sign flag to check if we have gone negative. The js
opcode jumps if the sign flag is set, which it only will be if we have gone negative. We can avoid a cmp
against -1 this way. Fortunately, %rdi
has already been set to 1 and is exactly the value we want for _exit(2)
when we error out so we do not have to make any adjustments to %rdi
. The rest is straightfoward.
We will use the fact that stdout
comes available to use from our environment without us having to do anything and as such we will write our compiled code to stdout
. We are also going to use another trick here: we will place the .Lwrite
label between the .Lparse
and .Lright
labels. This is because if you have a jmp
that is within the range of -128 and 127 bytes, that encodes to a short jump which is only 2 bytes. That is the smallest jmp
there is. We want all our jmp
s to be short jumps. Our .Lwrite
labels ends with a jmp
to .Lparse
in order to read in the next character and if it came after all the code we just wrote, it will be more than -128 bytes away.
Our .Lwrite
label is very short:
.Lwrite: movb $4, %al syscall # write(1, string, length); jmp .Lparse
That's it. All the writes have to put 4 into %rax
, make the syscall, and then read in the next character.
Now let's write the code to handle end of file and the code to handle exit, since we will use some fall through to handle them. Let's also put this new code immediately after .Lwrite
and before .Lright
, again to ensure all our jmp
s are short jumps.
It looks like this:
.Leof: cmpl %edx, %ebx # Loop counter < 1 ? (i.e., 0) jge .Lexit addl $54, %esi movb $10, %dl movb $4, %al syscall xorl %edi, %edi # Get ready to exit .Lexit: movb $1, %al # _exit(%edi); syscall
When we reach the end of the file, we need to make sure our loop counter is 0. If it is not, we have a bracket imbalance and should error out immediately. Like before, %rdi
happens to be 1 from the set up for the write(2)
call so we do not need to make any adjustments.
The remainder of the .Leof
label is writing the C epilogue. It then sets %rdi
to 0 because we are going to exit successfully and falls through to .Lexit
which sets the syscall number to the _exit(2)
syscall and calls it, which exits the program.
The last thing we need to write in order to finish our compiler is the prologue. This code goes immediately after the main
label and before the .Lparse
label.
We need to make a call to write(2)
with our .LSprologue
string:
popq %rdi # Write prologue (argc == 1, hopefully) leal .LSprologue, %esi movb $21, %dl jmp .Lwrite
Our final trick is immediately popping the top of the stack into %rdi
right at the start. As we learned before, our environment puts argc
in there. So long as argc
is 1, which it should be, that allows us to set up %rdi
correctly for this initial write(2)
call in just 1 byte. If not, well then your output will be incorrect. It is the one error that goes undetected by our compiler. We are going to accept that we have cut this corner.
And that's it! Our compiler is finished.
This is what our final completed compiler looks like:
.text .globl main main: popq %rdi # Write prologue (argc == 1, hopefully) leal .LSprologue, %esi movb $21, %dl jmp .Lwrite .Lparse: movb $3, %al # Load read(2) system call movb $1, %dl # Read one character xorl %edi, %edi # Read from stdin leaq (%rsp), %rsi # Read into top of stack syscall # read(0, (%rsp), 1); incl %edi # Set %edi to 1, for write xchg %ebp, %eax # Store return value in %ebp movb (%rsi), %al # cmpb imm, %al is the smallest cmp leal .LS, %esi # Preload string cmpl %edx, %ebp # EOF ? (%ebp < 1) jl .Leof cmpb $60, %al # '<' ? je .Lleft cmpb $62, %al # '>' ? je .Lright cmpb $45, %al # '-' ? je .Ldec cmpb $43, %al # '+' ? je .Linc cmpb $44, %al # ',' ? je .Lgetchar cmpb $46, %al # '.' ? je .Lputchar cmpb $91, %al # '[' ? je .Lopenloop cmpb $93, %al # ']' ? je .Lcloseloop jmp .Lparse # Comment character, skip .Lwrite: movb $4, %al syscall # write(1, string, length); jmp .Lparse .Leof: cmpl %edx, %ebx # Loop counter < 1 ? (i.e., 0) jge .Lexit addl $54, %esi movb $10, %dl movb $4, %al syscall xorl %edi, %edi # Get ready to exit .Lexit: movb $1, %al # _exit(%edi); syscall .Lright: addl $4, %esi .Lleft: movb $4, %dl jmp .Lwrite .Ldec: subl $5, %esi # 13 - 5 = 8 .Linc: addl $13, %esi movb $5, %dl jmp .Lwrite .Lgetchar: subl $13, %esi # 31 - 13 = 18 .Lputchar: addl $31, %esi movb $13, %dl jmp .Lwrite .Lopenloop: incl %ebx # Increment loop counter addl $44, %esi movb $10, %dl jmp .Lwrite .Lcloseloop: decl %ebx # Decrement loop counter js .Lexit # %ebx < 0 ? (%rdi == 1 from the write(2) call) addl $63, %esi movb $1, %dl jmp .Lwrite .LSprologue: .ascii "a[65536],*p=a;main(){" # 21 .LS: .ascii "--p;" # 4 .ascii "++p;" # 4 .ascii "--*p;" # 5 .ascii "++*p;" # 5 .ascii "*p=getchar();" # 13 .ascii "putchar(*p); " # 13 .ascii "while(*p){" # 10 .ascii "return 0;" # 9 .ascii "}" # 1
On OpenBSD, there is some mandatory overhead, 24 bytes worth. We will measure the size of our compiler three ways: by itself, with the OpenBSD overhead, and with all the ELF overhead.
The compiler by itself:
$ size bf256.o text data bss dec hex 231 0 0 231 e7
Only 231 bytes! Not bad. If I remember correctly, supposedly the original Brainfuck compiler was 240 bytes.
The compile with the OpenBSD overhead:
$ size bf256 text data bss dec hex 255 0 0 255 ff
So even if you insist that it is unfair that I excluded the OpenBSD overhead in my count, we still come in at only 255 bytes.
With all the ELF overhead:
$ ls -l bf256 -rwxr-xr-x 1 brian brian 952 Jul 10 19:38 bf256
Well, 952 bytes. OK, that's where the ELF overhead gets you. I don't want to get out a hex editor and start whittling down all that. I am going to say it is good enough and giving us all permission to ignore all the ELF overhead.
It is time for me to put away the text editor. My target was 256 bytes and I just scraped by. Can you find additional bytes to shave off?
I was able to save an additional jmp
and an extraneous movb $1, %dl
, saving an additional 4 bytes, which allowed me to add back one byte and change the initial popq %rdi
to an incl %edi
, guaranteeing that the initial prologue write will always work regardless of the number of arguments to the program and still save a byte. See it here.
I was able to save an additional byte by removing the extra space in the putchar string and using the last character in the getchar string to pad out the putchar string. As this is a ;
character, in C that becomes an empty statement, which is legal C. This gets us down to 944 bytes with all the ELF overhead. See it here.
I was able to save an additional 11 bytes by removing the epilogue string entirely and simply using the closeloop string for the epilogue. It seems that clang does not miss it. That gets us to 936 bytes with all the ELF overhead. See it here.
I was able to save an additional 6 bytes adapting a suggestion offered by @isaacg1 on GitHub, which was to combine strings in a way that allowed us to eliminate the left and right strings entirely. It ended up being more complicated than the initial suggestion. Now down to 928 bytes with all the ELF overhead. See it here.
I was able to save an additional 2 bytes by flipping the last entry in the parser switch, saving a jmp. This was discovered when working with the QBE compiler backend. Now only 208 bytes! See it here.
I was able to save an additional byte by replacing a movb
with an xchg
. Now only 207 bytes! See it here.