Brian Robert Callahan

academic, developer, with an eye towards a brighter techno-social life



[prev]
[next]

2021-07-10
I wrote a 231-byte Brainfuck compiler by abusing everything

All source code for this blog post can be found here.

We wrote a Brainfuck compiler back in April. That one developed a simple architecture to support multiple backends. Today, I want to write a new Brainfuck compiler. One that can only run on a single system, OpenBSD/amd64. And one that is as small as I can possibly make it. I broke into the sub-256 bytes club with my compiler, so I thought it was worthwhile sharing and talking about how I was able to pull it off.

While this works on my machine, there may be subtle differences about how your machine works that cause this compiler not to work. Make sure to check your machine's behavior if there are any differences!

We are optimizing exclusively for size. That means that we may make decisions that are less performant than other ways of writing a Brainfuck compiler. That's OK.

The assembly code will be written using AT&T syntax. Last time I had a blog post with assembly code, someone complained that it was not Intel syntax. I am not much bothered by either and I teach my undergraduates both syntaxes because I think it is important to be able to read both. If you are an Intel syntax person, take this post as an opportunity to improve your AT&T syntax skills.

Reviewing Brainfuck

Let's quickly review the syntax of Brainfuck so we know what we need to implement. The table below presumes that we are translating Brainfuck to C, have a global array acting as the Brainfuck tape, and have a pointer, *p, that is intialized to the beginning of the array.

Symbol C equivalent
< --p;
> ++p;
- --*p;
+ ++*p;
, *p = getchar();
. putchar(*p);
[ while (*p) {
] }

Choosing an implementation language

I think the only way we stand a chance at creating a compiler as small as possible is to use hand-optimized assembly. Those precious bytes aren't going to save themselves!

We will be using the GNU binutils for our assembler and the OpenBSD system linker, which is LLVM lld, for our linker. I am using the GNU assembler from ports, which is a much newer version of the assembler. This newer assembler is a little bit smarter when it comes to encoding and that ends up saving a byte for us down the line. I'll point it out when we get there.

Coding our strings

Our compiler will compile Brainfuck to C. I chose C because it will allow us to have some of the smallest output code, and by extension the shortest strings needed to have built-in to our compiler. We will use the table above for our translation.

Our compiler therefore begins as follows:

	.text
	.globl	main
main:

.LSprologue:
	.ascii	"a[65536],*p=a;main(){"	# 21

.LS:
	.ascii	"--p;"		# 4
	.ascii	"++p;"		# 4
	.ascii	"--*p;"		# 5
	.ascii	"++*p;"		# 5
	.ascii	"*p=getchar();"	# 13
	.ascii	"putchar(*p); "	# 13
	.ascii	"while(*p){"	# 10
	.ascii	"return 0;"	# 9
	.ascii	"}"		# 1

The top is some boilerplate to inform the toolchain that we have a global symbol named main and it (and everything else, for that matter) lives in the text section of our binary.

Beginning with .LSprototype is our strings. Labels beginning with .L are considered local labels for ELF targets like my machine. After each string I put a comment with the number of characters in the string. We will use the OpenBSD kernel facilities for reading and writing, and the write(2) system call has a parameter for number of bytes to write. Yes, we could write a strlen(3) function, and in fact we did that for our echo program. However, writing such a function would take too many bytes compared to simply shoving the correct value into the register.

The .LSprologue string is separate from the rest of the .LS strings because the way I organized the compiler the prologue is used directly and for all other strings we index into the string starting at .LS.

You'll also notice I am using the .ascii pseudo-op rather than the .asciz pseudo-op. This means that there is no NUL-termination of any of these strings, so they are not C strings. Since we are providing the exact number of bytes to write with each write(2) call, we can get away with this and save a number of bytes in the process.

We are also leaving them in the .text section. That technically means they are code. We could place them in the .rodata section but since these string are never written to, I think it is fine.

We can also see some more byte-shaving at work. While we are compiling to C, we didn't specify which standard we were targeting. We are going very old by having implicit ints everywhere. We are also going to be implicitly declaring getchar(3) and putchar(3) which is invalid in C99 and later, according to clang. So we will target C89 and/or K&R C.

As a result of the implicit ints the cells on the Brainfuck tape are 32-bits in size. There is no requirement for any particular cell size, so this is acceptable from a Brainfuck perspective.

I also made the getchar and putchar strings equal length, even though the putchar string is actually one character shorter. We will spend a byte now to save several bytes later.

Reading code into a buffer

Our parser needs to read in the program one character at a time and process each character. If it is one of the Brainfuck characers, we output the appropriate translation, otherwise we move on to the next character. We are going to take advantage of the fact that we can read from stdin and write to stdout without having to set anything up outselves; these facilities are set up for us by our environment.

We will also need to look up the syscall numbers for read(2) and write(2), which are 3 and 4.

We must read in a character before we can parse it and write anything. Our read can look like this:

.Lparse:
	movb	$3, %al		# Load read(2) system call
	movb	$1, %dl		# Read one character
	xorl	%edi, %edi	# Read from stdin
	leaq	(%rsp), %rsi	# Read into top of stack
	syscall			# read(0, (%rsp), 1);
	incl	%edi		# Set %edi to 1, for write

We can take advantage of some assumptions on our system for this to work. On our system all general-purpose registers are initialized to 0 by the environment before program start. The movb instructions require only 2 bytes but do not 0-out the remainder of the register. If there was something already in %rax or %rdx above the lowest byte, it would remain as-is. But since our environment set these registers to 0, we are OK. These movb instructions are 2 bytes each.

We also chose to use xorl instead of movb for the %rdi register. This is because you cannot use movb on %rdi since it is at its smallest a 16-bit register. Since we are making it 0, xor'ing any 32-bit register with itself is the smallest way to do this at 2 bytes.

The leaq instruction is expensive at 4 bytes. It's actually one of the smallest versions of this instruction and we will see a 7-byte version later. Finally, we need to call the syscall for 2 bytes. That's 12 bytes total to get a character from stdin.

We are also assuming that our environment set up some stack for us and that our stack pointer points to something reasonable. That happens to be true. As we learned before, the environment does contain our argc and argv variables and that is the case here too. So we have a reasonable stack pointer.

While it is not directly related to reading in a character, it turns out that %rdi stays the same across read(2) calls, so we can increment it to prepare for an upcoming write(2) call. Incrementing is 2 bytes, which is the smallest way to turn a 0 into a 1 for us. Even in the case where we do not actually end up writing anything, for example because we read in a comment character, it does not much matter since we xor the register before the read(2) call every time.

Checking for EOF

Now would be a good time to check if we are at the end of the file, since if we are we have finished our work and should exit the program. Fortunately, read(2) returns the number of bytes read in %rax. If that number is 1, then we are not yet finished. Otherwise, if it is 0 or -1 then we are finished. A 0 means end of file and a -1 means an error occured. We don't actually care if we reached end of file or encountered an error. Both of those mean it is time to exit for us.

Quick aside: The System V AMD64 ABI calling convention

Our machine uses the System V AMD64 ABI calling convention. This dictates how our registers are used and how registers are saved or not across function calls, including syscalls. For us, we need to know that %rax will contain the return value of a call, so we should presume that it will change after each call. Additionally, %rax contains the syscall number prior to making the syscall. The calling convention also says that the %rbx, %rsp, %rbp, and %r12-%r15 registers must be restored to their original value by the callee if they are used. That means as far as we are concerned, we can use those registers across calls and presume that their value will stay the same. Any other registers may be clobbered by calls. We must also respect the registers used for call parameters. Since read(2) and write(2) use three parameters, we must make sure to use %rdi, %rsi, and %rdx for those parameters, like we did with our read(2) call above.

It happens to be the case that that %rdi and %rdx do not get clobbered by read(2) and write(2) calls on our machine, which we will use to our advantage.

Checking for EOF, part 2

Now we can check for EOF after each read:

	cmpl	%edx, %eax	# EOF ? (%eax < 1)
	jl	.Leof

That's all we need. Since %rdx does not get clobbered on our system, we know it is 1. If %rax is less than that, we know we either reached the end of the file or an error occured. In that case, we should jump to the code that exits the program.

Parsing the input

Something that we need to know is that the smallest cmp is when you compare a byte against %al. But as of now, %rax contains the return code of the read(2) call. We stand to save eight bytes using the smallest cmp. We will end up spending three bytes to save eight bytes, which is great.

We can make this happen by slightly tweaking our end of file check and then parsing our input after the end of file check:

	xchg	%ebp, %eax	# Store return value in %ebp
	movb	(%rsi), %al	# cmpb imm, %al is the smallest cmp
	leal	.LS, %esi	# Preload string
	cmpl	%edx, %ebp	# EOF ? (%ebp < 1)
	jl	.Leof
	cmpb	$60, %al	# '<' ?
	je	.Lleft
	cmpb	$62, %al	# '>' ?
	je	.Lright
	cmpb	$45, %al	# '-' ?
	je	.Ldec
	cmpb	$43, %al	# '+' ?
	je	.Linc
	cmpb	$44, %al	# ',' ?
	je	.Lgetchar
	cmpb	$46, %al	# '.' ?
	je	.Lputchar
	cmpb	$91, %al	# '[' ?
	je	.Lopenloop
	cmpb	$93, %al	# ']' ?
	je	.Lcloseloop
	jmp	.Lparse		# Comment character, skip

We can xchg to get our smallest cmp. This is the reason for the newer GNU assembler. The older assembler in base always encodes this to a 2-byte sequence but the newer assembler is smart enough to know this can be encoded as a 1-byte instruction and does so. We then put the character we read in into %rax and preload the address of all our strings via a 7-byte leal. That's very expensive, in fact it is the most expensive instruction we will encounter. But it is a necessary evil for us. Now we perform our end of file check and if we are not at the end of the file we continue on figuring out which character we have and what to do with it.

Processing each possibility

Let's leave the end of file processing for later and now tackle the Brainfuck symbol processing. There are eight symbols we need to handle:

.Lright:
	addl	$4, %esi
.Lleft:
	movb	$4, %dl
	jmp	.Lwrite

Let's take a look at just this for right now. This code handles both the < and > symbols. Recall that .LS contains all our strings, prologue notwithstanding. The first character of .LS corresponds to the < symbol. We are calling that the .Lleft label. That has an offset of 0 so we immediately move to telling write(2) we want to write 4 characters and then head to the syscall. If we instead see a > that's the .Lright label and we need to offset the starting point of the string by 4. We then fall through because both outputs are 4 characters in length so we can combine the code to save bytes.

We will use these tricks a few more times. Let's continue writing the symbol processing code:

.Ldec:
	subl	$5, %esi	# 13 - 5 = 8
.Linc:
	addl	$13, %esi
	movb	$5, %dl
	jmp	.Lwrite

Let's stop here too and take a look at one more trick. Both .Ldec and .Linc output strings that are the same length. We want to be able to combine the code like we did in the .Lleft and .Lright case. However, we don't have an index of 0 for either. We get around this by creatively subtracting the length of the string from the offset of the string farther from the beginning of .LS to get the offset of the string closer to the beginning of .LS and then fall through from there. These subtractions and additions, because they are within the range of -128 and 127 all encode to 2 bytes.

Let's continue on:

.Lgetchar:
	subl	$13, %esi	# 31 - 13 = 18
.Lputchar:
	addl	$31, %esi
	movb	$13, %dl
	jmp	.Lwrite

Exactly the same tricks as .Linc and .Ldec. Note though that if I did not extend the putchar string by one character at the beginning, we could not use any of these tricks and it would cost us a whole lot more than the one byte it cost to elongate the putchar string.

Let's finish up our processing code:

.Lopenloop:
	incl	%ebx		# Increment loop counter
	addl	$44, %esi
	movb	$10, %dl
	jmp	.Lwrite
.Lcloseloop:
	decl	%ebx		# Decrement loop counter
	js	.Lexit		# %ebx < 0 ? (%rdi == 1 from the write(2) call)
	addl	$63, %esi
	movb	$1, %dl
	jmp	.Lwrite

Unfortunately, we cannot use the fall through trick here but there are some other tricks we can use. We use the fact that all our general-purpose registers are initialized to 0 and the fact that %rbx is saved across calls to just start using it effectively out of nowhere to act as our loop counter. The .Lopenloop label is otherwise straightforward.

The .Lcloseloop label has an additional trick. Because we want to have a good compiler that has proper return values, 0 for success and 1 for failure, we decrement the loop counter and check to see if it has fallen below 0. If it has, there is an imbalance of brackets and we should stop processing and error out. Unfortunately, dec preserves the value of the carry flag, so we cannot use it to determine if we have gone below 0. However, it alters all other flags and we can use the sign flag to check if we have gone negative. The js opcode jumps if the sign flag is set, which it only will be if we have gone negative. We can avoid a cmp against -1 this way. Fortunately, %rsi has already been set to 1 and is exactly the value we want for _exit(2) when we error out so we do not have to make any adjustments to %rdi. The rest is straightfoward.

Writing output

We will use the fact that stdout comes available to use from our environment without us having to do anything and as such we will write our compiled code to stdout. We are also going to use another trick here: we will place the .Lwrite label between the .Lparse and .Lright labels. This is because if you have a jmp that is within the range of -128 and 127 bytes, that encodes to a short jump which is only 2 bytes. That is the smallest jmp there is. We want all our jmps to be short jumps. Our .Lwrite labels ends with a jmp to .Lparse in order to read in the next character and if it came after all the code we just wrote, it will be more than -128 bytes away.

Our .Lwrite label is very short:

.Lwrite:
	movb	$4, %al
	syscall			# write(1, string, length);
	jmp	.Lparse

That's it. All the writes have to put 4 into %rax, make the syscall, and then read in the next character.

Dealing with end of file and exiting

Now let's write the code to handle end of file and the code to handle exit, since we will use some fall through to handle them. Let's also put this new code immediately after .Lwrite and before .Lright, again to ensure all our jmps are short jumps.

It looks like this:

.Leof:
	cmpl	%edx, %ebx	# Loop counter < 1 ? (i.e., 0)
	jge	.Lexit
	addl	$54, %esi
	movb	$10, %dl
	movb	$4, %al
	syscall
	xorl	%edi, %edi	# Get ready to exit
.Lexit:
	movb	$1, %al		# _exit(%edi);
	syscall

When we reach the end of the file, we need to make sure our loop counter is 0. If it is not, we have a bracket imbalance and should error out immediately. Like before, %rdi happens to be 1 from the set up for the write(2) call so we do not need to make any adjustments.

The remainder of the .Leof label is writing the C epilogue. It then sets %rdi to 0 because we are going to exit successfully and falls through to .Lexit which sets the syscall number to the _exit(2) syscall and calls it, which exits the program.

Writing the prologue

The last thing we need to write in order to finish our compiler is the prologue. This code goes immediately after the main label and before the .Lparse label.

We need to make a call to write(2) with our .LSprologue string:

	popq	%rdi		# Write prologue (argc == 1, hopefully)
	leal	.LSprologue, %esi
	movb	$21, %dl
	jmp	.Lwrite

Our final trick is immediately popping the top of the stack into %rdi right at the start. As we learned before, our environment puts argc in there. So long as argc is 1, which it should be, that allows us to set up %rdi correctly for this initial write(2) call in just 1 byte. If not, well then your output will be incorrect. It is the one error that goes undetected by our compiler. We are going to accept that we have cut this corner.

And that's it! Our compiler is finished.

Putting it all together

This is what our final completed compiler looks like:

	.text
	.globl	main
main:
	popq	%rdi		# Write prologue (argc == 1, hopefully)
	leal	.LSprologue, %esi
	movb	$21, %dl
	jmp	.Lwrite
.Lparse:
	movb	$3, %al		# Load read(2) system call
	movb	$1, %dl		# Read one character
	xorl	%edi, %edi	# Read from stdin
	leaq	(%rsp), %rsi	# Read into top of stack
	syscall			# read(0, (%rsp), 1);
	incl	%edi		# Set %edi to 1, for write
	xchg	%ebp, %eax	# Store return value in %ebp
	movb	(%rsi), %al	# cmpb imm, %al is the smallest cmp
	leal	.LS, %esi	# Preload string
	cmpl	%edx, %ebp	# EOF ? (%ebp < 1)
	jl	.Leof
	cmpb	$60, %al	# '<' ?
	je	.Lleft
	cmpb	$62, %al	# '>' ?
	je	.Lright
	cmpb	$45, %al	# '-' ?
	je	.Ldec
	cmpb	$43, %al	# '+' ?
	je	.Linc
	cmpb	$44, %al	# ',' ?
	je	.Lgetchar
	cmpb	$46, %al	# '.' ?
	je	.Lputchar
	cmpb	$91, %al	# '[' ?
	je	.Lopenloop
	cmpb	$93, %al	# ']' ?
	je	.Lcloseloop
	jmp	.Lparse		# Comment character, skip
.Lwrite:
	movb	$4, %al
	syscall			# write(1, string, length);
	jmp	.Lparse
.Leof:
	cmpl	%edx, %ebx	# Loop counter < 1 ? (i.e., 0)
	jge	.Lexit
	addl	$54, %esi
	movb	$10, %dl
	movb	$4, %al
	syscall
	xorl	%edi, %edi	# Get ready to exit
.Lexit:
	movb	$1, %al		# _exit(%edi);
	syscall
.Lright:
	addl	$4, %esi
.Lleft:
	movb	$4, %dl
	jmp	.Lwrite
.Ldec:
	subl	$5, %esi	# 13 - 5 = 8
.Linc:
	addl	$13, %esi
	movb	$5, %dl
	jmp	.Lwrite
.Lgetchar:
	subl	$13, %esi	# 31 - 13 = 18
.Lputchar:
	addl	$31, %esi
	movb	$13, %dl
	jmp	.Lwrite
.Lopenloop:
	incl	%ebx		# Increment loop counter
	addl	$44, %esi
	movb	$10, %dl
	jmp	.Lwrite
.Lcloseloop:
	decl	%ebx		# Decrement loop counter
	js	.Lexit		# %ebx < 0 ? (%rdi == 1 from the write(2) call)
	addl	$63, %esi
	movb	$1, %dl
	jmp	.Lwrite

.LSprologue:
	.ascii	"a[65536],*p=a;main(){"	# 21

.LS:
	.ascii	"--p;"		# 4
	.ascii	"++p;"		# 4
	.ascii	"--*p;"		# 5
	.ascii	"++*p;"		# 5
	.ascii	"*p=getchar();"	# 13
	.ascii	"putchar(*p); "	# 13
	.ascii	"while(*p){"	# 10
	.ascii	"return 0;"	# 9
	.ascii	"}"		# 1

How many bytes is our compiler?

On OpenBSD, there is some mandatory overhead, 24 bytes worth. We will measure the size of our compiler three ways: by itself, with the OpenBSD overhead, and with all the ELF overhead.

The compiler by itself:

$ size bf256.o
text    data    bss     dec     hex
231     0       0       231     e7

Only 231 bytes! Not bad. If I remember correctly, supposedly the original Brainfuck compiler was 240 bytes.

The compile with the OpenBSD overhead:

$ size bf256
text    data    bss     dec     hex
255     0       0       255     ff

So even if you insist that it is unfair that I excluded the OpenBSD overhead in my count, we still come in at only 255 bytes.

With all the ELF overhead:

$ ls -l bf256
-rwxr-xr-x  1 brian  brian  952 Jul 10 19:38 bf256

Well, 952 bytes. OK, that's where the ELF overhead gets you. I don't want to get out a hex editor and start whittling down all that. I am going to say it is good enough and giving us all permission to ignore all the ELF overhead.

Conclusion

It is time for me to put away the text editor. My target was 256 bytes and I just scraped by. Can you find additional bytes to shave off?

Update: A 207-byte compiler

I was able to save an additional jmp and an extraneous movb $1, %dl, saving an additional 4 bytes, which allowed me to add back one byte and change the initial popq %rdi to an incl %edi, guaranteeing that the initial prologue write will always work regardless of the number of arguments to the program and still save a byte. See it here.

I was able to save an additional byte by removing the extra space in the putchar string and using the last character in the getchar string to pad out the putchar string. As this is a ; character, in C that becomes an empty statement, which is legal C. This gets us down to 944 bytes with all the ELF overhead. See it here.

I was able to save an additional 11 bytes by removing the epilogue string entirely and simply using the closeloop string for the epilogue. It seems that clang does not miss it. That gets us to 936 bytes with all the ELF overhead. See it here.

I was able to save an additional 6 bytes adapting a suggestion offered by @isaacg1 on GitHub, which was to combine strings in a way that allowed us to eliminate the left and right strings entirely. It ended up being more complicated than the initial suggestion. Now down to 928 bytes with all the ELF overhead. See it here.

I was able to save an additional 2 bytes by flipping the last entry in the parser switch, saving a jmp. This was discovered when working with the QBE compiler backend. Now only 208 bytes! See it here.

I was able to save an additional byte by replacing a movb with an xchg. Now only 207 bytes! See it here.

Top

RSS