I am on a number of different compiler mailing lists. One of those is the mailing list for the Tiny C Compiler. Imagine my surprise when a recent email mentioned our latest series and suggested trying a similar approach to improve the code generator in TCC. I was interested to see if there were any low-hanging fruit that could produce similar gains as O provides for QBE.
tcc and cproc
We should familiarize ourselves with how both tcc and cproc generate their output. For the purposes of simplifying things, we will presume that both compiler front-ends are perfectly equivalent. I know they're not, but our interest right now isn't in the front-end.
As we previously learned, cproc outputs QBE IL that is further processed by QBE to produce assembly, which is then processed by the system assembler to produce object code and (eventually) linked with the system linker to produce binaries and libraries. Because of this pipeline, where each of the major steps is handled by a different standalone executable, we were able to write our own standalone executable, O, that we inserted into the compilation pipeline after QBE but before the system assembler. That allowed us to leverage knowledge about the format of QBE-generated assembly to search for deficiencies and write a simple pattern-matching program to output optimized versions of specific patterns. Our O program is a relatively straightforward peephole optimizer.
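To make that concrete, here is a minimal sketch of the kind of textual rewrite O performs. This is not the actual O source, just an illustration that assumes AT&T-syntax assembly arrives on stdin and the rewritten assembly goes to stdout:

/* Minimal peephole sketch (not the real O): rewrite "movl $0, %reg"
 * into the shorter "xorl %reg, %reg" and pass every other line through
 * unchanged. */
#include <stdio.h>

int main(void)
{
	char line[4096], reg[32];

	while (fgets(line, sizeof(line), stdin) != NULL) {
		if (sscanf(line, " movl $0, %%%31s", reg) == 1)
			printf("\txorl %%%s, %%%s\n", reg, reg);
		else
			fputs(line, stdout);
	}

	return 0;
}

The real thing matches more patterns and is more careful about what it touches, but the shape is the same: read assembly text, swap a known-bad idiom for a known-better one, and emit everything else verbatim.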
In contrast, tcc combines the features of all the utilities needed to create a complete binary or library into a single executable. That means that tcc incorporates a C preprocessor, C compiler, and linker all in one. There is also an integrated assembler in order to support inline assembly. Unlike cproc, tcc does not output assembly; it directly outputs object files. That's great in terms of speed: saving an entire step compared to cproc (and gcc, for that matter) is bound to result in faster compiles, all else being equal.
However, taking the single-binary approach combined with directly outputting object code means that we are not going to be able to take the O approach to optimization. There isn't a pipeline of standalone executables that we can insert a new utility into. And even if there were, while the ELF format does divide code into sections that we could selectively interrogate and run a peephole optimizer over, it would be far more work than reading over assembly code, requiring at a minimum a manual so we could figure out what the binary decodes to. As x86 and x86_64 are notoriously difficult to decode correctly, we would either have to spend countless sleepless nights poring over Intel manuals (and still probably get it wrong sometimes) or trust someone else's decoder to get things right (and still probably get it wrong sometimes). Neither of these options is ideal, so we need to find another approach.
The tcc code generator
Because tcc directly outputs object code, we could take the approach of reading through the code generator, spotting object code generation that is less than ideal, and teaching tcc to generate better code instead. To start out, I am only going to worry about x86_64 code generation. There are code generators for x86, arm, aarch64, riscv64, and TMS320C67x as well. Each code generator is its own file and there is no intermediate code, so there are no IL-based optimizations that would benefit all the code generators at once. Nevertheless, I suspect there is sufficient usage of the x86_64 code generator to make this a good starting point.
The x86_64 code generator can be found here. We are also going to use a very useful tool to decode the object code that tcc generates: objdump. This utility is part of the GNU binutils, so chances are good you already have it installed or can easily get it on your system. On my OpenBSD machine, I have a little script that grabs, builds, and installs the latest git HEAD of binutils every night, so I am using GNU objdump (GNU Binutils) 2.38.50.20220406, but I am willing to bet that even old versions of objdump will work fine for today.
What we will do is build some simple sample programs with tcc and read the disassembled output from objdump in order to find opportunities to teach the code generator how to do things better.
Therefore, when we talk about optimizing the code generator today, we're not necessarily talking about making the code generator itself smaller or faster. We're talking about making the code generator smarter. If we're lucky, even though we'll add some code to make it smarter, the improved code generator will be run on itself, and the optimization improvements will outweigh the additional code we add.
With that said, there is an overarching concern to keep in mind: the overall goal of TCC is to be fast. It is about an order of magnitude faster than gcc, and optimizations can be expensive to compute and get right. I want to avoid making TCC take longer to compile code. Hopefully, our work today can make TCC faster and produce better code at the same time. If not, we should at least not make TCC slower.
Let's start with the simplest C program. Even simpler than Hello World. How about the true program? How about something even smaller:
main(){}
That's it. That's the simplest complete C program I know how to write. Let's compile it with tcc and read the disassembly with objdump:
/home/brian $ tcc -c true.c
/home/brian $ objdump -d true.o

true.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <main>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   48 81 ec 00 00 00 00    sub    $0x0,%rsp
   b:   b8 00 00 00 00          mov    $0x0,%eax
  10:   c9                      leave
  11:   c3                      ret
Our old friend movl $0, %eax is back. We know what to do with that, and now we can even see why it is we'd want to make the replacement: movl $0, %eax is five bytes. For comparison, xorl %eax, %eax is only two bytes (0x31 0xc0), so we get a 60% reduction in code size for this optimization. It is also recommended by Intel for zeroing a register.
As we learned when developing O, optimizing all instances of mov $0 to xor consistently reduced binary sizes between 2% and 3%. Let's teach tcc how to make this improvement.
The code generator (and most of TCC) is a bit opaque in its coding: lots of one-letter and two-letter variable and function names. Fortunately for us, some comments make it abundantly clear where we should focus our attention. There's a comment that clearly says /* mov $xx, r */ and that's the exact idiom we are looking for. We want to teach the code generator about a special case: if the immediate is zero, we can optimize this instruction to an xor. There's separate code for the 32-bit and 64-bit cases. Let's start with the 32-bit case.
The 32-bit case code can be found here. Let's imagine that we don't know what the orex() function does. There is still some information we can glean from this block. The next line appears to emit a 32-bit immediate, and it seems like the variable fc represents that immediate, so that must be the value for us to check: if fc is zero, then we should apply the optimization. Even if we still don't know what the orex() function does, one of its arguments is REG_VALUE(r), and its value is added to 0xb8 to form a mov opcode, so it appears to be a register modifier.
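For reference, the two lines under discussion boil down to the following; they reappear verbatim below as the else branch of our patch:

orex(0,r,0, 0xb8 + REG_VALUE(r)); /* mov $xx, r */
gen_le32(fc);                     /* followed by the 32-bit immediate fc */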
Let's see what xorl %eax, %eax encodes to. Yes, I know I already spoiled it earlier in this post, but let's confirm things. We can write a new assembly file named xor.s and inside put:

xorl %eax, %eax

My version of the GNU assembler is GNU assembler (GNU Binutils) 2.38.50.20220406. I have seen examples where newer versions of GNU as select better encodings for certain instructions than older versions, but I don't think that will matter much today. When I run as -o xor.o xor.s and then run objdump -d xor.o, here is the output I see:
xor.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <.text>:
   0:   31 c0                   xor    %eax,%eax
Confirmed. We want to turn the binary pattern b8 00 00 00 00 into 31 c0. At least, that will turn movl $0, %eax into xorl %eax, %eax. It will be slightly different for the other registers. How different? Let's find out.
One of the things to notice is that TCC does not use all the available registers. Looking at the list, it seems like TCC avoids using registers that the System V AMD64 ABI says are callee-saved, because using callee-saved registers almost certainly complicates things. So we only have rax, rcx, rdx, rsi, rdi, r8, r9, r10, and r11. While rsp is in that enum, it is a callee-saved register; I'm pretty sure it's only in this enum to deal with stack frame setup and teardown. Let's not worry about the xmm registers; those are for SSE instructions. The st0 register is for the floating-point unit.
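For context, the enum in x86_64-gen.c looks roughly like this. I am paraphrasing from memory and omitting the xmm and st0 entries, so check the source for the exact definition; the important part is that the numbering skips 3 and 5, the hardware encodings of rbx and rbp:

enum {
    TREG_RAX = 0,
    TREG_RCX = 1,
    TREG_RDX = 2,
    TREG_RSP = 4,
    TREG_RSI = 6,
    TREG_RDI = 7,

    TREG_R8  = 8,
    TREG_R9  = 9,
    TREG_R10 = 10,
    TREG_R11 = 11,

    /* xmm and st0 entries omitted */
};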
So let's follow the order in this enum and edit our xor.s test file to add additional xor lines for each of these registers we have available for us to use:
xorl %eax, %eax
xorl %ecx, %ecx
xorl %edx, %edx
xorl %esp, %esp
xorl %esi, %esi
xorl %edi, %edi
xorl %r8d, %r8d
xorl %r9d, %r9d
xorl %r10d, %r10d
xorl %r11d, %r11d
That assembly produces this output:
xor.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <.text>:
   0:   31 c0                   xor    %eax,%eax
   2:   31 c9                   xor    %ecx,%ecx
   4:   31 d2                   xor    %edx,%edx
   6:   31 e4                   xor    %esp,%esp
   8:   31 f6                   xor    %esi,%esi
   a:   31 ff                   xor    %edi,%edi
   c:   45 31 c0                xor    %r8d,%r8d
   f:   45 31 c9                xor    %r9d,%r9d
  12:   45 31 d2                xor    %r10d,%r10d
  15:   45 31 db                xor    %r11d,%r11d
Turns out that the r8-r11 registers require an extra byte to encode: a REX prefix is needed to reach the extended registers. It's still smaller than a mov into those registers. However, I've already done a good bit of testing, and it turns out that getting that far down the register list is so rare I've never seen it happen. So we're just not going to bother with the three-byte encodings.
For the remaining six registers, the difference between each opcode is nine (c9-c0, d2-c9, ff-f6); remember that we're missing rbx and rbp, and those fill in the missing steps. The enum we previously saw does account for the missing registers. Let's put a pin in this for now.
Armed with this new knowledge, let's improve the 32-bit case. We want to check whether the fc variable is zero; if it is, and the register we are working with is one of the registers that encodes to a two-byte xor, then we should output code to xor the register with itself. Otherwise, we should do what we've been doing: mov the immediate into the register.
In code form, that would look like this:
if (fc == 0 && r < 8) {
    o(0x31); /* xor r, r */
    o(0xc0 + REG_VALUE(r) * 9);
} else {
    orex(0,r,0, 0xb8 + REG_VALUE(r)); /* mov $xx, r */
    gen_le32(fc);
}
That's it. That should improve the code generator for 32-bit mov. Note that we do REG_VALUE(r) * 9 because we learned that each xor of a register with itself is nine opcodes away from the next one. What's in between? The encodings you would use to xor a register with a different register.
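If you want to see that arithmetic spelled out, here is a small standalone sketch (not tcc code) that reproduces the second byte of each two-byte xor. The ModRM byte starts at 0xc0 for register-direct operands and packs the register number into both the reg field (weight 8) and the r/m field (weight 1), which works out to exactly 0xc0 + r * 9:

/* Standalone sanity check (not tcc code): print the encoding of
 * "xorl %reg, %reg" for the eight classic 32-bit registers. */
#include <stdio.h>

int main(void)
{
	const char *regs[] = { "eax", "ecx", "edx", "ebx",
	                       "esp", "ebp", "esi", "edi" };
	int r;

	for (r = 0; r < 8; r++)
		printf("31 %02x\txorl %%%s, %%%s\n",
		    (unsigned)(0xc0 + r * 9), regs[r], regs[r]);

	return 0;
}

Running it prints 31 c0 through 31 ff, matching the objdump output above, with 31 db and 31 ed filling in the rbx and rbp gaps.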
64-bit mov
We can follow the exact same pattern for the 64-bit mov optimization. Note here that TCC is using sv->c.i to store the value of the immediate. Even though this is a 64-bit optimization, we are going to want to use the 32-bit xor. This is because 32-bit operands generate a 32-bit result, zero-extended to a 64-bit result in the destination general-purpose register. Or, a bit more succinctly: 32-bit operations clear the upper half of the 64-bit destination register. It takes three bytes to encode a 64-bit xor, but since the 32-bit xor will clear the upper half of the register anyway, we can save a byte and get the same result. Here's the code:
if (sv->c.i == 0 && r < 8) {
    o(0x31); /* xor r, r */
    o(0xc0 + REG_VALUE(r) * 9);
} else {
    orex(1,r,0, 0xb8 + REG_VALUE(r)); /* mov $xx, r */
    gen_le64(sv->c.i);
}
Was it really that easy? Yes, yes it was.
...almost
Further down the code generator, there is a location where a hardcoded mov of an immediate into %eax takes place before a call to an old-style or variadic function; the System V AMD64 ABI uses %al to tell a variadic callee how many vector registers were used to pass arguments, and TCC handles old-style functions the same way to be safe. If that immediate is zero, then this is yet another location where we need to teach TCC about our xor optimization. It took me a while to notice it at first, and I only noticed it because subsequent runs of objdump told me I wasn't making all the optimizations I thought I should be.
Here is the improvement:
if (vtop->type.ref->f.func_type != FUNC_NEW) { /* implies FUNC_OLD or FUNC_ELLIPSIS */
    if (nb_sse_args == 0)
        o(0xc031); /* xor eax, eax */
    else
        oad(0xb8, nb_sse_args < 8 ? nb_sse_args : 8); /* mov nb_sse_args, %eax */
}
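If you want a test case for this path, something like the following hypothetical example should exercise it: the call goes through an old-style (unprototyped) declaration, so it is FUNC_OLD, and since no floating-point arguments are passed, nb_sse_args is zero and the patched compiler can emit the xor:

/* Hypothetical test: a call through an old-style declaration. */
int f();                 /* no prototype, so the call is FUNC_OLD */

int main(void)
{
	return f(1, 2);  /* no floating-point args: nb_sse_args == 0 */
}

int f(a, b)              /* old-style (K&R) definition */
int a, b;
{
	return a + b;
}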
While we can congratulate ourselves on improving the code generator, we don't actually know we have made an improvement until we test things out. I want to test two different things: first, whether we are producing smaller binaries; and second, whether we are producing faster binaries. Again, I don't want to slow tcc down. I am quite confident we will produce smaller binaries. But if tcc takes longer to compile code, I'm not sure it would be worth it.
First, I ran the TCC test suite to make sure I didn't break anything. Everything passed. So far, so good.
Next, I calculated the difference in the .text section of the TCC binaries, libraries, and object files before and after applying this diff. Here are the numbers I saw:
Binary | Old size (bytes) | New size (bytes) | Difference in bytes | Percent size reduction
---|---|---|---|---
tcc | 328786 | 321358 | 7428 | 2.26%
libtcc.a | 307288 | 300252 | 7036 | 2.29%
bcheck.o | 23254 | 22801 | 453 | 1.95%
bt-exe.o | 4732 | 4550 | 182 | 3.85%
bt-log.o | 648 | 639 | 9 | 1.39%
libtcc1.a | 12678 | 12119 | 559 | 4.41%
That's pretty excellent. Even though we added code to the compiler, we still walked away with a pretty significant size reduction. That's a big win.
I also did a battery of build time tests and could find no statistically significant changes in compile time. In fact, the first time I ran the TCC test suite with the improved compiler, I was surprised to see that the times for that run were lower than they were with the old TCC. But it's not statistically significant as far as I can tell. This isn't all that unexpected as we really only added five truthiness checks, and those checks don't even run every time the code generator produces a new instruction.
Another TCC user posted some build times and .text sizes for SQLite3: a 2.60% reduction in binary size and no difference in build times (though there isn't quite enough data in the email to make a statement on the significance of the build times).
I'm going to call this a win. Smaller binaries and no difference in compile times. This is an optimization worth sharing with the rest of the TCC community.
I posted the complete diff to the TCC mailing list. Other than a drive-by commit I made as part of my role as OpenBSD port maintainer of TCC to support the riscv64 arch, I haven't personally made any other commits to TCC, though I did author the original diff to finish TCC linker support on OpenBSD. I'm not exactly sure what the procedure is for committing something that is going to affect everyone to the mob branch, so I'll just wait until I get a go-ahead. Until then, you can grab the patch from the mailing list. I suspect I will commit it soon.
I hope you enjoyed coming on this journey with me to hand-optimize the TCC x86_64 code generator. There are certainly more such improvements in the x86_64 code generator and all the other code generators. Let's find those opportunities and teach them to TCC.