Brian Robert Callahan

academic, developer, with an eye towards a brighter techno-social life


Hand-optimizing the TCC code generator

I am on a number of different compiler mailing lists. One of those is the mailing list for the Tiny C Compiler. Imagine my surprise when a recent email mentioned our latest series and suggested trying a similar approach to improve the code generator in TCC. I was interested to see if there were any low-hanging fruit that could produce similar gains as O provides for QBE.

Differences between tcc and cproc

We should familiarize ourselves with how both tcc and cproc generate their output. For the purposes of simplifying things, we will presume that both compiler front-ends are perfectly equivalent. I know they're not, but our interest right now isn't in the front-end.

As we previously learned, cproc outputs QBE IL that is further processed by QBE to produce assembly, which is then processed by the system assembler to produce object code and (eventually) linked with the system linker to produce binaries and libraries. Because of this pipeline where each of the major steps is processed by a different standalone executable, we were able to write our own standalone executable, O, that we inserted into this compilation pipeline after QBE but before the system assembler. That allowed us to leverage knowledge about the format of QBE-generated assembly to search for deficiencies and write a simple pattern-matching program to output optimized versions of specific patterns. Our O program is a relatively straightforward peephole optimizer.

In contrast, tcc combines the features of all the utilities needed to create a complete binary or library into a single executable. That means that tcc incorporates a C preprocessor, C compiler, and linker all in one. There is also an integrated assembler in order to support inline assembly. Unlike cproc, tcc does not output assembly—it directly outputs object files. That's great in terms of speed: saving an entire step compared to cproc (and gcc, for that matter) is bound to result in faster compiles, all else being equal.

However, taking the single-binary approach combined with directly outputting object code means that we are not going to be able to take the O approach to optimization. There isn't a pipeline of standalone executables that we can insert a new utility into. And even if there were, while the ELF format does divide code into sections that we could selectively interrogate and run a peephole optimizer over, it would be far more work than reading over assembly code, requiring at a minimum a manual so we could figure out what the binary decodes to. As x86 and x86_64 are notoriously difficult to decode correctly, we would either have to spend countless sleepless nights poring over Intel manuals (and still probably get it wrong sometimes) or trust someone else's decoder to get things right (and still probably get it wrong sometimes). Neither of these options is ideal, so we need to find another approach.

Hand-optimizing the tcc code generator

Because tcc directly outputs object code, we could take an approach where we read through the code generator spotting object code generation that is less than ideal and teaching tcc better code to generate instead. To start out, I am only going to worry about x86_64 code generation. There are code generators for x86, arm, aarch64, riscv64, and TMS320C67x as well. Each code generator is its own file and there is no intermediate code so there are no IL-based optimizations that would benefit all the code generators. Nevertheless, I suspect there is sufficient usage of the x86_64 code generator to make this a good starting point.

The x86_64 code generator can be found here. We are also going to use a very useful tool to decode the object code that tcc generates: objdump. This utility is from the GNU binutils so chances are good you already have it installed on your system or can easily get it on your system. On my OpenBSD system, I have a little script that every night grabs the latest git HEAD of binutils and builds and installs it, so on my machine I am using GNU objdump (GNU Binutils) but I am willing to bet that even old versions of objdump will work fine for today.

What we will do is build some simple sample programs with tcc and read the disassembled output from objdump in order to find opportunities to teach the code generator how to do things better.

So when we talk about optimizing the code generator today, we're not necessarily talking about making the code generator itself smaller or faster. We're talking about making it smarter. If we're lucky, even though we will add some code to make it smarter, the improved code generator will be run on itself, and the optimization gains will outweigh the code we add.

With that said, there is an overarching concern to keep in mind: the overall goal of TCC is to be fast. It is about an order of magnitude faster than gcc. Optimizations can be expensive to process and get right, and I want to avoid making TCC take longer to compile code. Hopefully, our work today can make TCC both faster and produce better code. If not, we should at least not make TCC slower.

Reading Hello World

Let's start with the simplest C program. Even simpler than Hello World. How about the true program? How about something even smaller:

	int
	main(void)
	{
	}
That's it. That's the simplest complete C program I know how to write. Let's compile it with tcc and read the disassembly with objdump:

/home/brian $ tcc -c true.c
/home/brian $ objdump -d true.o

true.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <main>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   48 81 ec 00 00 00 00    sub    $0x0,%rsp
   b:   b8 00 00 00 00          mov    $0x0,%eax
  10:   c9                      leave
  11:   c3                      ret

Our old friend movl $0, %eax is back. We know what to do with that, and now we can even see why it is we'd want to make the replacement: movl $0, %eax is five bytes. For comparison, xorl %eax, %eax is only two bytes (0x31 0xc0), so we get a 60% reduction in code size for this optimization. It is also recommended by Intel for zeroing a register.

As we learned when developing O, optimizing all instances of mov $0 to xor consistently reduced binary sizes by 2% to 3%. Let's teach tcc how to make this improvement.

Wading through the code generator

The code generator (and most of TCC) is a bit opaque in its coding. Lots of one-letter and two-letter variables and function names. Fortunately for us, some comments make it abundantly clear where we should focus our attention: there's a comment that clearly says /* mov $xx, r */ and that's the exact idiom we are looking for. We want to teach the code generator about a special case where if the immediate is zero, we can optimize this instruction to an xor. There's separate code for 32-bit and 64-bit cases. Let's start with the 32-bit case.

The 32-bit case code can be found here. Let's imagine that we don't know what the orex() function does. There is still some information we can glean in this block. The next line appears to emit a 32-bit immediate. It seems like the variable fc represents the 32-bit immediate. So that must be the value for us to check. If fc is zero, then we should apply the optimization. Even if we still don't know what the orex() function does, one of its arguments is REG_VALUE(r), and its value is added to 0xb8 to form a mov opcode, so it appears to be a register modifier.

Let's see what xorl %eax, %eax encodes to. Yes, I know I already spoiled it earlier in this post, but let's confirm things. We can write a new assembly file named xor.s and inside put:

	xorl %eax, %eax
My version of the GNU assembler is GNU assembler (GNU Binutils). I have seen examples where newer versions of GNU as select better encodings for certain instructions than older versions, but I don't think that will matter much today. When I run as -o xor.o xor.s and then run objdump -d xor.o, here is the output I see:

xor.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <.text>:
   0:   31 c0                   xor    %eax,%eax

Confirmed. We want to turn the binary pattern b8 00 00 00 00 into 31 c0. At least, that will turn movl $0, %eax into xorl %eax, %eax. It will be slightly different for the other registers. How different? Let's find out.

Generalizing the pattern

One of the things to notice is that TCC does not use all the available registers. Looking at the list, it seems like TCC avoids using registers that the System V AMD64 ABI says are callee-saved registers because using callee-saved registers almost certainly complicates things. So we only have rax, rcx, rdx, rsi, rdi, r8, r9, r10, and r11. While rsp is in that enum, it is a callee-saved register; I'm pretty sure it's only in this enum to deal with stack frame setup and teardown. Let's not worry about the xmm registers; those are for SSE instructions. The st0 register is for the floating-point unit.

So let's follow the order in this enum and edit our xor.s test file to add additional xor lines for each of these registers we have available for us to use:

	xorl %eax, %eax
	xorl %ecx, %ecx
	xorl %edx, %edx
	xorl %esp, %esp
	xorl %esi, %esi
	xorl %edi, %edi
	xorl %r8d, %r8d
	xorl %r9d, %r9d
	xorl %r10d, %r10d
	xorl %r11d, %r11d

That assembly produces this output:

xor.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <.text>:
   0:   31 c0                   xor    %eax,%eax
   2:   31 c9                   xor    %ecx,%ecx
   4:   31 d2                   xor    %edx,%edx
   6:   31 e4                   xor    %esp,%esp
   8:   31 f6                   xor    %esi,%esi
   a:   31 ff                   xor    %edi,%edi
   c:   45 31 c0                xor    %r8d,%r8d
   f:   45 31 c9                xor    %r9d,%r9d
  12:   45 31 d2                xor    %r10d,%r10d
  15:   45 31 db                xor    %r11d,%r11d

Turns out that the r8-r11 registers require an extra byte (the 0x45 REX prefix) to encode. That's still smaller than a mov into those registers. However, I've already done a good bit of testing, and it turns out that getting that far down the register list is so rare that I've never seen it happen. So we're just not going to bother with the three-byte encodings.

For the remaining six registers, the difference between each opcode is nine (c9-c0, d2-c9, ff-f6)—remember that we're missing rbx and rbp and those fill in the missing steps. The enum we previously saw does account for the missing registers. Let's put a pin in this for now.

Adding our first optimization

Armed with this new knowledge, let's improve the 32-bit case. We want to check to see if the fc variable is zero and if it is, and the register we are working with is one of the registers that encodes to a two byte xor, then we should output code to xor the register with itself. Otherwise, we should do what we've been doing: mov the immediate into the register.

In code form, that would look like this:

            if (fc == 0 && r < 8) {
                o(0x31); /* xor r, r */
                o(0xc0 + REG_VALUE(r) * 9);
            } else {
                orex(0,r,0, 0xb8 + REG_VALUE(r)); /* mov $xx, r */

That's it. That should improve the code generator for 32-bit mov. Note that we use REG_VALUE(r) * 9 because we learned that each xor of a register with itself is nine opcode values away from the next one. What's in between? The encodings for xoring a register with a different register.

Improving 64-bit mov

We can follow the exact same pattern for 64-bit mov optimization. Note here that TCC is using sv->c.i to store the value of the immediate. Even though this is a 64-bit optimization, we are going to want to use the 32-bit xor. This is because 32-bit operands generate a 32-bit result, zero-extended to a 64-bit result in the destination general-purpose register. Or, a bit more succinctly, 32-bit operations clear the upper half of the 64-bit destination register. It takes three bytes to encode a 64-bit xor but since the 32-bit xor will clear the upper half of the register anyway, we can save a byte and get the same result. Here's the code:

            if (sv->c.i == 0 && r < 8) {
                o(0x31); /* xor r, r */
                o(0xc0 + REG_VALUE(r) * 9);
            } else {
                orex(1,r,0, 0xb8 + REG_VALUE(r)); /* mov $xx, r */

Was it really that easy? Yes, yes it was.


An additional optimization location

Further down the code generator, there is a location where a hardcoded mov immediate into %eax is taking place. If that immediate is zero, then this is yet another location where we need to teach TCC about our xor optimization. It took me a while to notice it at first, and I only noticed it because subsequent runs of objdump told me I wasn't making all the optimizations I thought I should be.

Here is the improvement:

    if (vtop->type.ref->f.func_type != FUNC_NEW) { /* implies FUNC_OLD or FUNC_ELLIPSIS */
        if (nb_sse_args == 0)
            o(0xc031); /* xor %eax, %eax */
        else
            oad(0xb8, nb_sse_args < 8 ? nb_sse_args : 8); /* mov nb_sse_args, %eax */


While we can congratulate ourselves on improving the code generator, we don't actually know we have made an improvement until we test things out. I want to test two different things: first, if we are producing smaller binaries; and second, if we are producing faster binaries. Again, I don't want to slow tcc down. I am quite confident we will produce smaller binaries. But if tcc takes longer to compile code, I'm not sure it would be worth it.

First, I ran the TCC test suite to make sure I didn't break anything. Everything passed. So far, so good.

Next, I calculated the difference in the .text section of TCC binaries, libraries, and object files before and after applying this diff. Here are the numbers I saw:

Binary      Old size   New size   Difference in bytes   Percent size reduction
tcc           328786     321358                  7428                    2.26%
libtcc.a      307288     300252                  7036                    2.29%
bcheck.o       23254      22801                   453                    1.95%
bt-exe.o        4732       4550                   182                    3.85%
bt-log.o         648        639                     9                    1.39%
libtcc1.a      12678      12119                   559                    4.41%

That's pretty excellent. Even though we added code to the compiler, we still walked away with a pretty significant size reduction. That's a big win.

I also did a battery of build time tests and could find no statistically significant changes in compile time. In fact, the first time I ran the TCC test suite with the improved compiler, I was surprised to see that the times for that run were lower than they were with the old TCC. But it's not statistically significant as far as I can tell. This isn't all that unexpected as we really only added five truthiness checks, and those checks don't even run every time the code generator produces a new instruction.

Another TCC user posted some build times and .text sizes for SQLite3: a 2.60% reduction in binary size and no difference in build times (though there isn't quite enough data in the email to make a statement about the statistical significance of the build times).

I'm going to call this a win. Smaller binaries and no difference in compile times. This is an optimization worth sharing with the rest of the TCC community.

Work continues...

I posted the complete diff to the TCC mailing list. Other than a drive-by commit I made as part of my role as OpenBSD port maintainer of TCC to support the riscv64 arch, I haven't personally made any other commits to TCC, though I did author the original diff to finish TCC linker support on OpenBSD. I'm not exactly sure what the procedure is for committing something that is going to affect everyone to the mob branch. So I'll just wait until I get a go-ahead. Until then, you can grab the patch from the mailing list. I suspect I will commit it soon.

I hope you enjoyed coming on this journey with me to hand-optimize the TCC x86_64 code generator. There are certainly more such improvements in the x86_64 code generator and all the other code generators. Let's find those opportunities and teach them to TCC.