Dr. Brian Robert Callahan
academic, developer, with an eye towards a brighter techno-social life
All source code for this blog post can be found here.
One of the things we did not tackle when writing our PL/0 compiler was an optimizer. Our code generator output C and then we let our C compiler handle the rest. If we wanted optimizations, we could turn them on in the C compiler. That's great but it didn't help us learn anything about optimizations.
I test a lot of C compilers when testing oksh. One of those compilers is called cproc. It's a C11 compiler that compiles C into QBE and then has QBE transform its intermediate language into assembly, which then follows the usual pipeline of being run through the system assembler and ultimately the system linker.
We saw QBE before when we wrote our brainfuck compiler in QBE IL. When observing how QBE transforms its IL into assembly, we noticed that it missed a chance at optimization in the case where you moved a zero into a register. The generated assembly used a mov instruction to move an immediate of zero into the register. But the fastest and smallest way to put a zero into a register (at least, on amd64 machines) is to xor the register with itself.
I noticed the latest cproc, which uses the latest QBE, still fails to make this optimization. At one point in time, the author of cproc sent me a diff for QBE that did make the optimization but I guess it was never applied to the QBE repository. As the savings for this optimization really can add up over a large project, and since it gave us an opportunity to look at peephole optimizers, I decided to write a small program that could be run in the cproc pipeline in between QBE and the system assembler to fix up this optimization. Maybe we could find more optimizations as well.
The simplest type of optimizer that we can write without having to dive into the vast literature on compiler optimizations is a peephole optimizer. In this technique, we read in some number of lines of assembly and perform basic pattern matching on those lines looking for instructions that we know can be replaced with better ones (or entirely eliminated). "Better" in this case brings up the first interesting question about optimization: better can be contextual based on your goals. Do you want instructions that take less time to execute? Do you want instructions that require fewer bytes to encode, making for a smaller program? Something else? What is faster may not be smaller. What is smaller may not be faster. For us with our first optimization, we are lucky that the optimized version is both faster and smaller.
The peephole, or window, for our first optimization is just one line. There are some other peephole optimizations that can be performed with a peephole of just one line: removing useless instructions, such as moving a register into itself or adding zero to a register, immediately come to mind.
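One of those single-line cleanups can be sketched quickly. The helper below is a hypothetical illustration written for this post (it is not part of O): it detects a useless register self-copy such as movq %rax, %rax, assuming the exact tab-and-space layout that QBE emits, so a peephole pass could drop the line outright.

```c
#include <string.h>

/* Hypothetical helper: return 1 if the line is a useless self-copy
 * like "\tmovq %rax, %rax\n", assuming QBE's exact spacing:
 * tab, mnemonic, space, source, comma, space, destination, newline. */
static int
self_move(const char *line)
{
	const char *comma;
	size_t reglen;

	if (strncmp(line, "\tmovq ", 6) != 0 &&
	    strncmp(line, "\tmovl ", 6) != 0)
		return 0;

	/* Source runs from line+6 up to the comma. */
	comma = strchr(line + 6, ',');
	if (comma == NULL || comma[1] != ' ')
		return 0;
	reglen = (size_t)(comma - (line + 6));

	/* Destination follows ", " and must match the source exactly. */
	if (strncmp(line + 6, comma + 2, reglen) != 0)
		return 0;

	/* Destination must end right after (newline or end of string). */
	return comma[2 + reglen] == '\n' || comma[2 + reglen] == '\0';
}
```

A pass using this check would simply skip printing the matched line instead of rewriting it.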
After we've completed the mov to xor optimization, we'll tackle an additional optimization that requires a larger peephole to enable.
We will name our peephole optimizer O in honor of the C compiler flag (-O) that is commonly used to tell the C compiler to turn on its optimizer.
I am going to write O in C. Not because C is necessarily the best language for this task; it probably isn't. But because if we write O in C, then we can run cproc on O itself and determine the size savings on itself. I am also going to rely on direct knowledge of cproc to make our lives easier when figuring out what strings to match on. There is also a little bit of argc and argv handling with that knowledge so that O can be plugged into cproc when we are finished.
With all of that said, our main function looks like this:
static int
usage(void)
{
	(void) fputs("usage: O in.s [-o out.s]\n", stderr);

	return 1;
}

int
main(int argc, char *argv[])
{
	FILE *fp;

	if (argc == 4) {
		if (strcmp(argv[2], "-o") != 0)
			return usage();
		if (freopen(argv[3], "w+", stdout) == NULL) {
			(void) fprintf(stderr, "O: error: couldn't open %s\n", argv[3]);
			return 1;
		}
	} else if (argc != 2) {
		return usage();
	}

	if (!strcmp(argv[1], "-")) {
		O(stdin);
		return 0;
	}

	if ((fp = fopen(argv[1], "r")) == NULL) {
		(void) fprintf(stderr, "O: error: couldn't open %s\n", argv[1]);
		return 1;
	}

	O(fp);
	(void) fclose(fp);

	return 0;
}
Because of how cproc works, when the final stage of the compilation is to output assembly (-S), cproc automatically appends -o [file.s] so O needs to support that. If cproc is going to write an object file or binary, then O will write to stdout and the assembler will read the output in via a pipe. Additionally, input from cproc is always stdin, demarcated as -, so we need to support that too.
To avoid too much complexity, if we do have to output to a file, we can freopen(3) that file to stdout. Most of the time, we will be outputting to stdout so it makes sense to make outputting to a file the special case.
We will use the getline(3) function to read in one line at a time into a buffer. While the getline(3) function is "new" with POSIX.1-2008, I think all modern Unix systems have the function. If your system does not, you can get a copy of getline(3) here.
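For the curious, a minimal getline(3) stand-in can be sketched with fgetc(3). This is an illustrative replacement written for this post, not the copy linked above; it grows the buffer as needed and stops after a newline, matching the POSIX behavior we rely on:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Sketch of getline(3) for systems without POSIX.1-2008: read one
 * line (including the newline) into a growable buffer, return its
 * length, or -1 on end of file or allocation failure. */
static ssize_t
my_getline(char **buf, size_t *size, FILE *fp)
{
	size_t len = 0;
	int c = EOF;

	if (*buf == NULL || *size == 0) {
		*size = 128;
		if ((*buf = malloc(*size)) == NULL)
			return -1;
	}

	while ((c = fgetc(fp)) != EOF) {
		if (len + 2 > *size) {
			char *tmp = realloc(*buf, *size * 2);
			if (tmp == NULL)
				return -1;
			*buf = tmp;
			*size *= 2;
		}
		(*buf)[len++] = (char)c;
		if (c == '\n')
			break;
	}

	if (len == 0 && c == EOF)
		return -1;
	(*buf)[len] = '\0';

	return (ssize_t)len;
}
```

Like the real getline(3), the caller owns the buffer and frees it once at the end.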
Reading in one line at a time might look something like this:
static void
O(FILE *fp)
{
	char *line = NULL;
	size_t size = 0;

	while (getline(&line, &size, fp) != -1)
		(void) fputs(line, stdout);
	free(line);
}
Yes, you really do need the call to free(3) at the end there.
This would read in one line at a time and then immediately print it out. Of course, we need to perform a check to match for potential lines to replace, so let's improve this a bit:
static void
one(const char *line)
{
	if (xorq(line))
		return;

	if (xorl(line))
		return;

	(void) fputs(line, stdout);
}

static void
O(FILE *fp)
{
	char *line = NULL;
	size_t size = 0;

	while (getline(&line, &size, fp) != -1)
		one(line);
	free(line);
}
Now we are reading in one line at a time, and checking if it's a movq line that can be replaced with an xorq line. If not, perhaps it is a movl line that can be replaced with an xorl line. If not, then it is not a line that can be optimized this way so we should just print it out as-is.
Because we know exactly what QBE is going to output, we can use that to our advantage. We don't need to tokenize the output; we can simply check for exact strings. That saves us a lot of logic. A movq line that we want to match will always look like this:
movq $0, %r__
Where __ reflects any of the 64-bit registers. Therefore, we can match exactly that and since that's the only form this line can be in, it is the only thing we need to check for. We want to replace that movq line with this:
xorq %r__, %r__
If we were to write that as a function, it might look like this:
static int
xorq(const char *line)
{
	if (!strncmp("\tmovq $0, %r", line, 12)) {
		(void) fprintf(stdout, "\txorq %%r%c%c, %%r%c%c\n",
		    line[12], line[13], line[12], line[13]);
		return 1;
	}

	return 0;
}
If you wanted to be paranoid, you could check that strlen(line) > 13 before calling fprintf.
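Folding that check in might look like the following sketch; xorq_safe is a hypothetical defensive variant written for illustration, not the version used in O:

```c
#include <stdio.h>
#include <string.h>

/* Defensive variant of xorq: confirm the line is long enough before
 * indexing line[13], then emit the equivalent xorq line. */
static int
xorq_safe(const char *line)
{
	if (strncmp("\tmovq $0, %r", line, 12) == 0 && strlen(line) > 13) {
		(void) fprintf(stdout, "\txorq %%r%c%c, %%r%c%c\n",
		    line[12], line[13], line[12], line[13]);
		return 1;
	}

	return 0;
}
```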
We check to see if the first twelve characters of the line match a movq line that would be better if it was optimized into a xorq line. If it does, we instead write the equivalent xorq line. We are able to use the fact that we can learn the target register directly from the movq line and use that in the rewritten xorq line.
We can use the same strategy for optimizing movl lines into xorl lines. Here, we want to transform:
movl $0, %e__
Into:
xorl %e__, %e__
In code, that looks like this:
static int
xorl(const char *line)
{
	if (!strncmp("\tmovl $0, %e", line, 12)) {
		(void) fprintf(stdout, "\txorl %%e%c%c, %%e%c%c\n",
		    line[12], line[13], line[12], line[13]);
		return 1;
	}

	return 0;
}
We have written a peephole optimizer! It is guaranteed to improve the code that QBE produces.
But we can do better. The xorq function doesn't account for %r8 or %r9 while the xorl function doesn't account for any of the %r8d-%r15d registers. They'll need special tweaks, but it is nothing too difficult:
static int
xorq(struct peephole *window)
{
	char buf[32], r1a, r1b;

	if (window->line1 == NULL)
		return 0;

	if (!strncmp("\tmovq $0, %r", window->line1, 12)) {
		if (strlen(window->line1) < 14)
			return 0;
		r1a = window->line1[12];
		r1b = window->line1[13];
		if (r1b == '\n')
			r1b = ' ';
		(void) snprintf(buf, sizeof(buf), "\txorq %%r%c%c, %%r%c%c\n",
		    r1a, r1b, r1a, r1b);
		free(window->line1);
		window->line1 = xstrdup(buf);
		return 1;
	}

	return 0;
}

static int
xorl(struct peephole *window)
{
	char buf[32], e1a, e1b;

	if (window->line1 == NULL)
		return 0;

	if (!strncmp("\tmovl $0, %e", window->line1, 12)) {
		if (strlen(window->line1) != 15)
			return 0;
		e1a = window->line1[12];
		e1b = window->line1[13];
		(void) snprintf(buf, sizeof(buf), "\txorl %%e%c%c, %%e%c%c\n",
		    e1a, e1b, e1a, e1b);
		free(window->line1);
		window->line1 = xstrdup(buf);
		return 1;
	} else if (!strncmp("\tmovl $0, %r", window->line1, 12)) {
		if (strlen(window->line1) < 14)
			return 0;
		e1a = window->line1[12];
		e1b = window->line1[13];
		if (e1b == 'd') {
			(void) snprintf(buf, sizeof(buf),
			    "\txorl %%r%cd, %%r%cd\n", e1a, e1a);
		} else {
			(void) snprintf(buf, sizeof(buf),
			    "\txorl %%r%c%cd, %%r%c%cd\n", e1a, e1b, e1a, e1b);
		}
		free(window->line1);
		window->line1 = xstrdup(buf);
		return 1;
	}

	return 0;
}
It is really important that we make these tweaks, since QBE really enjoys using the %r8d register. That way, we can benefit from this optimization in those cases.
While we are not yet done with our peephole optimizer, since we will add another optimization, it would be good to test it out. We can test it out on the command line by running O -. Here are the results when I type in some assembly directly:
/home/brian $ O -
	movq $0, %rax
	xorq %rax, %rax
	movl $0, %ebx
	xorl %ebx, %ebx
	movq $1, %rax
	movq $1, %rax
I typed in the odd-numbered lines and O responded with the even-numbered lines. Don't forget that you need to type in the leading tab, since O requires the leading tab to make a successful match. We can see that right now the optimizer successfully replaces lines that match our logic and, importantly, successfully does not replace lines that do not match.
Plugging O into cproc
We can execute a much bigger test by plugging O into the cproc pipeline. We should immediately begin to see improvements to the code we compile. And, since my OpenBSD port of cproc does the whole three stage self-hosting compile, we will immediately learn if cproc can handle this improvement to itself. I'm betting that it can, but it's always better to be sure.
There are two files that need to be patched. First, we need to patch the configure script and teach it about O, since that will place a needed data structure in a header file to be used later:
Index: configure
--- configure.orig
+++ configure
@@ -159,6 +159,7 @@
 static const char *const preprocesscmd[] = {
 	"-D", "__extension__=",
 	$defines};
 static const char *const codegencmd[] = {"$DEFAULT_QBE"};
+static const char *const optimizecmd[] = {"O"};
 static const char *const assemblecmd[] = {"$DEFAULT_ASSEMBLER"};
 static const char *const linkcmd[] = {"$DEFAULT_LINKER", $linkflags};
 EOF
Second, we need to patch driver.c to teach the compiler driver about this new stage in the compilation pipeline:
Index: driver.c
--- driver.c.orig
+++ driver.c
@@ -32,6 +32,7 @@ enum stage {
 	PREPROCESS,
 	COMPILE,
 	CODEGEN,
+	OPTIMIZE,
 	ASSEMBLE,
 	LINK,
 };
@@ -60,6 +61,7 @@ static struct stageinfo stages[] = {
 	[PREPROCESS] = {.name = "preprocess"},
 	[COMPILE]    = {.name = "compile"},
 	[CODEGEN]    = {.name = "codegen"},
+	[OPTIMIZE]   = {.name = "optimize"},
 	[ASSEMBLE]   = {.name = "assemble"},
 	[LINK]       = {.name = "link"},
 };
@@ -381,6 +383,7 @@ main(int argc, char *argv[])
 	arrayaddbuf(&stages[PREPROCESS].cmd, preprocesscmd, sizeof(preprocesscmd));
 	arrayaddptr(&stages[COMPILE].cmd, compilecommand(argv[0]));
 	arrayaddbuf(&stages[CODEGEN].cmd, codegencmd, sizeof(codegencmd));
+	arrayaddbuf(&stages[OPTIMIZE].cmd, optimizecmd, sizeof(optimizecmd));
 	arrayaddbuf(&stages[ASSEMBLE].cmd, assemblecmd, sizeof(assemblecmd));
 	arrayaddbuf(&stages[LINK].cmd, linkcmd, sizeof(linkcmd));
@@ -400,6 +403,7 @@ main(int argc, char *argv[])
 		arrayaddptr(&stages[COMPILE].cmd, arch);
 		arrayaddptr(&stages[CODEGEN].cmd, "-t");
 		arrayaddptr(&stages[CODEGEN].cmd, qbearch);
+		arrayaddptr(&stages[OPTIMIZE].cmd, "-");
 	for (;;) {
 		++argv, --argc;
@@ -414,10 +418,10 @@ main(int argc, char *argv[])
 		switch (input->filetype) {
 		case ASM:    input->stages = 1<<ASSEMBLE|1<<LINK; break;
 		case ASMPP:  input->stages = 1<<PREPROCESS|1<<ASSEMBLE|1<<LINK; break;
-		case C:      input->stages = 1<<PREPROCESS|1<<COMPILE|1<<CODEGEN|1<<ASSEMBLE|1<<LINK; break;
+		case C:      input->stages = 1<<PREPROCESS|1<<COMPILE|1<<CODEGEN|1<<OPTIMIZE|1<<ASSEMBLE|1<<LINK; break;
 		case CHDR:   input->stages = 1<<PREPROCESS; break;
-		case CPPOUT: input->stages = 1<<COMPILE|1<<CODEGEN|1<<ASSEMBLE|1<<LINK; break;
-		case QBE:    input->stages = 1<<CODEGEN|1<<ASSEMBLE|1<<LINK; break;
+		case CPPOUT: input->stages = 1<<COMPILE|1<<CODEGEN|1<<OPTIMIZE|1<<ASSEMBLE|1<<LINK; break;
+		case QBE:    input->stages = 1<<CODEGEN|1<<OPTIMIZE|1<<ASSEMBLE|1<<LINK; break;
 		case OBJ:    input->stages = 1<<LINK; break;
 		default:     usage("reading from standard input requires -x");
 		}
@@ -505,7 +509,7 @@ main(int argc, char *argv[])
 			arrayaddptr(&stages[PREPROCESS].cmd, "-P");
 			break;
 		case 'S':
-			last = CODEGEN;
+			last = OPTIMIZE;
 			break;
 		case 's':
 			arrayaddptr(&stages[LINK].cmd, "-s");
Perhaps not too surprisingly, adding these patches and running cproc through a three stage compile produced a working compiler.
It's a bit unfair to compare numbers for cproc itself, since we literally added code and therefore it is not a fair fight. But what we can do is test O itself, once with the original cproc and again with the new cproc.
Here is the output of size(1) on O.o compiled with the original cproc:
text	data	bss	dec	hex
975	138	0	1113	459
And again with the new cproc:
text	data	bss	dec	hex
948	138	0	1086	43e
That's a savings of 27 bytes, or about a 2.77% reduction in binary size. That's not a huge amount, but it's also not nothing.
For comparison, with clang -O2 we get this size:
text	data	bss	dec	hex
914	16	0	930	3a2
And with clang -Oz, that becomes:
text	data	bss	dec	hex
866	24	0	890	37a
But the big winner is gcc -Os (gcc (GCC) 12.0.1 20220310 (experimental)):
text	data	bss	dec	hex
713	0	0	713	2c9
So while QBE can't quite compete with the likes of GCC and Clang, it isn't designed to. It's designed to be fast and good enough, and it is. With O, QBE is a little bit better without sacrificing too much compilation speed. After all, one of the selling points of cproc is that compiles are much faster compared to GCC and Clang.
But O is a very small codebase. Let's try something larger. Let's try oksh.
Here are the size measurements of oksh built with the old cproc:
text	data	bss	dec	hex
278585	17363	29920	325868	4f8ec
And with the new cproc:
text	data	bss	dec	hex
269745	17363	29920	317028	4d664
That's a savings of 8840 bytes, or about a 3.17% reduction in binary size. Again, it's not huge but it's also not nothing. Remember too it's also a speed increase on top of a size decrease, so we win with this optimization on both levels. While I'm sure the speed increase isn't much more than negligible, over a large enough span of time it probably adds up.
How about something large, like SQLite? With the old cproc:
text	data	bss	dec	hex
1164826	90104	888	1255818	13298a
And with the new cproc:
text	data	bss	dec	hex
1132438	90104	888	1223430	12ab06
That's a savings of 32388 bytes, or about a 2.78% reduction in binary size. I'm starting to see a trend. We're consistently getting between 2% and 3% reduction in binary size with this one optimization.
When reading through assembly generated by cproc, I discovered a peculiar construct:
	movq %rdi, %r12
.Lbb28:
	movl %ebx, %edi
	movl %edi, %ebx
	movq %r12, %rdi
It looks like this piece of assembly wants to use %rdi as a temporary register for a register-to-register copy, so it is copying itself to %r12, performing the register-to-register copy, then returning to its original value. But there's an opportunity for an optimization for this particular register-to-register copy: all we're doing is moving %ebx into itself. We can say with certainty that the last three lines of this construct can be reduced to just the last line, since copying a register into itself via a temporary register and then immediately reassigning the temporary register is equivalent to simply assigning the final value to the temporary register and not bothering with the copy.
Interestingly, we cannot actually say with certainty if moving %rdi into %r12 and back again is useless. It could be the case that it is useless, in which case we might be able to remove all four lines of assembly, leaving just the label. But it also might be the case that the very next line of assembly is performing an operation on %r12 that is dependent on %r12 having previously gotten a copy of the value of %rdi. But there's also a label in the middle: so perhaps there is logic later where a value for the next iteration of a loop is placed in %r12 en route to %rdi. We can't know any of that from the context we have in this small five-line window.
That brings us to another interesting problem with peephole optimizers: we need to be fairly conservative in our selection of optimizations. We should only implement optimizations that we can be absolutely certain have no undesirable side effects. The elimination of the third and fourth lines in that five-line window above has no side effects so that is the optimization we will teach O.
First, we will need to slightly amend our O function:
static int
three(const char *line, FILE *fp)
{
	if (mov(line, fp) == 1)
		return 1;

	return 0;
}

static void
O(FILE *fp)
{
	char *line = NULL;
	size_t size = 0;

	while (getline(&line, &size, fp) != -1) {
		if (three(line, fp) == 1)
			break;
	}
	free(line);
}
The three function will call the one function when appropriate, so we will still benefit from our previous work.
We only have one optimization for a three-line peephole, which removes a useless register self-copy:
static int
mov(const char *line, FILE *fp)
{
	char *line2 = NULL, *line3 = NULL;
	size_t size = 0;

	if (strncmp("\tmovl %e", line, 8) != 0 &&
	    strncmp("\tmovq %r", line, 8) != 0) {
		one(line);
		return 0;
	}

	if (getline(&line2, &size, fp) == -1) {
		(void) fputs(line, stdout);
		return 1;
	}

	if (strncmp("\tmovl %e", line2, 8) != 0 &&
	    strncmp("\tmovq %r", line2, 8) != 0) {
		(void) fputs(line, stdout);
		one(line2);
		free(line2);
		return 0;
	}

	if (getline(&line3, &size, fp) == -1) {
		(void) fputs(line, stdout);
		(void) fputs(line2, stdout);
		free(line2);
		return 1;
	}

	if (strncmp("\tmovl", line3, 5) != 0 &&
	    strncmp("\tmovq", line3, 5) != 0) {
		(void) fputs(line, stdout);
		(void) fputs(line2, stdout);
		one(line3);
		free(line3);
		free(line2);
		return 0;
	}

	if (line[4] == line2[4] && line[7] == line2[13] &&
	    line[8] == line2[14] && line[9] == line2[15] &&
	    line[13] == line2[7] && line[14] == line2[8] &&
	    line[15] == line2[9] && line[14] == line3[14] &&
	    line[15] == line3[15]) {
		(void) fputs(line3, stdout);
	} else {
		(void) fputs(line, stdout);
		(void) fputs(line2, stdout);
		one(line3);
	}

	free(line3);
	free(line2);

	return 0;
}
There's a lot here. Let's break it down.
We first recognize if the first line we read is a mov with a 32-bit or 64-bit register as the source register. If not, then it definitely cannot be a candidate for this second optimization. However, it might still be a candidate for the first optimization we coded, so we should feed this line into the one function. Regardless of whether or not this line is a successful candidate for the first optimization, after this we are done with this peephole so we should return and start all over again with a new peephole.
You'll notice that we will return zero to say "we did not see the end of the file" and one to say "we have seen the end of the file."
Next is to realize that this optimization only works if we have at least three lines to interrogate. But we may be close to the end of the file; there may not be three lines left. We need to account for that. If there is no next line, then we are definitely done and we should simply print what we have. We know at this point the line we have is not a candidate for the first optimization, or else it would have entered into the previous if statement.
And we continue on. It is exactly the same logic for the second line: we check to make sure it is a line that continues candidacy for this second optimization; we then check to ensure there is a third line that we can use to finish checking for candidacy.
Once we have read in all three lines, we need to know the following:
- Lines one and two are the same type of mov (movl or movq)
- The source register of line one is the destination register of line two, and vice versa
- The destination register of line one is the same as the destination register of line three
The large if statement performs all these checks.
Something interesting is that we don't care if line three is a 32-bit movl or a 64-bit movq. Just so long as the destination register is the same in both line one and line three, that is enough to ensure the first two lines are useless. We do care that the type of mov is the same for lines one and two, because if they are not it is possible that there are side effects that would render those two lines not useless.
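The character comparisons in the big if statement can be exercised in isolation. This useless_swap helper is a hypothetical restatement of that condition, written for this post, assuming three-character register names as in the example above:

```c
#include <string.h>

/* Hypothetical predicate: given three QBE-style mov lines, decide
 * whether lines one and two are a pointless back-and-forth copy whose
 * destination line three immediately overwrites anyway. Register
 * characters live at offsets 7-9 (source) and 13-15 (destination);
 * only offsets 14-15 of the destinations are compared so that a movl
 * destination (%edi) can match a movq destination (%rdi). */
static int
useless_swap(const char *l1, const char *l2, const char *l3)
{
	return l1[4] == l2[4] &&		/* same width: movl/movq */
	    strncmp(l1 + 7, l2 + 13, 3) == 0 &&	/* src(1) == dst(2) */
	    strncmp(l1 + 13, l2 + 7, 3) == 0 &&	/* dst(1) == src(2) */
	    l1[14] == l3[14] && l1[15] == l3[15]; /* dst(1) == dst(3) */
}
```

When the predicate holds, only line three needs to be printed.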
This function also doesn't account for the fact that the 32-bit versions of %r12-%r15 are named %r12d-%r15d. I haven't seen any cproc code that uses those 32-bit registers in the type of construct that this optimization eliminates, but it is certainly possible. We'll probably have to write another function at some point that checks for this special case and acts accordingly.
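Such a function would first have to tell the two 32-bit naming schemes apart. A hypothetical classifier, written for this post rather than taken from O, might look like:

```c
#include <string.h>

/* Hypothetical classifier for 32-bit register spellings on amd64:
 * returns 1 for the %r8d-%r15d family, 0 for the %eax-style names,
 * and -1 for anything else (including 64-bit names like %rax). */
static int
reg32_family(const char *name)
{
	size_t len = strlen(name);

	if (len == 4 && name[0] == '%' && name[1] == 'e')
		return 0;			/* %eax, %ebx, ... */
	if ((len == 4 || len == 5) && name[0] == '%' && name[1] == 'r' &&
	    name[len - 1] == 'd')
		return 1;			/* %r8d ... %r15d */

	return -1;
}
```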
I was able to rebuild cproc with this new optimization enabled. It produced exactly the same binary as with the O that only knew the first optimization. All that means is the construct that this second optimization eliminates wasn't present in the cproc assembly. It also means that we didn't break anything by adding this second optimization. The codebase that did generate this peculiar assembly construct saw that the second optimization worked and the useless copy was successfully optimized away.
Astute readers may have noticed that our peephole optimizer, as good as it is, may fail to optimize constructs it knows how to optimize. Notably, O will fail to optimize this:
	movl %ecx, %eax
	movl %ebx, %edi
	movl %edi, %ebx
	movq %r12, %rdi
This is because the first three lines get read in, then we learn that this does not match the pattern we are looking for, so we print out all three lines and then read in the fourth line on its own, which doesn't match anything either.
It would be better if we printed out just the first line after discovering that the first three lines do not satisfy any optimization patterns, then check lines two through four for any optimization patterns; if we did that, we could optimize the last three lines.
It will take some additional coding and problem solving to be able to accomplish this goal. We will endeavor to do so in the next post.