Dr. Brian Robert Callahan
academic, developer, with an eye towards a brighter techno-social life
All source code for this blog post can be found here.
It's time to finish up our peephole optimizer, O
, that we started in the previous post. This peephole optimizer, designed to be plugged into cproc
and used to fix up QBE generated amd64 assembly, provides two optimizations that we found reduced binary size by between 2% and 3%: first, a one-line optimization converting mov_ $0, %___
into xor_ %___, %___
and a three-line optimization removing a useless register-to-register self-copy via a temporary register.
The optimizer as we've designed it reads in three lines from the original assembly code, then applies the three-line optimization first. If it succeeds, O
prints out the optimization and starts all over with the next three lines of the original assembly code. If it fails, it applies the one-line optimizations to each of the three lines in order, printing out the optimized version of each line if it succeeds or the original assembly code if it fails. In visual form, the optimizer works as follows:
+---------------+ |addq %rax, %rcx| addq %rax, %rcx addq %rax, %rcx |subq %rbx, %rdx| subq %rbx, %rdx subq %rbx, %rdx |movl $0, %edi | xorl %edi, %edi xorl %edi, %edi +---------------+ --> +---------------+ --> movl %ebx, %esi |movl %ebx, %esi| movl %eax, %esi movl %esi, %ebx |movl %esi, %ebx| movl %eax, %esi |movl %eax, %esi| +---------------+
Which is all well and good, until you realize this approach will fail to optimize in certain cases. Take this case, for example:
+---------------+ |movl %ecx, %edx| movl %ecx, %edx movl %ecx, %edx |movl %ebx, %esi| movl %ebx, %esi movl %ebx, %esi |movl %esi, %ebx| movl %esi, %ebx movl %esi, %ebx +---------------+ --> +---------------+ --> movl %eax, %esi |movl %eax, %esi| movl %eax, %esi +---------------+
But we know that lines two, three, and four can be optimized to just line four. So what we need is for O
to "slide" down the code one line at a time. Like this:
+---------------+ |movl %ecx, %edx| movl %ecx, %edx movl %ecx, %edx movl %ecx, %edx |movl %ebx, %esi| +---------------+ movl %eax, %esi movl %eax, %esi |movl %esi, %ebx| --> |movl %ebx, %esi| --> +---------------+ --> xorl %eax, %eax +---------------+ |movl %esi, %ebx| |movl $0, %eax | movl %eax, %esi |movl %eax, %esi| +---------------+ movl $0, %eax +---------------+ movl $0, %eax
Besides the obvious advantage of not missing out on any optimization opportunities, we can also feed the one-line result of the three-line optimization into the one-line optimizations, yielding the ability to perform multiple optimizations on a single window, like this:
+---------------+ +---------------+ |movl %ebx, %esi| |movl $0, %esi | |movl %esi, %ebx| --> | | --> xorl %esi, %esi |movl $0, %esi | | | +---------------+ +---------------+
We could even detect when the three-line optimization was successful and keep that optimized line in the window, fill the rest of the window, and check if that new window can also have the three-line optimization applied to it, like this:
+---------------+ +---------------+ +---------------+ +---------------+ |movl %ebx, %esi| |movl %eax, %esi| |movl %eax, %esi| |movl $0, %esi | |movl %esi, %ebx| --> | | --> |movl %esi, %eax| --> | | --> xorl %esi, %esi |movl %eax, %esi| | | |movl $0, %esi | | | +---------------+ +---------------+ +---------------+ +---------------+ movl %esi, %eax movl %esi, %eax movl $0, %esi movl $0, %esi
Or, in flow chart form:
+-----+ |Start| +-----+ | No +-+ +----------------------------------------------------------+ | | | V ------------- V -------------- +-----------+ / \ No +----------------------------+ +---------+ +------------+ / \ Yes +---+ |Fill window|-->|Is window full?|-->|Apply one-line optimizations|-->|Emit code|-->|Shift window|-->|Is window empty?|--->|End| +-----------+ \ / +----------------------------+ +---------+ +------------+ \ / +---+ ^ ------------- -------------- | | | Yes| +------------------------------+ +----------------------------+ | +->|Apply three-line optimizations|-->|Apply one-line optimizations| | +------------------------------+ +----------------------------+ | | | +------------+ +---------+ | +---|Shift window|<--|Emit code|<--------------------------------------+ +------------+ +---------+
Very cool! And beginning to look like a real peephole optimizer. We could take this framework and apply any number of optimizations that work within this three-line peephole. We could even grow the size of the peephole if we wanted to attempt optimizations that require more context.
And we can also add two-line optimizations to be run between the three-line and one-line optimizations. I have not yet found any optimization opportunities that QBE misses that require looking at only two lines of assembly. But it would be easy to plug in such optimizations into the framework we're designing.
We are going to need a new data structure to represent the peephole that the optimizer will examine. That's the three lines of the original assembly that optimizations will be applied to. It is incredibly straightforward:
struct peephole { char *line1; char *line2; char *line3; };
And that's it. Now what we need to do is fill that window with assembly code. I'm going to start by modifying the O
function:
static void O(FILE *fp) { struct peephole window; window.line1 = NULL; window.line2 = NULL; window.line3 = NULL; while (fillwindow(&window, fp)) { three(&window); one(&window); (void) fputs(window.line1, stdout); shiftwindow(&window); } }
Yes, it appears that we are going to be changing at least the function signatures of all the functions we wrote in the previous post. Don't worry; it's not too onerous. Plus we'll learn a lot along the way.
For now, let's write the fillwindow
and shiftwindow
functions:
static char * xstrdup(const char *s) { char *p; if (s == NULL) return NULL; if ((p = strdup(s)) == NULL) { (void) fputs("O: error: xstrdup failed\n", stderr); exit(1); } return p; } static int fillwindow(struct peephole *window, FILE *fp) { size_t size = 0; if (window->line1 == NULL) { if (getline(&window->line1, &size, fp) == -1) return 0; } if (window->line2 == NULL) { if (getline(&window->line2, &size, fp) == -1) return 0; } if (getline(&window->line3, &size, fp) == -1) return 0; return 1; } static void shiftwindow(struct peephole *window) { free(window->line1); window->line1 = xstrdup(window->line2); free(window->line2); window->line2 = xstrdup(window->line3); }We also had to write a helper function,
xstrdup
, in order to copy over strings during the shiftwindow
function. The strdup
function will cause the program to crash if the input string is NULL
. We want to track when the window is not completely full, as that tells us that we are nearing the end of the original assembly code, so we will return NULL
if that happens, avoiding the crash. It is OK if you free(NULL)
.
It's important to note that an empty line in our original assembly does not cause the string read in by getline
to be NULL
: according to the manual page, the buffer returned is both NUL
-terminated and includes the delimiter. That means an empty line in the original assembly will be represented as "\n"
which is completely fine for the framework we're designing. It simultaneously communicates that this is not the end of the assembly code and it is almost certainly not a candidate for any optimizations.
The fillwindow
function fills in the window, but only lines that actually need filling. We do not want to replace lines that already exist in the window and still need to be processed, as that means we might miss out on an optimization opportunity.
We return zero if getline
returns -1
, which indicates we are at the end of the original assembly code. And we stop filling in the window if we have reached the end of the original assembly, since there's little reason to try to fill in the window with content we know is not there.
So long as our peephole is full, we can perform all optimizations. I think it makes sense to try the three-line optimization first and then try the one-line optimizations on line one of the peephole. If we had optimizations that operated on two lines of assembly, we would probably stick those optimizations in the middle.
But what if the window is not full? If it is not full, then it is possible we still have one or two lines in the window that we have not made optimization attempts on. We should still run the one-line optimizations on them. So we need to amend the O
function slightly:
static void O(FILE *fp) { struct peephole window; window.line1 = NULL; window.line2 = NULL; window.line3 = NULL; while (fillwindow(&window, fp)) { three(&window); one(&window); (void) fputs(window.line1, stdout); shiftwindow(&window); } free(window.line3); window.line3 = NULL; while (window.line1 != NULL) { one(&window); (void) fputs(window.line1, stdout); shiftwindow(&window); } }
If we run out of lines to read, then we know that at least the last line of the window did not have code read into it, so we can free
it and set it to NULL
. Depending on the results of the final fillwindow
call that found the end of the original assembly code, we may or may not have code in lines one and two. But if we do, both of those lines should still try to have optimizations applied to them. The final loop ensures that we stop processing only after we have actually processed all lines of the original assembly and our window is empty.
We have two one-line optimization passes: the first to convert movq $0, %r__
into xorq %r__, %r__
and the second to turn movl $0, %e__
into xorl %e__, %e__
. Let's improve them to work with our new peephole data structure:
static int xorq(struct peephole *window) { char buf[32], r1a, r1b; if (window->line1 == NULL) return 0; if (!strncmp("\tmovq $0, %r", window->line1, 12)) { if (strlen(window->line1) < 14) return 0; r1a = window->line1[12]; r1b = window->line1[13]; if (r1b == '\n') r1b = ' '; (void) snprintf(buf, sizeof(buf), "\txorq %%r%c%c, %%r%c%c\n", r1a, r1b, r1a, r1b); free(window->line1); window->line1 = xstrdup(buf); return 1; } return 0; } static int xorl(struct peephole *window) { char buf[32], e1a, e1b; if (window->line1 == NULL) return 0; if (!strncmp("\tmovl $0, %e", window->line1, 12)) { if (strlen(window->line1) != 15) return 0; e1a = window->line1[12]; e1b = window->line1[13]; (void) snprintf(buf, sizeof(buf), "\txorl %%e%c%c, %%e%c%c\n", e1a, e1b, e1a, e1b); free(window->line1); window->line1 = xstrdup(buf); return 1; } else if (!strncmp("\tmovl $0, %r", window->line1, 12)) { if (strlen(window->line1) < 14) return 0; e1a = window->line1[12]; e1b = window->line1[13]; if (e1b == 'd') { (void) snprintf(buf, sizeof(buf), "\txorl %%r%cd, %%r%cd\n", e1a, e1a); } else { (void) snprintf(buf, sizeof(buf), "\txorl %%r%c%cd, %%r%c%cd\n", e1a, e1b, e1a, e1b); } free(window->line1); window->line1 = xstrdup(buf); return 1; } return 0; } static void one(struct peephole *window) { if (xorq(window)) return; if (xorl(window)) return; }
This is not too different compared to what we had before. The major difference is that we write the new optimized line into a temporary buffer before copying it into the window.
This one is a little more complicated than the one-line passes. While here, I am also going to improve the optimization to include the possibility that the last line is an immediate-to-register copy:
static int mov(struct peephole *window) { if (window->line1 == NULL || window->line2 == NULL || window->line3 == NULL) { return 0; } if (strncmp("\tmovl %e", window->line1, 8) != 0 && strncmp("\tmovq %r", window->line1, 8) != 0) { return 0; } if (strncmp("\tmovl %e", window->line2, 8) != 0 && strncmp("\tmovq %r", window->line2, 8) != 0) { return 0; } if (strncmp("\tmovl %e", window->line3, 8) != 0 && strncmp("\tmovq %r", window->line3, 8) != 0 && strncmp("\tmovl $0, %e", window->line3, 12) != 0 && strncmp("\tmovq $0, %r", window->line3, 12) != 0) { return 0; } if (window->line1[4] == window->line2[4] && window->line1[7] == window->line2[13] && window->line1[8] == window->line2[14] && window->line1[9] == window->line2[15] && window->line1[13] == window->line2[7] && window->line1[14] == window->line2[8] && window->line1[15] == window->line2[9] && window->line1[14] == window->line3[strlen(window->line3) - 3] && window->line1[15] == window->line3[strlen(window->line3) - 2]) { free(window->line2); window->line2 = NULL; free(window->line1); window->line1 = xstrdup(window->line3); free(window->line3); window->line3 = NULL; return 1; } return 0; } static int three(struct peephole *window) { return mov(window); }
We first want to make sure the window is full. Because if it is not, there is no reason to even attempt this optimization. Next is the same routine we had previously where we make sure that each of the three lines in the window matches what we would expect as candidates for the optimization. You'll notice too we modified the last check to also accept a final line that is an immediate-to-register copy or a register-to-register copy.
If we have an opportunity for optimizattion, all we need to do is free
lines one and two, set line two to NULL
, copy line three into line one, and finally free
line three and set it to NULL
.
As a reminder, we are using the fact that we already know the format in which QBE writes assembly code to make our lives much easier; we don't need to worry about things like tokenization. If you don't know what QBE-generated assembly looks like, run cproc -S O.c
and you'll see QBE-generated assembly in O.s
.
As it stands, O
can make all its optimizations and it can even perform multiple optimizations on a single window. But what it can't do is take the output of the three-line optimization, save the resulting one line in the window, fill the rest of the window, and then try to perform the three-line optimization again on the new window. We should enable that. It's relatively easy to do so. We only need to tweak the O
function one last time:
static void O(FILE *fp) { struct peephole window; int ret; window.line1 = NULL; window.line2 = NULL; window.line3 = NULL; while (fillwindow(&window, fp)) { again: ret = three(&window); one(&window); if (ret == 1) { if (fillwindow(&window, fp)) goto again; } (void) fputs(window.line1, stdout); shiftwindow(&window); } free(window.line3); window.line3 = NULL; while (window.line1 != NULL) { one(&window); (void) fputs(window.line1, stdout); shiftwindow(&window); } }
This is why we have the three
function return an int
: we want to know when this optimizaton was successful precisely because we want to be able to know if we can chain together multiple three-line optimizations. If for some reason we can't chain three-line optimizations because there isn't enough original assembly code to do so, that's fine; we can just continue as normal and things will work out OK.
You can see the completed version of the peephole optimizer we wrote in these two blog posts here. There may be additional optimizations added in the future, so make sure to check out the main
branch from time to time as well.
I hope you enjoyed this mini-series on how to write peephole optimizers. While they are little more than simple pattern matching programs, as we saw they can help fix up code and enable optimizations that more powerful techniques can sometimes miss.
If you'd like to try implementing some more one-line optimizations to test your skills, some other optimizations that are great for peephole optimizers are elimination of direct register-to-register self-copies (e.g., removing movq %rax, %rax
), elimination of addition by zero (e.g., removing addq $0, %rax
), and replacement of multiplication by a positive power of two with left arithmetic shift (e.g., replacing imull $8, %eax
with sall $3, %eax
).
If you find additional optimizations that QBE misses, feel free to open a Pull Request (and let QBE upstream know about the potential optimizations too!).