Dr. Brian Robert Callahan

academic, developer, with an eye towards a brighter techno-social life

2022-04-02
I wrote a simple, explainable, peephole optimizer (for cproc and QBE), part 2

All source code for this blog post can be found here.

It's time to finish up our peephole optimizer, O, that we started in the previous post. This peephole optimizer, designed to be plugged into cproc and used to fix up QBE generated amd64 assembly, provides two optimizations that we found reduced binary size by between 2% and 3%: first, a one-line optimization converting mov_ $0, %___ into xor_ %___, %___ and a three-line optimization removing a useless register-to-register self-copy via a temporary register.

The optimizer as we've designed it reads in three lines from the original assembly code, then applies the three-line optimization first. If it succeeds, O prints out the optimization and starts all over with the next three lines of the original assembly code. If it fails, it applies the one-line optimizations to each of the three lines in order, printing out the optimized version of each line if it succeeds or the original assembly code if it fails. In visual form, the optimizer works as follows:

+---------------+
|addq %rax, %rcx|      addq %rax, %rcx      addq %rax, %rcx
|subq %rbx, %rdx|      subq %rbx, %rdx      subq %rbx, %rdx
|movl $0, %edi  |      xorl %edi, %edi      xorl %edi, %edi
+---------------+ --> +---------------+ -->
 movl %ebx, %esi      |movl %ebx, %esi|     movl %eax, %esi
 movl %esi, %ebx      |movl %esi, %ebx|
 movl %eax, %esi      |movl %eax, %esi|
                      +---------------+

Which is all well and good, until you realize this approach will fail to optimize in certain cases. Take this case, for example:

+---------------+
|movl %ecx, %edx|      movl %ecx, %edx      movl %ecx, %edx
|movl %ebx, %esi|      movl %ebx, %esi      movl %ebx, %esi
|movl %esi, %ebx|      movl %esi, %ebx      movl %esi, %ebx
+---------------+ --> +---------------+ -->
 movl %eax, %esi      |movl %eax, %esi|     movl %eax, %esi
                      +---------------+

But we know that lines two, three, and four can be optimized to just line four. So what we need is for O to "slide" down the code one line at a time. Like this:

+---------------+
|movl %ecx, %edx|      movl %ecx, %edx       movl %ecx, %edx      movl %ecx, %edx
|movl %ebx, %esi|     +---------------+      movl %eax, %esi      movl %eax, %esi
|movl %esi, %ebx| --> |movl %ebx, %esi| --> +---------------+ --> xorl %eax, %eax
+---------------+     |movl %esi, %ebx|     |movl $0, %eax  |
 movl %eax, %esi      |movl %eax, %esi|     +---------------+
 movl $0, %eax        +---------------+
                       movl $0, %eax

Besides the obvious advantage of not missing out on any optimization opportunities, we can also feed the one-line result of the three-line optimization into the one-line optimizations, yielding the ability to perform multiple optimizations on a single window, like this:

+---------------+     +---------------+
|movl %ebx, %esi|     |movl $0, %esi  |
|movl %esi, %ebx| --> |               | --> xorl %esi, %esi
|movl $0, %esi  |     |               |
+---------------+     +---------------+

We could even detect when the three-line optimization was successful and keep that optimized line in the window, fill the rest of the window, and check if that new window can also have the three-line optimization applied to it, like this:

+---------------+     +---------------+     +---------------+     +---------------+
|movl %ebx, %esi|     |movl %eax, %esi|     |movl %eax, %esi|     |movl $0, %esi  |
|movl %esi, %ebx| --> |               | --> |movl %esi, %eax| --> |               | --> xorl %esi, %esi
|movl %eax, %esi|     |               |     |movl $0, %esi  |     |               |
+---------------+     +---------------+     +---------------+     +---------------+
 movl %esi, %eax       movl %esi, %eax
 movl $0, %esi         movl $0, %esi

Or, in flow chart form:

+-----+
|Start|
+-----+
   |                                                                          No
   +-+                                            +----------------------------------------------------------+
     |                                            |                                                          |
     V            -------------                   V                                                   --------------
+-----------+    /             \ No +----------------------------+   +---------+   +------------+    /              \ Yes +---+
|Fill window|-->|Is window full?|-->|Apply one-line optimizations|-->|Emit code|-->|Shift window|-->|Is window empty?|--->|End|
+-----------+    \             /    +----------------------------+   +---------+   +------------+    \              /     +---+
     ^            -------------                                                                       --------------
     |                  |
     |               Yes|  +------------------------------+   +----------------------------+
     |                  +->|Apply three-line optimizations|-->|Apply one-line optimizations|
     |                     +------------------------------+   +----------------------------+
     |                                                                      |
     |   +------------+   +---------+                                       |
     +---|Shift window|<--|Emit code|<--------------------------------------+
         +------------+   +---------+

Very cool! And beginning to look like a real peephole optimizer. We could take this framework and apply any number of optimizations that work within this three-line peephole. We could even grow the size of the peephole if we wanted to attempt optimizations that require more context.

And we can also add two-line optimizations to be run between the three-line and one-line optimizations. I have not yet found any optimization opportunities that QBE misses that require looking at only two lines of assembly. But it would be easy to plug in such optimizations into the framework we're designing.

A new data structure

We are going to need a new data structure to represent the peephole that the optimizer will examine. That's the three lines of the original assembly that optimizations will be applied to. It is incredibly straightforward:

struct peephole {
	char *line1;
	char *line2;
	char *line3;
};

And that's it. Now what we need to do is fill that window with assembly code. I'm going to start by modifying the O function:

static void
O(FILE *fp)
{
	struct peephole window;

	window.line1 = NULL;
	window.line2 = NULL;
	window.line3 = NULL;

	while (fillwindow(&window, fp)) {
		three(&window);
		one(&window);

		(void) fputs(window.line1, stdout);

		shiftwindow(&window);
	}
}

Yes, it appears that we are going to be changing at least the function signatures of all the functions we wrote in the previous post. Don't worry; it's not too onerous. Plus we'll learn a lot along the way.

For now, let's write the fillwindow and shiftwindow functions:

static char *
xstrdup(const char *s)
{
	char *p;

	if (s == NULL)
		return NULL;

	if ((p = strdup(s)) == NULL) {
		(void) fputs("O: error: xstrdup failed\n", stderr);

		exit(1);
	}

	return p;
}

static int
fillwindow(struct peephole *window, FILE *fp)
{
	size_t size = 0;

	if (window->line1 == NULL) {
		if (getline(&window->line1, &size, fp) == -1)
			return 0;
	}

	if (window->line2 == NULL) {
		if (getline(&window->line2, &size, fp) == -1)
			return 0;
	}

	if (getline(&window->line3, &size, fp) == -1)
		return 0;

	return 1;
}

static void
shiftwindow(struct peephole *window)
{

	free(window->line1);
	window->line1 = xstrdup(window->line2);

	free(window->line2);
	window->line2 = xstrdup(window->line3);
}

We also had to write a helper function, xstrdup, in order to copy over strings during the shiftwindow function. The strdup function will cause the program to crash if the input string is NULL. We want to track when the window is not completely full, as that tells us that we are nearing the end of the original assembly code, so we will return NULL if that happens, avoiding the crash. It is OK if you free(NULL).

It's important to note that an empty line in our original assembly does not cause the string read in by getline to be NULL: according to the manual page, the buffer returned is both NUL-terminated and includes the delimiter. That means an empty line in the original assembly will be represented as "\n" which is completely fine for the framework we're designing. It simultaneously communicates that this is not the end of the assembly code and it is almost certainly not a candidate for any optimizations.

The fillwindow function fills in the window, but only lines that actually need filling. We do not want to replace lines that already exist in the window and still need to be processed, as that means we might miss out on an optimization opportunity.

We return zero if getline returns -1, which indicates we are at the end of the original assembly code. And we stop filling in the window if we have reached the end of the original assembly, since there's little reason to try to fill in the window with content we know is not there.

So long as our peephole is full, we can perform all optimizations. I think it makes sense to try the three-line optimization first and then try the one-line optimizations on line one of the peephole. If we had optimizations that operated on two lines of assembly, we would probably stick those optimizations in the middle.

But what if the window is not full? If it is not full, then it is possible we still have one or two lines in the window that we have not made optimization attempts on. We should still run the one-line optimizations on them. So we need to amend the O function slightly:

static void
O(FILE *fp)
{
	struct peephole window;

	window.line1 = NULL;
	window.line2 = NULL;
	window.line3 = NULL;

	while (fillwindow(&window, fp)) {
		three(&window);
		one(&window);

		(void) fputs(window.line1, stdout);

		shiftwindow(&window);
	}

	free(window.line3);
	window.line3 = NULL;

	while (window.line1 != NULL) {
		one(&window);

		(void) fputs(window.line1, stdout);

		shiftwindow(&window);
	}
}

If we run out of lines to read, then we know that at least the last line of the window did not have code read into it, so we can free it and set it to NULL. Depending on the results of the final fillwindow call that found the end of the original assembly code, we may or may not have code in lines one and two. But if we do, both of those lines should still try to have optimizations applied to them. The final loop ensures that we stop processing only after we have actually processed all lines of the original assembly and our window is empty.

Rewriting the one-line optimization passes

We have two one-line optimization passes: the first to convert movq $0, %r__ into xorq %r__, %r__ and the second to turn movl $0, %e__ into xorl %e__, %e__. Let's improve them to work with our new peephole data structure:

static int
xorq(struct peephole *window)
{
	char buf[32], r1a, r1b;

	if (window->line1 == NULL)
		return 0;

	if (!strncmp("\tmovq $0, %r", window->line1, 12)) {
		if (strlen(window->line1) < 14)
			return 0;

		r1a = window->line1[12];
		r1b = window->line1[13];

		if (r1b == '\n')
			r1b = ' ';

		(void) snprintf(buf, sizeof(buf), "\txorq %%r%c%c, %%r%c%c\n",
		    r1a, r1b, r1a, r1b);

		free(window->line1);
		window->line1 = xstrdup(buf);

		return 1;
	}

	return 0;
}

static int
xorl(struct peephole *window)
{
	char buf[32], e1a, e1b;

	if (window->line1 == NULL)
		return 0;

	if (!strncmp("\tmovl $0, %e", window->line1, 12)) {
		if (strlen(window->line1) != 15)
			return 0;

		e1a = window->line1[12];
		e1b = window->line1[13];

		(void) snprintf(buf, sizeof(buf), "\txorl %%e%c%c, %%e%c%c\n",
		    e1a, e1b, e1a, e1b);

		free(window->line1);
		window->line1 = xstrdup(buf);

		return 1;
	} else if (!strncmp("\tmovl $0, %r", window->line1, 12)) {
		if (strlen(window->line1) < 14)
			return 0;

		e1a = window->line1[12];
		e1b = window->line1[13];

		if (e1b == 'd') {
			(void) snprintf(buf, sizeof(buf),
			    "\txorl %%r%cd, %%r%cd\n", e1a, e1a);
		} else {
			(void) snprintf(buf, sizeof(buf),
			    "\txorl %%r%c%cd, %%r%c%cd\n", e1a, e1b, e1a, e1b);
		}

		free(window->line1);
		window->line1 = xstrdup(buf);

		return 1;
	}

	return 0;
}

static void
one(struct peephole *window)
{

	if (xorq(window))
		return;

	if (xorl(window))
		return;
}

This is not too different compared to what we had before. The major difference is that we write the new optimized line into a temporary buffer before copying it into the window.

Rewriting the three-line optimization pass

This one is a little more complicated than the one-line passes. While here, I am also going to improve the optimization to include the possibility that the last line is an immediate-to-register copy:

static int
mov(struct peephole *window)
{

	if (window->line1 == NULL || window->line2 == NULL ||
		window->line3 == NULL) {
		return 0;
	}

	if (strncmp("\tmovl %e", window->line1, 8) != 0 &&
		strncmp("\tmovq %r", window->line1, 8) != 0) {
		return 0;
	}

	if (strncmp("\tmovl %e", window->line2, 8) != 0 &&
		strncmp("\tmovq %r", window->line2, 8) != 0) {
		return 0;
	}

	if (strncmp("\tmovl %e", window->line3, 8) != 0 &&
		strncmp("\tmovq %r", window->line3, 8) != 0 &&
		strncmp("\tmovl $0, %e", window->line3, 12) != 0 &&
		strncmp("\tmovq $0, %r", window->line3, 12) != 0) {
			return 0;
	}

	if (window->line1[4] == window->line2[4] &&
		window->line1[7] == window->line2[13] &&
		window->line1[8] == window->line2[14] &&
		window->line1[9] == window->line2[15] &&
		window->line1[13] == window->line2[7] &&
		window->line1[14] == window->line2[8] &&
		window->line1[15] == window->line2[9] &&
		window->line1[14] == window->line3[strlen(window->line3) - 3] &&
		window->line1[15] == window->line3[strlen(window->line3) - 2]) {
		free(window->line2);
		window->line2 = NULL;

		free(window->line1);
		window->line1 = xstrdup(window->line3);

		free(window->line3);
		window->line3 = NULL;

		return 1;
	}

	return 0;
}

static int
three(struct peephole *window)
{

	return mov(window);
}

We first want to make sure the window is full. Because if it is not, there is no reason to even attempt this optimization. Next is the same routine we had previously where we make sure that each of the three lines in the window matches what we would expect as candidates for the optimization. You'll notice too we modified the last check to also accept a final line that is an immediate-to-register copy or a register-to-register copy.

If we have an opportunity for optimizattion, all we need to do is free lines one and two, set line two to NULL, copy line three into line one, and finally free line three and set it to NULL.

As a reminder, we are using the fact that we already know the format in which QBE writes assembly code to make our lives much easier; we don't need to worry about things like tokenization. If you don't know what QBE-generated assembly looks like, run cproc -S O.c and you'll see QBE-generated assembly in O.s.

One last tweak

As it stands, O can make all its optimizations and it can even perform multiple optimizations on a single window. But what it can't do is take the output of the three-line optimization, save the resulting one line in the window, fill the rest of the window, and then try to perform the three-line optimization again on the new window. We should enable that. It's relatively easy to do so. We only need to tweak the O function one last time:

static void
O(FILE *fp)
{
	struct peephole window;
	int ret;

	window.line1 = NULL;
	window.line2 = NULL;
	window.line3 = NULL;

	while (fillwindow(&window, fp)) {
again:
		ret = three(&window);
		one(&window);

		if (ret == 1) {
			if (fillwindow(&window, fp))
				goto again;
		}

		(void) fputs(window.line1, stdout);

		shiftwindow(&window);
	}

	free(window.line3);
	window.line3 = NULL;

	while (window.line1 != NULL) {
		one(&window);

		(void) fputs(window.line1, stdout);

		shiftwindow(&window);
	}
}

This is why we have the three function return an int: we want to know when this optimizaton was successful precisely because we want to be able to know if we can chain together multiple three-line optimizations. If for some reason we can't chain three-line optimizations because there isn't enough original assembly code to do so, that's fine; we can just continue as normal and things will work out OK.

A completed peephole optimizer

You can see the completed version of the peephole optimizer we wrote in these two blog posts here. There may be additional optimizations added in the future, so make sure to check out the main branch from time to time as well.

I hope you enjoyed this mini-series on how to write peephole optimizers. While they are little more than simple pattern matching programs, as we saw they can help fix up code and enable optimizations that more powerful techniques can sometimes miss.

If you'd like to try implementing some more one-line optimizations to test your skills, some other optimizations that are great for peephole optimizers are elimination of direct register-to-register self-copies (e.g., removing movq %rax, %rax), elimination of addition by zero (e.g., removing addq $0, %rax), and replacement of multiplication by a positive power of two with left arithmetic shift (e.g., replacing imull $8, %eax with sall $3, %eax).

If you find additional optimizations that QBE misses, feel free to open a Pull Request (and let QBE upstream know about the potential optimizations too!).

Dr. Brian Robert Callahan

[prev]

[next]

2022-04-02
I wrote a simple, explainable, peephole optimizer (for cproc and QBE), part 2

A new data structure

Rewriting the one-line optimization passes

Rewriting the three-line optimization pass

One last tweak

A completed peephole optimizer

Dr. Brian Robert Callahan

[prev]

[next]

2022-04-02I wrote a simple, explainable, peephole optimizer (for cproc and QBE), part 2

A new data structure

Rewriting the one-line optimization passes

Rewriting the three-line optimization pass

One last tweak

A completed peephole optimizer

2022-04-02
I wrote a simple, explainable, peephole optimizer (for cproc and QBE), part 2