Brian Robert Callahan

academic, developer, with an eye towards a brighter techno-social life




2022-03-30
I wrote a peephole optimizer (for QBE), part 1

All source code for this blog post can be found here.

One of the things we did not tackle when writing our PL/0 compiler was an optimizer. Our code generator output C and then we let our C compiler handle the rest. If we wanted optimizations, we could turn them on in the C compiler. That's great but it didn't help us learn anything about optimizations.

I test a lot of C compilers when testing oksh. One of those compilers is called cproc. It's a C11 compiler that compiles C into QBE and then has QBE transform its intermediate language into assembly, which then follows the usual pipeline of being run through the system assembler and ultimately the system linker.

We saw QBE before when we wrote our brainfuck compiler in QBE IL. When observing how QBE transforms its IL into assembly, we noticed that it missed a chance at optimization in the case where you moved a zero into a register. The generated assembly used a mov instruction to move an immediate of zero into the register. But the fastest and smallest way to put a zero into a register (at least, on amd64 machines) is to xor the register with itself.

I noticed the latest cproc, which uses the latest QBE, still fails to make this optimization. At one point in time, the author of cproc sent me a diff for QBE that did make the optimization but I guess it was never applied to the QBE repository. As the savings for this optimization really can add up over a large project, and since it gave us an opportunity to look at peephole optimizers, I decided to write a small program that could be run in the cproc pipeline in between QBE and the system assembler to fix up this optimization. Maybe we could find more optimizations as well.

A peephole optimizer

The simplest type of optimizer that we can write without having to dive into the vast literature on compiler optimizations is a peephole optimizer. In this technique, we read in some number of lines of assembly and perform basic pattern matching on those lines looking for instructions that we know can be replaced with better ones (or entirely eliminated). "Better" in this case brings up the first interesting question about optimization: better can be contextual based on your goals. Do you want instructions that take less time to execute? Do you want instructions that require fewer bytes to encode, making for a smaller program? Something else? What is faster may not be smaller. What is smaller may not be faster. For us with our first optimization, we are lucky that the optimized version is both faster and smaller.

The peephole, or window, for our first optimization is just one line. There are some other peephole optimizations that can be performed with a peephole of just one line: removing useless instructions, such as moving a register into itself or adding zero to a register, immediately come to mind.

After we've completed the mov to xor optimization, we'll tackle an additional optimization that requires a larger peephole to enable.

We will name our peephole optimizer O in honor of the C compiler flag (-O) that is commonly used to tell the C compiler to turn on its optimizer.

Choosing a language

I am going to write O in C. Not because C is necessarily the best language for this task; it probably isn't. But if we write O in C, then we can run cproc on O and measure the size savings on O itself. I am also going to rely on direct knowledge of cproc to make our lives easier when figuring out what strings to match on. There is also a little bit of argc and argv handling informed by that knowledge, so that O can be plugged into cproc when we are finished.

With all of that said, our main function looks like this:

static int
usage(void)
{

	(void) fputs("usage: O in.s [-o out.s]\n", stderr);

	return 1;
}

int
main(int argc, char *argv[])
{
	FILE *fp;

	if (argc == 4) {
		if (strcmp(argv[2], "-o") != 0)
			return usage();

		if (freopen(argv[3], "w+", stdout) == NULL) {
			(void) fprintf(stderr, "O: error: couldn't open %s\n",
			    argv[3]);

			return 1;
		}
	} else if (argc != 2) {
		return usage();
	}

	if (!strcmp(argv[1], "-")) {
		O(stdin);

		return 0;
	}

	if ((fp = fopen(argv[1], "r")) == NULL) {
		(void) fprintf(stderr, "O: error: couldn't open %s\n",
		    argv[1]);

		return 1;
	}

	O(fp);

	(void) fclose(fp);

	return 0;
}

Because of how cproc works, when the final stage of the compilation is to output assembly (-S), cproc automatically appends -o [file.s] so O needs to support that. If cproc is going to write an object file or binary, then O will write to stdout and the assembler will read the output in via a pipe. Additionally, input from cproc is always stdin, demarcated as -, so we need to support that too.

To avoid too much complexity, if we do have to output to a file, we can freopen(3) that file to stdout. Most of the time, we will be outputting to stdout so it makes sense to make outputting to a file the special case.

Reading in one line at a time

We can use the getline(3) function to read in one line at a time into a buffer. While the getline(3) function is "new" with POSIX.1-2008, I think all modern Unix systems have the function. If your system does not, you can get a copy of getline(3) here.

Reading in one line at a time might look something like this:

static void
O(FILE *fp)
{
	char *line = NULL;
	size_t size = 0;

	while (getline(&line, &size, fp) != -1)
		(void) fputs(line, stdout);

	free(line);
}

Yes, you really do need the call to free(3) at the end there: getline(3) allocates and reallocates the buffer for us, and that buffer is still allocated when the loop ends.

This would read in one line at a time and then immediately print it out. Of course, we need to perform a check to match for potential lines to replace, so let's improve this a bit:

static void
one(const char *line)
{

	if (xorq(line))
		return;

	if (xorl(line))
		return;

	(void) fputs(line, stdout);
}

static void
O(FILE *fp)
{
	char *line = NULL;
	size_t size = 0;

	while (getline(&line, &size, fp) != -1)
		one(line);

	free(line);
}

Now we are reading in one line at a time, and checking if it's a movq line that can be replaced with an xorq line. If not, perhaps it is a movl line that can be replaced with an xorl line. If not, then it is not a line that can be optimized this way so we should just print it out as-is.

Find and replace

Because we know exactly what QBE is going to output, we can use that to our advantage. We don't need to tokenize the output; we can simply check for exact strings. That saves us a lot of logic. A movq line that we want to match will always look like this:

	movq $0, %r__

Where __ reflects any of the 64-bit registers. Therefore, we can match exactly that; since it is the only form this line can take, it is the only thing we need to check for. We want to replace that movq line with this:

	xorq %r__, %r__

If we were to write that as a function, it might look like this:

static int
xorq(const char *line)
{

	if (!strncmp("\tmovq $0, %r", line, 12)) {
		(void) fprintf(stdout, "\txorq %%r%c%c, %%r%c%c\n", line[12],
		    line[13], line[12], line[13]);

		return 1;
	}

	return 0;
}

If you wanted to be paranoid, you could check that strlen(line) > 13 before calling fprintf.

We check whether the first twelve characters of the line match a movq line that we can optimize into an xorq line. If they do, we instead write the equivalent xorq line. We are able to learn the target register directly from the movq line and reuse it in the rewritten xorq line.

We can use the same strategy for optimizing movl lines into xorl lines. Here, we want to transform:

	movl $0, %e__

Into:

	xorl %e__, %e__

In code, that looks like this:

static int
xorl(const char *line)
{

	if (!strncmp("\tmovl $0, %e", line, 12)) {
		(void) fprintf(stdout, "\txorl %%e%c%c, %%e%c%c\n", line[12],
		    line[13], line[12], line[13]);

		return 1;
	}

	return 0;
}

We have written a peephole optimizer! It is guaranteed to improve the code that QBE produces.

Improving these optimizations

But we can do better. The xorq function doesn't account for %r8 or %r9 while the xorl function doesn't account for any of the %r8d-%r15d registers. They'll need special tweaks, but it is nothing too difficult:

static int
xorq(struct peephole *window)
{
	char buf[32], r1a, r1b;

	if (window->line1 == NULL)
		return 0;

	if (!strncmp("\tmovq $0, %r", window->line1, 12)) {
		if (strlen(window->line1) < 14)
			return 0;

		r1a = window->line1[12];
		r1b = window->line1[13];

		if (r1b == '\n')
			r1b = ' ';

		(void) snprintf(buf, sizeof(buf), "\txorq %%r%c%c, %%r%c%c\n",
		    r1a, r1b, r1a, r1b);

		free(window->line1);
		window->line1 = xstrdup(buf);

		return 1;
	}

	return 0;
}

static int
xorl(struct peephole *window)
{
	char buf[32], e1a, e1b;

	if (window->line1 == NULL)
		return 0;

	if (!strncmp("\tmovl $0, %e", window->line1, 12)) {
		if (strlen(window->line1) != 15)
			return 0;

		e1a = window->line1[12];
		e1b = window->line1[13];

		(void) snprintf(buf, sizeof(buf), "\txorl %%e%c%c, %%e%c%c\n",
		    e1a, e1b, e1a, e1b);

		free(window->line1);
		window->line1 = xstrdup(buf);

		return 1;
	} else if (!strncmp("\tmovl $0, %r", window->line1, 12)) {
		if (strlen(window->line1) < 14)
			return 0;

		e1a = window->line1[12];
		e1b = window->line1[13];

		if (e1b == 'd') {
			(void) snprintf(buf, sizeof(buf),
			    "\txorl %%r%cd, %%r%cd\n", e1a, e1a);
		} else {
			(void) snprintf(buf, sizeof(buf),
			    "\txorl %%r%c%cd, %%r%c%cd\n", e1a, e1b, e1a, e1b);
		}

		free(window->line1);
		window->line1 = xstrdup(buf);

		return 1;
	}

	return 0;
}

It is really important that we make these tweaks, since QBE really enjoys using the %r8d register. That way, we can benefit from this optimization in those cases.

Testing

While we are not yet done with our peephole optimizer, since we will add another optimization, it would be good to test it out. We can test it on the command line by running O -. Here are the results when I type in some assembly directly:

/home/brian $ O -
        movq $0, %rax
        xorq %rax, %rax
        movl $0, %ebx
        xorl %ebx, %ebx
        movq $1, %rax
        movq $1, %rax

I typed in the odd-numbered lines and O responded with the even-numbered lines. Don't forget to type the leading tab, since O requires it to make a successful match. We can see that right now the optimizer successfully replaces lines that match our logic and, importantly, successfully leaves alone lines that do not match.

Plugging O into cproc

We can execute a much bigger test by plugging O into the cproc pipeline. We should immediately begin to see improvements to the code we compile. And, since my OpenBSD port of cproc does the whole three-stage self-hosting compile, we will immediately learn if cproc can handle this improvement to itself. I'm betting that it can, but it's always better to be sure.

There are two files that need to be patched. First, we need to patch the configure script and teach it about O, since that will place a needed data structure in a header file to be used later:

Index: configure
--- configure.orig
+++ configure
@@ -159,6 +159,7 @@ static const char *const preprocesscmd[] = {
        "-D", "__extension__=",
 $defines};
 static const char *const codegencmd[]    = {"$DEFAULT_QBE"};
+static const char *const optimizecmd[]   = {"O"};
 static const char *const assemblecmd[]   = {"$DEFAULT_ASSEMBLER"};
 static const char *const linkcmd[]       = {"$DEFAULT_LINKER", $linkflags};
 EOF

Second, we need to patch driver.c to teach the compiler driver about this new stage in the compilation pipeline:

Index: driver.c
--- driver.c.orig
+++ driver.c
@@ -32,6 +32,7 @@ enum stage {
 	PREPROCESS,
 	COMPILE,
 	CODEGEN,
+	OPTIMIZE,
 	ASSEMBLE,
 	LINK,
 };
@@ -60,6 +61,7 @@ static struct stageinfo stages[] = {
 	[PREPROCESS] = {.name = "preprocess"},
 	[COMPILE]    = {.name = "compile"},
 	[CODEGEN]    = {.name = "codegen"},
+	[OPTIMIZE]   = {.name = "optimize"},
 	[ASSEMBLE]   = {.name = "assemble"},
 	[LINK]       = {.name = "link"},
 };
@@ -381,6 +383,7 @@ main(int argc, char *argv[])
 	arrayaddbuf(&stages[PREPROCESS].cmd, preprocesscmd, sizeof(preprocesscmd));
 	arrayaddptr(&stages[COMPILE].cmd, compilecommand(argv[0]));
 	arrayaddbuf(&stages[CODEGEN].cmd, codegencmd, sizeof(codegencmd));
+	arrayaddbuf(&stages[OPTIMIZE].cmd, optimizecmd, sizeof(optimizecmd));
 	arrayaddbuf(&stages[ASSEMBLE].cmd, assemblecmd, sizeof(assemblecmd));
 	arrayaddbuf(&stages[LINK].cmd, linkcmd, sizeof(linkcmd));
 
@@ -400,6 +403,7 @@ main(int argc, char *argv[])
 	arrayaddptr(&stages[COMPILE].cmd, arch);
 	arrayaddptr(&stages[CODEGEN].cmd, "-t");
 	arrayaddptr(&stages[CODEGEN].cmd, qbearch);
+	arrayaddptr(&stages[OPTIMIZE].cmd, "-");
 
 	for (;;) {
 		++argv, --argc;
@@ -414,10 +418,10 @@ main(int argc, char *argv[])
 			switch (input->filetype) {
 			case ASM:    input->stages =                                     1<<ASSEMBLE|1<<LINK; break;
 			case ASMPP:  input->stages = 1<<PREPROCESS|                      1<<ASSEMBLE|1<<LINK; break;
-			case C:      input->stages = 1<<PREPROCESS|1<<COMPILE|1<<CODEGEN|1<<ASSEMBLE|1<<LINK; break;
+			case C:      input->stages = 1<<PREPROCESS|1<<COMPILE|1<<CODEGEN|1<<OPTIMIZE|1<<ASSEMBLE|1<<LINK; break;
 			case CHDR:   input->stages = 1<<PREPROCESS                                          ; break;
-			case CPPOUT: input->stages =               1<<COMPILE|1<<CODEGEN|1<<ASSEMBLE|1<<LINK; break;
-			case QBE:    input->stages =                          1<<CODEGEN|1<<ASSEMBLE|1<<LINK; break;
+			case CPPOUT: input->stages =               1<<COMPILE|1<<CODEGEN|1<<OPTIMIZE|1<<ASSEMBLE|1<<LINK; break;
+			case QBE:    input->stages =                          1<<CODEGEN|1<<OPTIMIZE|1<<ASSEMBLE|1<<LINK; break;
 			case OBJ:    input->stages =                                                 1<<LINK; break;
 			default:     usage("reading from standard input requires -x");
 			}
@@ -505,7 +509,7 @@ main(int argc, char *argv[])
 				arrayaddptr(&stages[PREPROCESS].cmd, "-P");
 				break;
 			case 'S':
-				last = CODEGEN;
+				last = OPTIMIZE;
 				break;
 			case 's':
 				arrayaddptr(&stages[LINK].cmd, "-s");

Perhaps not too surprisingly, adding these patches and running cproc through a three-stage compile produced a working compiler.

Some preliminary numbers

It's a bit unfair to compare numbers for cproc itself, since we literally added code and therefore it is not a fair fight. But what we can do is test O itself, once with the original cproc and again with the new cproc.

Here is the output of size(1) on O.o compiled with the original cproc:

text    data    bss     dec     hex
975     138     0       1113    459

And again with the new cproc:

text    data    bss     dec     hex
948     138     0       1086    43e

That's a savings of 27 bytes, or about a 2.77% reduction in binary size. That's not a huge amount, but it's also not nothing.

For comparison, with clang -O2 we get this size:

text    data    bss     dec     hex
914     16      0       930     3a2

And with clang -Oz, that becomes:

text    data    bss     dec     hex
866     24      0       890     37a

But the big winner is gcc -Os (gcc (GCC) 12.0.1 20220310 (experimental)):

text    data    bss     dec     hex
713     0       0       713     2c9

So while QBE can't quite compete with the likes of GCC and Clang, it isn't designed to. It's designed to be fast and good enough, and it is. With O, QBE is a little bit better without sacrificing too much compilation speed. After all, one of the selling points of cproc is that compiles are much faster compared to GCC and Clang.

But O is a very small codebase. Let's try something larger. Let's try oksh.

Here are the size measurements of oksh built with the old cproc:

text    data    bss     dec     hex
278585  17363   29920   325868  4f8ec

And with the new cproc:

text    data    bss     dec     hex
269745  17363   29920   317028  4d664

That's a savings of 8840 bytes, or about a 3.17% reduction in binary size. Again, it's not huge but it's also not nothing. Remember too that it's a speed increase on top of a size decrease, so this optimization wins on both fronts. While I'm sure the speed increase is little more than negligible, over a large enough span of time it probably adds up.

How about something large, like SQLite? With the old cproc:

text    data    bss     dec     hex
1164826 90104   888     1255818 13298a

And with the new cproc:

text    data    bss     dec     hex
1132438 90104   888     1223430 12ab06

That's a savings of 32388 bytes, or about a 2.78% reduction in binary size. I'm starting to see a trend. We're consistently getting between 2% and 3% reduction in binary size with this one optimization.

A more complicated second optimization

When reading through assembly generated by cproc, I discovered a peculiar construct:

	movq %rdi, %r12
.Lbb28:
	movl %ebx, %edi
	movl %edi, %ebx
	movq %r12, %rdi

It looks like this piece of assembly wants to use %rdi as a temporary register for a register-to-register copy, so it is copying itself to %r12, performing the register-to-register copy, then returning to its original value. But there's an opportunity for an optimization for this particular register-to-register copy: all we're doing is moving %ebx into itself. We can say with certainty that the last three lines of this construct can be reduced to just the last line, since copying a register into itself via a temporary register and then immediately reassigning the temporary register is equivalent to simply assigning the final value to the temporary register and not bothering with the copy.

Interestingly, we cannot actually say with certainty if moving %rdi into %r12 and back again is useless. It could be the case that it is useless, in which case we might be able to remove all four lines of assembly, leaving just the label. But it also might be the case that the very next line of assembly is performing an operation on %r12 that is dependent on %r12 having previously gotten a copy of the value of %rdi. But there's also a label in the middle: so perhaps there is logic later where a value for the next iteration of a loop is placed in %r12 en route to %rdi. We can't know any of that from the context we have in this small five-line window.

That brings us to another interesting problem with peephole optimizers: we need to be fairly conservative in our selection of optimizations. We should only implement optimizations that we can be absolutely certain have no undesirable side effects. The elimination of the third and fourth lines in that five-line window above has no side effects, so that is the optimization we will teach O.

Setting up a larger window

First, we will need to slightly amend our O function:

static int
three(const char *line, FILE *fp)
{

	if (mov(line, fp) == 1)
		return 1;

	return 0;
}

static void
O(FILE *fp)
{
	char *line = NULL;
	size_t size = 0;

	while (getline(&line, &size, fp) != -1) {
		if (three(line, fp) == 1)
			break;
	}

	free(line);
}

The three function will call the one function when appropriate, so we will still benefit from our previous work.

We only have one optimization for a three-line peephole, which removes a useless register self-copy:

static int
mov(const char *line, FILE *fp)
{
	char *line2 = NULL, *line3 = NULL;
	size_t size = 0;

	if (strncmp("\tmovl %e", line, 8) != 0 &&
	    strncmp("\tmovq %r", line, 8) != 0) {
		one(line);

		return 0;
	}

	if (getline(&line2, &size, fp) == -1) {
		(void) fputs(line, stdout);

		free(line2);

		return 1;
	}

	if (strncmp("\tmovl %e", line2, 8) != 0 &&
	    strncmp("\tmovq %r", line2, 8) != 0) {
		(void) fputs(line, stdout);
		one(line2);

		free(line2);

		return 0;
	}

	if (getline(&line3, &size, fp) == -1) {
		(void) fputs(line, stdout);
		(void) fputs(line2, stdout);

		free(line3);
		free(line2);

		return 1;
	}

	if (strncmp("\tmovl", line3, 5) != 0 &&
	    strncmp("\tmovq", line3, 5) != 0) {
		(void) fputs(line, stdout);
		(void) fputs(line2, stdout);
		one(line3);

		free(line3);
		free(line2);

		return 0;
	}

	if (line[4] == line2[4] &&
	    line[7] == line2[13] &&
	    line[8] == line2[14] &&
	    line[9] == line2[15] &&
	    line[13] == line2[7] &&
	    line[14] == line2[8] &&
	    line[15] == line2[9] &&
	    line[14] == line3[14] &&
	    line[15] == line3[15]) {
		(void) fputs(line3, stdout);
	} else {
		(void) fputs(line, stdout);
		(void) fputs(line2, stdout);
		one(line3);
	}

	free(line3);
	free(line2);

	return 0;
}

There's a lot here. Let's break it down.

We first recognize if the first line we read is a mov with a 32-bit or 64-bit register as the source register. If not, then it definitely cannot be a candidate for this second optimization. However, it might still be a candidate for the first optimization we coded, so we should feed this line into the one function. Regardless of whether or not this line is a successful candidate for the first optimization, after this we are done with this peephole so we should return and start all over again with a new peephole.

You'll notice that we will return zero to say "we did not see the end of the file" and one to say "we have seen the end of the file."

Next is to realize that this optimization only works if we have at least three lines to interrogate. But we may be close to the end of the file; there may not be three lines left. We need to account for that. If there is no next line, then we are definitely done and we should simply print what we have. We know at this point the line we have is not a candidate for the first optimization, or else it would have entered into the previous if statement.

And we continue on. It is exactly the same logic for the second line: we check to make sure it is a line that continues candidacy for this second optimization; we then check to ensure there is a third line that we can use to finish checking for candidacy.

Once we have read in all three lines, we need to know the following:

  1. If lines one and two are the same type of mov (movl or movq)
  2. If the source register in line one is the destination register in line two
  3. If the destination register in line one is the source register in line two
  4. If the destination register in line one is the destination register in line three

The large if statement performs all these checks.

Something interesting is that we don't care if line three is a 32-bit movl or a 64-bit movq. Just so long as the destination register is the same in both line one and line three, that is enough to ensure the first two lines are useless. We do care that the type of mov is the same for lines one and two, because if they are not it is possible that there are side effects that would render those two lines not useless.

This function also doesn't account for the fact that the 32-bit versions of %r12-%r15 are named %r12d-%r15d. I haven't seen any cproc code that uses those 32-bit registers in the type of construct that this optimization eliminates, but it is certainly possible. We'll probably have to write another function at some point that checks for this special case and acts accordingly.

I was able to rebuild cproc with this new optimization enabled. It produced exactly the same binary as the O that only knew the first optimization. All that means is that the construct this second optimization eliminates wasn't present in the cproc assembly. It also means that we didn't break anything by adding this second optimization. On the codebase that did generate this peculiar assembly construct, the second optimization worked and the useless copy was successfully optimized away.

Bugs and caveats, to be fixed in part 2

Astute readers may have noticed that our peephole optimizer, as good as it is, may fail to optimize constructs it knows how to optimize. Notably, O will fail to optimize this:

	movl %ecx, %eax
	movl %ebx, %edi
	movl %edi, %ebx
	movq %r12, %rdi

This is because the first three lines get read in, then we learn that this does not match the pattern we are looking for, so we print out all three lines and then read in the fourth line on its own, which doesn't match anything either.

It would be better if we printed out just the first line after discovering that the first three lines do not satisfy any optimization patterns, then check lines two through four for any optimization patterns; if we did that, we could optimize the last three lines.

It will take some additional coding and problem solving to be able to accomplish this goal. We will endeavor to do so in the next post.
