Dr. Brian Robert Callahan

academic, developer, with an eye towards a brighter techno-social life



[prev]
[next]

2025-09-01
Let's write a peephole optimizer for QBE's arm64 backend

TL;DR: Get the code here.

A long time ago, we wrote a series of three blog posts tackling building a peephole optimizer for QBE, "a compiler backend that aims to provide 70% of the performance of industrial optimizing compilers in 10% of the code." It's a small utility written in C that understands its own intermediate language and outputs assembly language for AMD64, AArch64, and 64-bit RISC-V architectures, in ELF (aka, all modern Unixes not named macOS) and Mach-O (aka, all modern Unixes named macOS) formats. That great because it means that we can all use QBE for our hobbyist compiler projects and it will work on just about any slightly modern machine we may have.

And while QBE does a great job, the remaining 30% of performance left on the table means there are opportunities for us to come along and analyze its output and come up with peephole optimizers to get us a little bit closer to that 100%. Our previous efforts were focused on the AMD64 (aka, x86_64) backend. Today, let's take a look at improvements we can make to the AArch64 (aka, ARM64) backend.

My interest here is that I am finishing up a Forth-like language compiler that compiles to QBE. I am at the point where I am curious what kind of code QBE outputs on AArch64.

The only laptop devices I have that use AArch64 are my Macs, so that's what I'll be using. I am not going to make any optimizations that directly depend on being on a Mac, so the peephole optimizer will work no matter what operating system you are using.

We will improve O, the peephole optimizer we already have that works on AMD64 assembly.

Reading over some AArch64 assembly

I had my compiler generate the QBE IL for my Forth standard library, then passed that through QBE to get AArch64 assembly code. Looking through it, I think I found three examples where we can improve the code created by QBE. One of them is intertwined with another, but that makes things a little more fun.

For now, I am using the latest release of QBE, 1.2 as of this writing. Later on, I switched to using the latest git checkout because it has better macOS support.

Eliminating useless register-to-register copies

In one location in the assembly code, I found a peculiar idiom:

	mov	x0, x20
	mov	x20, x0

If we copy x20 to x0, we do not need to copy x0 right back to x20; they are already the same value. We can simply eliminate the second mov instruction. Indeed, the latest git checkout of QBE already does this but I chose to leave the code in O in case you are using the latest release of QBE.

The quick and dirty hack version to make this optimization looks like this:

static int
mov(struct peephole *window)
{
	int c, i = 0, j = 0;
	char r1[4], r2[4], r3[4], r4[4];

	if (window->line1 == NULL || window->line2 == NULL)
		return 0;

	if (strncmp("\tmov\tx", window->line1, 6) != 0 ||
	    strncmp("\tmov\tx", window->line2, 6) != 0) {
		return 0;
	}

	(void) memset(r1, 0, 4);

	c = window->line1[i + 5];

	while (c != ',') {
		if (i == 3)
			return 0;
		r1[i++] = c;
		c = window->line1[i + 5];
	}

	while (c == ',' || c == ' ') {
		c = window->line1[i + 5];
		if (c == '\n')
			return 0;
		++i;
	}

	(void) memset(r2, 0, 4);

	while (c != '\n') {
		if (j == 3)
			return 0;
		r2[j++] = c;
		c = window->line1[i + 5];
		++i;
	}

	i = 0;
	(void) memset(r3, 0, 4);

	c = window->line2[i + 5];

	while (c != ',') {
		if (i == 3)
			return 0;
		r3[i++] = c;
		c = window->line2[i + 5];
	}

	while (c == ',' || c == ' ') {
		c = window->line2[i + 5];
		if (c == '\n')
			return 0;
		++i;
	}

	j = 0;
	(void) memset(r4, 0, 4);

	while (c != '\n') {
		if (j == 3)
			return 0;
		r4[j++] = c;
		c = window->line2[i + 5];
		++i;
	}

	if (r1[0] != 'x' || r2[0] != 'x' || r3[0] != 'x' || r4[0] != 'x')
		return 0;

	if (strcmp(r1, r4) != 0 || strcmp(r2, r3) != 0)
		return 0;

	free(window->line2);
	window->line2 = xstrdup(window->line3);

	free(window->line3);
	window->line3 = NULL;

	return 1;
}

In short, check to see if we have two consecutive lines of assembly that contain mov instructions, then see if we have four 64-bit registers across both instructions, then check that the first register of the first instruction matches the second register of the second instruction and vice-versa. If all those things are true, eliminate the second mov instruction since it does not do anything.

This is a win because it eliminates an instruction without modifying the resulting code in any way.

Perform arithmetic directly with immediates instead of going through a temporary register

Next, I noticed this idiom:

	mov	x0, #1
	add	x0, x19, x0

I also noticed the same idiom with the sub and lsl instructions. Let's break this down. We move an immediate value into a register, then perform some arithmetic between that register and another register, storing the result in the original register. In pseudo-code, it might be something like:

	var x = 1;
	x = y + x;

We could reduce that down to simply:

	var x = y + 1;

There is no need to put the immediate value into a register when AArch64 machines can perform arithmetic between a register and an immediate. Now, there are some caveats, such as the immediate value needs to be between 0 and 4095 (inclusive). But a lot of the time, we are adding small values, so this is almost certainly going to be a win for a significant number of operations.

In code, again as a quick and dirty hack:

static int
merge_immediate(struct peephole *window)
{
	int c, i = 0, j = 0, r;
	char buf[64], imm[6], r1[4], r2[4], r3[4], tmp[32];

	if (window->line1 == NULL || window->line2 == NULL)
		return 0;

	if (strncmp("\tmov\tw", window->line1, 6) != 0 &&
	    strncmp("\tmov\tx", window->line1, 6) != 0) {
		return 0;
	}

	if (strncmp("\tlsl\t", window->line2, 5) != 0 &&
	    strncmp("\tadd\t", window->line2, 5) != 0 &&
	    strncmp("\tsub\t", window->line2, 5) != 0) {
		return 0;
	}

	(void) memset(r1, 0, 4);

	c = window->line1[i + 5];

	while (c != ',') {
		if (i == 3)
			return 0;
		r1[i++] = c;
		c = window->line1[i + 5];
	}

	while (c == ',' || c == ' ') {
		c = window->line1[i + 5];
		if (c == '\n')
			return 0;
		++i;
	}

	(void) memset(imm, 0, 6);

	while (c != '\n') {
		if (j == 5)
			return 0;
		imm[j++] = c;
		c = window->line1[i + 5];
		++i;
	}

	if (imm[0] != '#')
		return 0;

	i = atoi(imm + 1);
	if (i < 0 || i > 4095)
		return 0;
	if (strncmp("\tlsl\t", window->line2, 5) == 0) {
		if (i > 63)
			return 0;
	}

	i = 0;
	c = window->line2[i + 5];
	j = 0;

	(void) memset(r2, 0, 4);

	while (c != ',') {
		if (j == 3)
			return 0;
		r2[j++] = c;
		++i;
		c = window->line2[i + 5];
	}

	++i;
	c = window->line2[i + 5];

	while (c != ',') {
		++i;
		c = window->line2[i + 5];
	}

	r = i + 7;

	i += 2;
	c = window->line2[i + 5];

	j = 0;
	(void) memset(r3, 0, 4);

	while (c != '\n') {
		if (j == 3)
			return 0;
		r3[j++] = c;
		++i;
		c = window->line2[i + 5];
	}

	r1[0] = 'x';

	if (strcmp(r1, r2) == 0 && strcmp(r1, r3) == 0) {
		(void) memset(tmp, 0, 32);
		for (i = 0; i < r; ++i) {
			if (i == 31)
				return 0;
			tmp[i] = window->line2[i];
		}
		(void) snprintf(buf, 64, "%s%s\n", tmp, imm);

		free(window->line1);
		window->line1 = xstrdup(buf);

		free(window->line2);
		window->line2 = xstrdup(window->line3);

		free(window->line3);
		window->line3 = NULL;

		return 1;
	}

	return 0;
}

We check to see that we have two assembly lines: the first is a mov and the second is one of add, sub, or lsl. Ensure that the first instruction moves an immediate value in the range 0 .. 4095 into a register, or 0 .. 63 if the second line has an lsl instruction, and that the register in the mov instruction matches both the first and last registers in the arithmetic instruction. If all that is true, then we can replace the last register in the arithmetic instruction with the immediate from the mov instruction and eliminate the mov instruction.

Now, this code:

	mov	x0, #1
	add	x0, x19, x0

Gets optimized to:

	add	x0, x19, #1

Which saves one instruction.

I also noticed that QBE also likes this related but importantly different idiom:

	mov	x1, #16
	sub	sp, sp, x1

Sometimes we see a mov x1, #0 instruction or something similar immediately following this idiom. But not always. When we do see a mov like this following this idiom, we can safely conclude within our three-line window that we could perform the optimization, assuming the immediate is within range. But if we don't have an instruction that immediately clobbers the temporary register, we cannot safely make this optimization, though in practice I believe it is always safe to do with QBE-generated code. But we can't prove it; in theory the value in x1 could be used as an input in a subsequent instruction. But, again, in practice I think it is safe for all QBE-generated code (at least for now).

A few times we had an idiom that was a combination of the previous two:

	mov	x1, #16
	sub	sp, sp, x1
	mov	x1, #0
	add	x1, sp, x1

This could be reduced to:

	sub	sp, sp, #16
	add	x1, sp, #0

But I also left that out. We do have enough information in our sliding three-line peephole to prove the optimizations so it could be done. I just haven't done it yet.

But there is something we can do with the add x1, sp, #0 line.

Adding 0 to/from a register is a mov

According to the Arm developer reference, adding 0 to or from sp is equivalent to mov to or from sp. That means we do not actually need to change add x1, sp, #0 to mov x1, sp but I am going to anyway; I think it makes it a little more clear to the human reader what we aim to accomplish in the code. Plus it gives us another optimization to implement.

Note that this equivalence of mov and add #0 only applies when adding 0 to or from sp. Adding 0 to or from any other register becomes an orr, but we can make the same optimization here. The orr and add instructions do not set flags; you need the adds instruction if you want to set flags.

Code for this optimization:

static void
add(struct peephole *window)
{
	int c, i = 0, j = 0, r;
	char imm[3], r1[4];

	if (window->line1 == NULL)
		return;

	if (strncmp("\tadd\tx", window->line1, 6) == 0) {
		c = window->line1[i++];
		while (c != ',') {
			if (c == '\n' || c == '\0')
				return;
			c = window->line1[i++];
		}
		++i;
		c = window->line1[i++];
		while (c != ',') {
			if (j == 3)
				return;
			r1[j++] = c;
			c = window->line1[i++];
		}
		r = i;
		++i;

		(void) memset(imm, 0, 3);

		j = 0;
		c = window->line1[i++];
		while (c != '\n') {
			if (j == 2)
				return;
			imm[j++] = c;
			c = window->line1[i++];
		}
		if (strcmp(imm, "#0") == 0) {
			window->line1[1] = 'm';
			window->line1[2] = 'o';
			window->line1[3] = 'v';
			window->line1[r - 1] = '\n';
			window->line1[r] = '\0';
		}
	}
}

We check if we are adding 0 to a register, and if we are, we change add to mov and remove the immediate value from the instruction.

Taken with the previous optimization, we can optimize:

	mov	x0, #0
	add	x0, x1, x0

To:

	mov	x0, x1

Neat.

Conclusion

One significant difference between optimizing for AArch64 compared to AMD64 is in instruction length; AMD64 has variable-length instructions so it was possible that two instructions that did the same thing might be encoded into instructions of different lengths. However, AArch64 instructions are all four bytes in length. So we cannot made any code size wins by replacing instructions with an indentical number of instructions. But we can win on code size if we remove instructions. We might also win on execution speed but I did not measure it.

In any event, we able to reduce our code by a tiny bit which feels nice.

Top

RSS