Dr. Brian Robert Callahan
academic, developer, with an eye towards a brighter techno-social life
TL;DR: Get the code here.
A long time ago, we wrote a series of three blog posts tackling building a peephole optimizer for QBE, "a compiler backend that aims to provide 70% of the performance of industrial optimizing compilers in 10% of the code." It's a small utility written in C that understands its own intermediate language and outputs assembly language for AMD64, AArch64, and 64-bit RISC-V architectures, in ELF (aka, all modern Unixes not named macOS) and Mach-O (aka, all modern Unixes named macOS) formats. That great because it means that we can all use QBE for our hobbyist compiler projects and it will work on just about any slightly modern machine we may have.
And while QBE does a great job, the remaining 30% of performance left on the table means there are opportunities for us to come along and analyze its output and come up with peephole optimizers to get us a little bit closer to that 100%. Our previous efforts were focused on the AMD64 (aka, x86_64) backend. Today, let's take a look at improvements we can make to the AArch64 (aka, ARM64) backend.
My interest here is that I am finishing up a Forth-like language compiler that compiles to QBE. I am at the point where I am curious what kind of code QBE outputs on AArch64.
The only laptop devices I have that use AArch64 are my Macs, so that's what I'll be using. I am not going to make any optimizations that directly depend on being on a Mac, so the peephole optimizer will work no matter what operating system you are using.
We will improve O
, the peephole optimizer we already have that works on AMD64 assembly.
I had my compiler generate the QBE IL for my Forth standard library, then passed that through QBE to get AArch64 assembly code. Looking through it, I think I found three examples where we can improve the code created by QBE. One of them is intertwined with another, but that makes things a little more fun.
For now, I am using the latest release of QBE, 1.2 as of this writing. Later on, I switched to using the latest git checkout because it has better macOS support.
In one location in the assembly code, I found a peculiar idiom:
mov x0, x20 mov x20, x0
If we copy x20
to x0
, we do not need to copy x0
right back to x20
; they are already the same value. We can simply eliminate the second mov
instruction. Indeed, the latest git checkout of QBE already does this but I chose to leave the code in O
in case you are using the latest release of QBE.
The quick and dirty hack version to make this optimization looks like this:
static int mov(struct peephole *window) { int c, i = 0, j = 0; char r1[4], r2[4], r3[4], r4[4]; if (window->line1 == NULL || window->line2 == NULL) return 0; if (strncmp("\tmov\tx", window->line1, 6) != 0 || strncmp("\tmov\tx", window->line2, 6) != 0) { return 0; } (void) memset(r1, 0, 4); c = window->line1[i + 5]; while (c != ',') { if (i == 3) return 0; r1[i++] = c; c = window->line1[i + 5]; } while (c == ',' || c == ' ') { c = window->line1[i + 5]; if (c == '\n') return 0; ++i; } (void) memset(r2, 0, 4); while (c != '\n') { if (j == 3) return 0; r2[j++] = c; c = window->line1[i + 5]; ++i; } i = 0; (void) memset(r3, 0, 4); c = window->line2[i + 5]; while (c != ',') { if (i == 3) return 0; r3[i++] = c; c = window->line2[i + 5]; } while (c == ',' || c == ' ') { c = window->line2[i + 5]; if (c == '\n') return 0; ++i; } j = 0; (void) memset(r4, 0, 4); while (c != '\n') { if (j == 3) return 0; r4[j++] = c; c = window->line2[i + 5]; ++i; } if (r1[0] != 'x' || r2[0] != 'x' || r3[0] != 'x' || r4[0] != 'x') return 0; if (strcmp(r1, r4) != 0 || strcmp(r2, r3) != 0) return 0; free(window->line2); window->line2 = xstrdup(window->line3); free(window->line3); window->line3 = NULL; return 1; }
In short, check to see if we have two consecutive lines of assembly that contain mov
instructions, then see if we have four 64-bit registers across both instructions, then check that the first register of the first instruction matches the second register of the second instruction and vice-versa. If all those things are true, eliminate the second mov
instruction since it does not do anything.
This is a win because it eliminates an instruction without modifying the resulting code in any way.
Next, I noticed this idiom:
mov x0, #1 add x0, x19, x0
I also noticed the same idiom with the sub
and lsl
instructions. Let's break this down. We move an immediate value into a register, then perform some arithmetic between that register and another register, storing the result in the original register. In pseudo-code, it might be something like:
var x = 1; x = y + x;
We could reduce that down to simply:
var x = y + 1;
There is no need to put the immediate value into a register when AArch64 machines can perform arithmetic between a register and an immediate. Now, there are some caveats, such as the immediate value needs to be between 0 and 4095 (inclusive). But a lot of the time, we are adding small values, so this is almost certainly going to be a win for a significant number of operations.
In code, again as a quick and dirty hack:
static int merge_immediate(struct peephole *window) { int c, i = 0, j = 0, r; char buf[64], imm[6], r1[4], r2[4], r3[4], tmp[32]; if (window->line1 == NULL || window->line2 == NULL) return 0; if (strncmp("\tmov\tw", window->line1, 6) != 0 && strncmp("\tmov\tx", window->line1, 6) != 0) { return 0; } if (strncmp("\tlsl\t", window->line2, 5) != 0 && strncmp("\tadd\t", window->line2, 5) != 0 && strncmp("\tsub\t", window->line2, 5) != 0) { return 0; } (void) memset(r1, 0, 4); c = window->line1[i + 5]; while (c != ',') { if (i == 3) return 0; r1[i++] = c; c = window->line1[i + 5]; } while (c == ',' || c == ' ') { c = window->line1[i + 5]; if (c == '\n') return 0; ++i; } (void) memset(imm, 0, 6); while (c != '\n') { if (j == 5) return 0; imm[j++] = c; c = window->line1[i + 5]; ++i; } if (imm[0] != '#') return 0; i = atoi(imm + 1); if (i < 0 || i > 4095) return 0; if (strncmp("\tlsl\t", window->line2, 5) == 0) { if (i > 63) return 0; } i = 0; c = window->line2[i + 5]; j = 0; (void) memset(r2, 0, 4); while (c != ',') { if (j == 3) return 0; r2[j++] = c; ++i; c = window->line2[i + 5]; } ++i; c = window->line2[i + 5]; while (c != ',') { ++i; c = window->line2[i + 5]; } r = i + 7; i += 2; c = window->line2[i + 5]; j = 0; (void) memset(r3, 0, 4); while (c != '\n') { if (j == 3) return 0; r3[j++] = c; ++i; c = window->line2[i + 5]; } r1[0] = 'x'; if (strcmp(r1, r2) == 0 && strcmp(r1, r3) == 0) { (void) memset(tmp, 0, 32); for (i = 0; i < r; ++i) { if (i == 31) return 0; tmp[i] = window->line2[i]; } (void) snprintf(buf, 64, "%s%s\n", tmp, imm); free(window->line1); window->line1 = xstrdup(buf); free(window->line2); window->line2 = xstrdup(window->line3); free(window->line3); window->line3 = NULL; return 1; } return 0; }
We check to see that we have two assembly lines: the first is a mov
and the second is one of add
, sub
, or lsl
. Ensure that the first instruction moves an immediate value in the range 0 .. 4095 into a register, or 0 .. 63 if the second line has an lsl
instruction, and that the register in the mov
instruction matches both the first and last registers in the arithmetic instruction. If all that is true, then we can replace the last register in the arithmetic instruction with the immediate from the mov
instruction and eliminate the mov
instruction.
Now, this code:
mov x0, #1 add x0, x19, x0
Gets optimized to:
add x0, x19, #1
Which saves one instruction.
I also noticed that QBE also likes this related but importantly different idiom:
mov x1, #16 sub sp, sp, x1
Sometimes we see a mov x1, #0
instruction or something similar immediately following this idiom. But not always. When we do see a mov
like this following this idiom, we can safely conclude within our three-line window that we could perform the optimization, assuming the immediate is within range. But if we don't have an instruction that immediately clobbers the temporary register, we cannot safely make this optimization, though in practice I believe it is always safe to do with QBE-generated code. But we can't prove it; in theory the value in x1
could be used as an input in a subsequent instruction. But, again, in practice I think it is safe for all QBE-generated code (at least for now).
A few times we had an idiom that was a combination of the previous two:
mov x1, #16 sub sp, sp, x1 mov x1, #0 add x1, sp, x1
This could be reduced to:
sub sp, sp, #16 add x1, sp, #0
But I also left that out. We do have enough information in our sliding three-line peephole to prove the optimizations so it could be done. I just haven't done it yet.
But there is something we can do with the add x1, sp, #0
line.
mov
According to the Arm developer reference, adding 0 to or from sp
is equivalent to mov
to or from sp
. That means we do not actually need to change add x1, sp, #0
to mov x1, sp
but I am going to anyway; I think it makes it a little more clear to the human reader what we aim to accomplish in the code. Plus it gives us another optimization to implement.
Note that this equivalence of mov
and add #0
only applies when adding 0 to or from sp
. Adding 0 to or from any other register becomes an orr
, but we can make the same optimization here. The orr
and add
instructions do not set flags; you need the adds
instruction if you want to set flags.
Code for this optimization:
static void add(struct peephole *window) { int c, i = 0, j = 0, r; char imm[3], r1[4]; if (window->line1 == NULL) return; if (strncmp("\tadd\tx", window->line1, 6) == 0) { c = window->line1[i++]; while (c != ',') { if (c == '\n' || c == '\0') return; c = window->line1[i++]; } ++i; c = window->line1[i++]; while (c != ',') { if (j == 3) return; r1[j++] = c; c = window->line1[i++]; } r = i; ++i; (void) memset(imm, 0, 3); j = 0; c = window->line1[i++]; while (c != '\n') { if (j == 2) return; imm[j++] = c; c = window->line1[i++]; } if (strcmp(imm, "#0") == 0) { window->line1[1] = 'm'; window->line1[2] = 'o'; window->line1[3] = 'v'; window->line1[r - 1] = '\n'; window->line1[r] = '\0'; } } }
We check if we are adding 0 to a register, and if we are, we change add
to mov and remove the immediate value from the instruction.
Taken with the previous optimization, we can optimize:
mov x0, #0 add x0, x1, x0
To:
mov x0, x1
Neat.
One significant difference between optimizing for AArch64 compared to AMD64 is in instruction length; AMD64 has variable-length instructions so it was possible that two instructions that did the same thing might be encoded into instructions of different lengths. However, AArch64 instructions are all four bytes in length. So we cannot made any code size wins by replacing instructions with an indentical number of instructions. But we can win on code size if we remove instructions. We might also win on execution speed but I did not measure it.
In any event, we able to reduce our code by a tiny bit which feels nice.