Dr. Brian Robert Callahan
academic, developer, with an eye towards a brighter techno-social life
All source code for this blog post can be found here.
Let's keep working on O, our peephole optimizer for cproc and QBE. Our previous posts sparked renewed attention over on the QBE mailing list. Today, I want to look at the inc and dec operations and see if it makes sense to replace adds and subs by one with incs and decs. If such replacements are sensible, does QBE already do this? If not, then that's a new set of optimizations that we can teach O.
Let's break out our compilers. I am running LLVM 13.0.0 and GCC 12.0.1 on OpenBSD. While LLVM 14.0.0 is out as of the end of March 2022, OpenBSD has not yet upgraded.
The first thing for us to do is figure out whether LLVM and GCC prefer incs and decs over adds and subs by one, and in what contexts they prefer them. To get ourselves started, I wrote an incredibly simple program that lets the compiler make a decision between inc and add one:
#include <stdio.h>

int
main(void)
{

    for (int i = 0; i < 1000; i++)
        printf("%d\n", i);

    return 0;
}
This loop is a little too large for my compilers to want to unroll it. Running gcc -S inc.c yields the following assembly:
    .file   "inc.c"
    .text
    .section    .rodata
.LC0:
    .string "%d\n"
    .text
    .globl  main
    .type   main, @function
main:
.LFB4:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    subq    $16, %rsp
    movl    $0, -4(%rbp)
    jmp .L2
.L3:
    movl    -4(%rbp), %eax
    movl    %eax, %esi
    leaq    .LC0(%rip), %rax
    movq    %rax, %rdi
    movl    $0, %eax
    call    printf@PLT
    addl    $1, -4(%rbp)
.L2:
    cmpl    $999, -4(%rbp)
    jle .L3
    movl    $0, %eax
    leave
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE4:
    .size   main, .-main
    .ident  "GCC: (GNU) 12.0.1 20220310 (experimental)"
It appears that GCC in this context prefers to add one directly to the stack. That doesn't seem very good. What if we turn on the usual optimizations and try again with gcc -O2 -S inc.c:
    .file   "inc.c"
    .text
    .section    .rodata.str1.1,"aMS",@progbits,1
.LC0:
    .string "%d\n"
    .section    .text.startup,"ax",@progbits
    .p2align 4
    .globl  main
    .type   main, @function
main:
.LFB4:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    pushq   %r12
    .cfi_offset 12, -24
    leaq    .LC0(%rip), %r12
    pushq   %rbx
    .cfi_offset 3, -32
    xorl    %ebx, %ebx
    .p2align 4,,10
    .p2align 3
.L2:
    movl    %ebx, %esi
    movq    %r12, %rdi
    xorl    %eax, %eax
    addl    $1, %ebx
    call    printf@PLT
    cmpl    $1000, %ebx
    jne .L2
    popq    %rbx
    xorl    %eax, %eax
    popq    %r12
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE4:
    .size   main, .-main
    .ident  "GCC: (GNU) 12.0.1 20220310 (experimental)"
Now it looks like GCC wants to add one to a register. That's a lot better. So GCC for a generic x86_64 processor prefers to add one to a register. LLVM prefers to add one to a register whether you issue -O2 or not.
So problem solved, right? We should always prefer adding one over using inc.
Not so fast...
-march and -mtune
You may have heard of the -march and -mtune flags. The GCC documentation spells it out better than I can, but here's the quick rundown: -march permits GCC to assume that our code will run on the specified CPU (or newer) and therefore GCC may use instructions not found on CPUs older than the one specified. The -mtune flag instructs GCC to generate code that runs on all machines (or on the machine specified by -march, if also specified) but, within that constraint, to generate the code most optimized for the specified CPU. It's important to note that -march implies -mtune except in some limited cases that are not going to affect us today.
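As a hypothetical illustration of the difference (any two CPU names from the lists below would do), a build like this is allowed to use any instruction a Haswell has, while scheduling the code with Skylake in mind:

/home/brian $ gcc -O2 -march=haswell -mtune=skylake -S inc.c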
Yes, the GCC documentation has the long list of all the options that can be passed to -march and -mtune. But I do not want to do all this by hand. It's too much work and will take too long. Instead, I'm going to write a shell script to iteratively test all the valid options for both LLVM and GCC.
The first task is to collect all the options. Additionally, I wonder if LLVM and GCC have the exact same set of options. If not, I have to collect the options for both. Let's start with GCC. If I use an invalid option to -march then GCC will give me an error message with a list of all valid options:
/home/brian $ gcc -march=help inc.c
cc1: error: bad value ('help') for '-march=' switch
cc1: note: valid arguments to '-march=' switch are: nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i haswell core-avx2 broadwell skylake skylake-avx512 cannonlake icelake-client rocketlake icelake-server cascadelake tigerlake cooperlake sapphirerapids alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm x86-64 x86-64-v2 x86-64-v3 x86-64-v4 eden-x2 nano nano-1000 nano-2000 nano-3000 nano-x2 eden-x4 nano-x4 k8 k8-sse3 opteron opteron-sse3 athlon64 athlon64-sse3 athlon-fx amdfam10 barcelona bdver1 bdver2 bdver3 bdver4 znver1 znver2 znver3 btver1 btver2 native
Time to break out our Unix skills to print out only the list of options. I also don't want that last native option, since that option selects the arch that your CPU reports itself to be and is therefore machine-dependent. Here's my quick and dirty one-liner:
/home/brian $ gcc -march=help inc.c 2>&1 | tail -n 1 | awk '{ for (i = 9; i < NF; i++) printf $i" "; }'
nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i haswell core-avx2 broadwell skylake skylake-avx512 cannonlake icelake-client rocketlake icelake-server cascadelake tigerlake cooperlake sapphirerapids alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm x86-64 x86-64-v2 x86-64-v3 x86-64-v4 eden-x2 nano nano-1000 nano-2000 nano-3000 nano-x2 eden-x4 nano-x4 k8 k8-sse3 opteron opteron-sse3 athlon64 athlon64-sse3 athlon-fx amdfam10 barcelona bdver1 bdver2 bdver3 bdver4 znver1 znver2 znver3 btver1 btver2
Neat. Let's redirect that into a new file that will become our shell script:
/home/brian $ gcc -march=help inc.c 2>&1 | tail -n 1 | awk '{ for (i = 9; i < NF; i++) printf $i" "; }' > script.sh
Now we need to do the same for LLVM:
/home/brian $ clang -march=help -S inc.c 2>&1 | tail -n 1 | awk '{ for (i = 7; i < NF + 1; i++) printf $i; }' | tr ',' ' '
nocona core2 penryn bonnell atom silvermont slm goldmont goldmont-plus tremont nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i haswell core-avx2 broadwell skylake skylake-avx512 skx cascadelake cooperlake cannonlake icelake-client rocketlake icelake-server tigerlake sapphirerapids alderlake knl knm k8 athlon64 athlon-fx opteron k8-sse3 athlon64-sse3 opteron-sse3 amdfam10 barcelona btver1 btver2 bdver1 bdver2 bdver3 bdver4 znver1 znver2 znver3 x86-64 x86-64-v2 x86-64-v3 x86-64-v4
I was right. LLVM and GCC have different options. Even though most of the options are the same, the two lists are not identical. Let's also add LLVM's list to our shell script:
/home/brian $ clang -march=help -S inc.c 2>&1 | tail -n 1 | awk '{ for (i = 7; i < NF + 1; i++) printf $i; }' | tr ',' ' ' >> script.sh
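Out of curiosity, here is an untested sketch, using only standard sort(1) and comm(1), that would show exactly which options are unique to each compiler:

/home/brian $ gcc -march=help inc.c 2>&1 | tail -n 1 | awk '{ for (i = 9; i < NF; i++) print $i }' | sort > gcc-archs
/home/brian $ clang -march=help -S inc.c 2>&1 | tail -n 1 | awk '{ for (i = 7; i <= NF; i++) print $i }' | tr -d ',' | sort > clang-archs
/home/brian $ comm -3 gcc-archs clang-archs

The comm -3 invocation suppresses the lines common to both files, leaving the GCC-only options in the first column and the LLVM-only options in the second.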
Now let's edit our script.sh file and turn it into something useful. Here's what I came up with:
#!/bin/sh

set -A compilers "clang gcc"
set -A opts "0 1 2 3 s z g fast"

for cc in $compilers ; do
    if [ "x$cc" = "xclang" ] ; then
        set -A archs "nocona core2 penryn bonnell atom silvermont slm goldmont goldmont-plus tremont nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i haswell core-avx2 broadwell skylake skylake-avx512 skx cascadelake cooperlake cannonlake icelake-client rocketlake icelake-server tigerlake sapphirerapids alderlake knl knm k8 athlon64 athlon-fx opteron k8-sse3 athlon64-sse3 opteron-sse3 amdfam10 barcelona btver1 btver2 bdver1 bdver2 bdver3 bdver4 znver1 znver2 znver3 x86-64 x86-64-v2 x86-64-v3 x86-64-v4"
    else
        set -A archs "nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i haswell core-avx2 broadwell skylake skylake-avx512 cannonlake icelake-client rocketlake icelake-server cascadelake tigerlake cooperlake sapphirerapids alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm x86-64 x86-64-v2 x86-64-v3 x86-64-v4 eden-x2 nano nano-1000 nano-2000 nano-3000 nano-x2 eden-x4 nano-x4 k8 k8-sse3 opteron opteron-sse3 athlon64 athlon64-sse3 athlon-fx amdfam10 barcelona bdver1 bdver2 bdver3 bdver4 znver1 znver2 znver3 btver1 btver2"
    fi

    for opt in $opts ; do
        addarchs=""
        adds=0
        incs=0

        for arch in $archs ; do
            $cc -O$opt -march=$arch -S -o - inc.c | grep -q 'incl' -
            if [ $? -eq 0 ] ; then
                incs=$(($incs+1))
            else
                adds=$(($adds+1))
                addarchs="$addarchs $arch"
            fi
        done

        echo "$cc -O$opt:"
        echo "adds\tincs"
        echo "----\t----"
        printf "%d\t%d\n" $adds $incs
        printf "Add archs:"
        for arch in $addarchs ; do
            printf " %s" $arch
        done
        printf "\n\n"
    done
done
And here is the output from our script:
clang -O0:
adds    incs
----    ----
56      0
Add archs: nocona core2 penryn bonnell atom silvermont slm goldmont goldmont-plus tremont nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i haswell core-avx2 broadwell skylake skylake-avx512 skx cascadelake cooperlake cannonlake icelake-client rocketlake icelake-server tigerlake sapphirerapids alderlake knl knm k8 athlon64 athlon-fx opteron k8-sse3 athlon64-sse3 opteron-sse3 amdfam10 barcelona btver1 btver2 bdver1 bdver2 bdver3 bdver4 znver1 znver2 znver3 x86-64 x86-64-v2 x86-64-v3 x86-64-v4

clang -O1:
adds    incs
----    ----
8       48
Add archs: silvermont slm goldmont goldmont-plus tremont knl knm x86-64

clang -O2:
adds    incs
----    ----
8       48
Add archs: silvermont slm goldmont goldmont-plus tremont knl knm x86-64

clang -O3:
adds    incs
----    ----
8       48
Add archs: silvermont slm goldmont goldmont-plus tremont knl knm x86-64

clang -Os:
adds    incs
----    ----
0       56
Add archs:

clang -Oz:
adds    incs
----    ----
0       56
Add archs:

clang -Og:
adds    incs
----    ----
8       48
Add archs: silvermont slm goldmont goldmont-plus tremont knl knm x86-64

clang -Ofast:
adds    incs
----    ----
8       48
Add archs: silvermont slm goldmont goldmont-plus tremont knl knm x86-64

gcc -O0:
adds    incs
----    ----
23      39
Add archs: nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm x86-64 x86-64-v2 x86-64-v3 x86-64-v4

gcc -O1:
adds    incs
----    ----
23      39
Add archs: nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm x86-64 x86-64-v2 x86-64-v3 x86-64-v4

gcc -O2:
adds    incs
----    ----
23      39
Add archs: nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm x86-64 x86-64-v2 x86-64-v3 x86-64-v4

gcc -O3:
adds    incs
----    ----
23      39
Add archs: nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm x86-64 x86-64-v2 x86-64-v3 x86-64-v4

gcc -Os:
adds    incs
----    ----
0       62
Add archs:

gcc -Oz:
adds    incs
----    ----
0       62
Add archs:

gcc -Og:
adds    incs
----    ----
23      39
Add archs: nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm x86-64 x86-64-v2 x86-64-v3 x86-64-v4

gcc -Ofast:
adds    incs
----    ----
23      39
Add archs: nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm x86-64 x86-64-v2 x86-64-v3 x86-64-v4
Here we see the number of archs that use inc and the number that use add $1. We also list the archs that use add $1 to get a better idea of the overall landscape.
There are some extremely interesting things for us to observe here. For one, LLVM and GCC clearly have different priorities for optimization: LLVM almost always prefers to use inc. At every level except -O0, -Os, and -Oz, LLVM uses inc 86% of the time (48 of 56 archs). LLVM will always use inc when optimizing for size; this is not particularly surprising, since incrementing a register is always smaller than adding one to a register. However, LLVM will always use add $1 at -O0.
GCC, on the other hand, has a wider spread of when it prefers inc: about 63% of the time (39 of 62 archs). Interestingly, GCC will use inc even at -O0. Like LLVM, GCC will always use inc when optimizing for size. There even appears to be some agreement on which archs not to use inc on: the Atom CPU family and its offspring are the primary targets where inc is avoided.
These numbers make me wonder why add $1 is preferred over inc for the generic x86-64 arch with LLVM: LLVM believes that inc is better for all CPUs except the Atom CPUs. Doubly so since we see that LLVM does prefer inc for the generic x86-64-v2, x86-64-v3, and x86-64-v4 archs. Perhaps LLVM should be changed to prefer inc when using the generic x86-64 arch. It is not the case that inc would be unavailable on any x86_64 CPU: the inc instruction goes all the way back to the original 8086/8088 instruction set. It even dates back to the pre-x86 world, as the Intel 8080 and Intel 8008 had increment (and decrement) instructions.
However, it may be the case that the world simply has orders of magnitude more Atom family CPUs than any other type of x86_64 CPU. Perhaps the Atom family only makes up 14% of the x86_64 CPU lines released, but those 14% of CPU lines make up over 90% of all x86_64 CPUs manufactured. I don't know, but if that is the case then I would say that LLVM should continue to select add $1 for the generic x86-64 arch.
On the other hand, while GCC also selects inc over add $1 more often than not, it does not do so nearly as often as LLVM. Like LLVM, GCC does not select inc for the generic x86-64 arch. This makes me think that perhaps LLVM is preferring size whereas GCC might be preferring latency and throughput, which you can learn a lot about at this website. Whether to move to inc for the generic x86-64 arch is a harder call for GCC. While I think it may make a lot of sense for LLVM to do so because of its apparent focus on code size, the throughput and latency numbers from studying the uops tables might make for a more nuanced approach with GCC. I'd have to study the uops tables more to make more than an educated guess.
Neither LLVM nor GCC is "wrong" in its selections. It just goes to show that there is no one right answer to optimization. You oftentimes have to prioritize one thing over another. And that's OK. And you can even select the compiler whose optimization priorities match yours!
One last question I have is why LLVM always uses add $1 at -O0. After all, GCC is smart enough to select inc at all optimization levels, so why shouldn't LLVM be smart enough too? I imagine the LLVM developers might say something like this: because inc and add affect the FLAGS register differently (specifically, the carry flag is not affected by inc), you must have some sort of liveness tracking for the carry flag to ensure that no conditional instruction that uses the carry flag depends on the result of the add $1, and this liveness check is most appropriately handled in the optimization passes. I can certainly understand that answer even if I don't think it is the best answer.
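To make the hazard concrete, here is a small hypothetical example of mine (not from cproc) where the carry flag of an add $1 could be live. A compiler is free to lower the unsigned wraparound check below to an add followed by jb, since the add sets the carry flag exactly when x + 1 wraps; whether a given compiler actually does so is another matter.

/*
 * Hypothetical example: the wraparound check may compile to
 * "addl $1, %edi" followed by "jb" (jump if carry).  Because inc
 * does not set the carry flag, blindly rewriting that add as an
 * inc would make the jb test stale flags.
 */
unsigned int
next(unsigned int x)
{
    unsigned int y = x + 1;

    if (y < x)      /* true only if x + 1 wrapped to 0 */
        return 0;

    return y;
}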
What should O do?
Armed with this new knowledge, we can begin to think about whether or not we should implement add $1 to inc for O. I think it makes sense for us to follow LLVM's approach of preferring smaller code size. I don't think QBE (and certainly not O) really needs lots of special casing for each individual arch, which is what would allow us to select best for throughput and latency like GCC appears to. Additionally, O cannot really have much of a liveness check, though one was mentioned on the QBE mailing list. We're just going to take advantage of the fact that cproc seemingly never issues a conditional instruction without issuing a cmp immediately before, so we need not worry about O needing any liveness checks. With that said, since I cannot fully rule out that cproc could generate code that does have this dependency, I've tweaked our patch to the cproc driver to debug any cases where O fails. What I did was make the -O flag to cproc track an Oflag variable, and in the case where you also issue -S, the Oflag variable will enable or disable sending the QBE-generated assembly to O. Now if there are any problems with O, I can study the code generated with and without O and figure out what needs fixing.
Incidentally, this same FLAGS issue could affect our previous mov $0 to xor optimization, so having the ability to debug in case of failure is nice there too. I can imagine some day when cproc understands inline assembly, someone could write something that causes O and its optimizations to be incorrect, but hopefully by then QBE will make these optimizations directly, with liveness checks, so that O can move on to other, more challenging optimizations.
Does QBE do add $1 to inc already?
Let's quickly see if QBE is already making this optimization. Let's compile our short test program with cproc -S:
.data
.balign 1
.Lstring.6:
    .ascii "%d\012\000"
/* end data */

.text
.globl main
main:
    pushq %rbp
    movq %rsp, %rbp
    subq $8, %rsp
    pushq %rbx
    xorl %ebx, %ebx
.Lbb22:
    cmpl $1000, %ebx
    jge .Lbb25
    movl %ebx, %esi
    leaq .Lstring.6(%rip), %rdi
    xorl %eax, %eax
    callq printf
    addl $1, %ebx
    jmp .Lbb22
.Lbb25:
    xorl %eax, %eax
    popq %rbx
    leave
    ret
.type main, @function
.size main, .-main
/* end function main */

.section .note.GNU-stack,"",@progbits
So, no, QBE does not. We'll need to teach QBE about this optimization.
Teaching O about inc
Teaching O about this optimization is very similar to teaching it the mov $0 to xor optimization: it is a simple one-line match and replace. In code, it looks like this:
static int
incq(struct peephole *window)
{
    char buf[32], r1a, r1b;

    if (window->line1 == NULL)
        return 0;

    /* Rewrite "addq $1, %reg" as "incq %reg". */
    if (!strncmp("\taddq $1, %r", window->line1, 12)) {
        if (strlen(window->line1) < 14)
            return 0;

        r1a = window->line1[12];
        r1b = window->line1[13];

        /* Register names are two or three characters long. */
        if (r1b == '\n') {
            (void) snprintf(buf, sizeof(buf), "\tincq %%r%c\n", r1a);
        } else {
            (void) snprintf(buf, sizeof(buf), "\tincq %%r%c%c\n",
                r1a, r1b);
        }

        free(window->line1);
        window->line1 = xstrdup(buf);

        return 1;
    }

    return 0;
}

static int
incl(struct peephole *window)
{
    char buf[32], e1a, e1b;

    if (window->line1 == NULL)
        return 0;

    /* Rewrite "addl $1, %exx" as "incl %exx". */
    if (!strncmp("\taddl $1, %e", window->line1, 12)) {
        if (strlen(window->line1) != 15)
            return 0;

        e1a = window->line1[12];
        e1b = window->line1[13];

        (void) snprintf(buf, sizeof(buf), "\tincl %%e%c%c\n",
            e1a, e1b);

        free(window->line1);
        window->line1 = xstrdup(buf);

        return 1;
    } else if (!strncmp("\taddl $1, %r", window->line1, 12)) {
        /* The 32-bit halves of the numbered registers, %r8d-%r15d. */
        if (strlen(window->line1) < 14)
            return 0;

        e1a = window->line1[12];
        e1b = window->line1[13];

        if (e1b == 'd') {
            (void) snprintf(buf, sizeof(buf), "\tincl %%r%cd\n", e1a);
        } else {
            (void) snprintf(buf, sizeof(buf), "\tincl %%r%c%cd\n",
                e1a, e1b);
        }

        free(window->line1);
        window->line1 = xstrdup(buf);

        return 1;
    }

    return 0;
}
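As a quick sanity check, here is a hypothetical harness, not part of O itself, that would need to be compiled into O's source since it relies on struct peephole and xstrdup():

/* Hypothetical test harness for incq(); not part of O. */
static void
test_incq(void)
{
    struct peephole window = { .line1 = xstrdup("\taddq $1, %rbx\n") };

    /* incq() rewrites window.line1 in place and returns 1 on a match. */
    if (incq(&window))
        fputs(window.line1, stdout);    /* prints "\tincq %rbx" */

    free(window.line1);
}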
And that's it. It's nearly a copy and paste of the xorq and xorl functions.
dec
Now let's see if things are the same for sub $1 to dec. Here is our slightly changed sample program:
#include <stdio.h>

int
main(void)
{

    for (int i = 1000; i > 0; i--)
        printf("%d\n", i);

    return 0;
}
I modified our shell script slightly. Here is the output:
clang -O0:
subs    decs
----    ----
56      0
Sub archs: nocona core2 penryn bonnell atom silvermont slm goldmont goldmont-plus tremont nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i haswell core-avx2 broadwell skylake skylake-avx512 skx cascadelake cooperlake cannonlake icelake-client rocketlake icelake-server tigerlake sapphirerapids alderlake knl knm k8 athlon64 athlon-fx opteron k8-sse3 athlon64-sse3 opteron-sse3 amdfam10 barcelona btver1 btver2 bdver1 bdver2 bdver3 bdver4 znver1 znver2 znver3 x86-64 x86-64-v2 x86-64-v3 x86-64-v4

clang -O1:
subs    decs
----    ----
8       48
Sub archs: silvermont slm goldmont goldmont-plus tremont knl knm x86-64

clang -O2:
subs    decs
----    ----
8       48
Sub archs: silvermont slm goldmont goldmont-plus tremont knl knm x86-64

clang -O3:
subs    decs
----    ----
8       48
Sub archs: silvermont slm goldmont goldmont-plus tremont knl knm x86-64

clang -Os:
subs    decs
----    ----
0       56
Sub archs:

clang -Oz:
subs    decs
----    ----
56      0
Sub archs: nocona core2 penryn bonnell atom silvermont slm goldmont goldmont-plus tremont nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i haswell core-avx2 broadwell skylake skylake-avx512 skx cascadelake cooperlake cannonlake icelake-client rocketlake icelake-server tigerlake sapphirerapids alderlake knl knm k8 athlon64 athlon-fx opteron k8-sse3 athlon64-sse3 opteron-sse3 amdfam10 barcelona btver1 btver2 bdver1 bdver2 bdver3 bdver4 znver1 znver2 znver3 x86-64 x86-64-v2 x86-64-v3 x86-64-v4

clang -Og:
subs    decs
----    ----
8       48
Sub archs: silvermont slm goldmont goldmont-plus tremont knl knm x86-64

clang -Ofast:
subs    decs
----    ----
8       48
Sub archs: silvermont slm goldmont goldmont-plus tremont knl knm x86-64

gcc -O0:
subs    decs
----    ----
23      39
Sub archs: nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm x86-64 x86-64-v2 x86-64-v3 x86-64-v4

gcc -O1:
subs    decs
----    ----
23      39
Sub archs: nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm x86-64 x86-64-v2 x86-64-v3 x86-64-v4

gcc -O2:
subs    decs
----    ----
23      39
Sub archs: nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm x86-64 x86-64-v2 x86-64-v3 x86-64-v4

gcc -O3:
subs    decs
----    ----
23      39
Sub archs: nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm x86-64 x86-64-v2 x86-64-v3 x86-64-v4

gcc -Os:
subs    decs
----    ----
0       62
Sub archs:

gcc -Oz:
subs    decs
----    ----
0       62
Sub archs:

gcc -Og:
subs    decs
----    ----
23      39
Sub archs: nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm x86-64 x86-64-v2 x86-64-v3 x86-64-v4

gcc -Ofast:
subs    decs
----    ----
23      39
Sub archs: nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm x86-64 x86-64-v2 x86-64-v3 x86-64-v4
Almost everything here appears exactly the same. Wildly, clang -Oz never uses dec despite the fact that clang -Os always uses dec. This is because at -Oz LLVM actually uses a jb immediately after the sub $1 to save a cmp! And since jb tests the carry flag, which dec never touches, the trick only works with sub. Cool.
OK, so that probably means all the same inferences and the same course of action as inc. Let's teach O about dec as well:
static int
decq(struct peephole *window)
{
    char buf[32], r1a, r1b;

    if (window->line1 == NULL)
        return 0;

    /* Rewrite "subq $1, %reg" as "decq %reg". */
    if (!strncmp("\tsubq $1, %r", window->line1, 12)) {
        if (strlen(window->line1) < 14)
            return 0;

        r1a = window->line1[12];
        r1b = window->line1[13];

        /* Register names are two or three characters long. */
        if (r1b == '\n') {
            (void) snprintf(buf, sizeof(buf), "\tdecq %%r%c\n", r1a);
        } else {
            (void) snprintf(buf, sizeof(buf), "\tdecq %%r%c%c\n",
                r1a, r1b);
        }

        free(window->line1);
        window->line1 = xstrdup(buf);

        return 1;
    }

    return 0;
}

static int
decl(struct peephole *window)
{
    char buf[32], e1a, e1b;

    if (window->line1 == NULL)
        return 0;

    /* Rewrite "subl $1, %exx" as "decl %exx". */
    if (!strncmp("\tsubl $1, %e", window->line1, 12)) {
        if (strlen(window->line1) != 15)
            return 0;

        e1a = window->line1[12];
        e1b = window->line1[13];

        (void) snprintf(buf, sizeof(buf), "\tdecl %%e%c%c\n",
            e1a, e1b);

        free(window->line1);
        window->line1 = xstrdup(buf);

        return 1;
    } else if (!strncmp("\tsubl $1, %r", window->line1, 12)) {
        /* The 32-bit halves of the numbered registers, %r8d-%r15d. */
        if (strlen(window->line1) < 14)
            return 0;

        e1a = window->line1[12];
        e1b = window->line1[13];

        if (e1b == 'd') {
            (void) snprintf(buf, sizeof(buf), "\tdecl %%r%cd\n", e1a);
        } else {
            (void) snprintf(buf, sizeof(buf), "\tdecl %%r%c%cd\n",
                e1a, e1b);
        }

        free(window->line1);
        window->line1 = xstrdup(buf);

        return 1;
    }

    return 0;
}
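And a matching hypothetical harness for the trickier decl() path, the 32-bit halves of the numbered registers:

/* Hypothetical test harness for decl(); not part of O. */
static void
test_decl(void)
{
    struct peephole window = { .line1 = xstrdup("\tsubl $1, %r10d\n") };

    if (decl(&window))
        fputs(window.line1, stdout);    /* prints "\tdecl %r10d" */

    free(window.line1);
}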
Lastly, we need to hook these new optimizations up:
static void
one(struct peephole *window)
{

    if (xorq(window))
        return;

    if (xorl(window))
        return;

    if (incq(window))
        return;

    if (incl(window))
        return;

    if (decq(window))
        return;

    (void) decl(window);
}
Let's measure the size improvements before and after, as measured by the size of cproc itself. I don't anticipate the differences being as large as with the mov $0 to xor optimization, as the size difference between add $1 and inc, and between sub $1 and dec, is only one byte: from three bytes to two bytes.
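For example, the 32-bit register forms encode like this in 64-bit mode (bytes quoted from memory, so verify against the instruction set reference before relying on them):

    83 c3 01        addl $1, %ebx   /* 3 bytes */
    ff c3           incl %ebx       /* 2 bytes */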
Here is the before:
/home/brian $ size /usr/local/bin/cproc
text    data    bss     dec     hex
16008   2678    584     19270   4b46
/home/brian $ size /usr/local/bin/cproc-qbe
text    data    bss     dec     hex
97270   20240   964     118474  1ceca
And here is the after:
/home/brian $ size /usr/local/bin/cproc
text    data    bss     dec     hex
15980   2678    584     19242   4b2a
/home/brian $ size /usr/local/bin/cproc-qbe
text    data    bss     dec     hex
97114   20240   964     118318  1ce2e
For the cproc executable, that's a reduction of 28 bytes, or a 0.175% size reduction. For the cproc-qbe executable, we have a reduction of 156 bytes, or a 0.160% size reduction.
This may be one of those optimizations where you always "win" if size is your concern, but if performance is your concern it becomes a lot more difficult to discern whether you've "won." For me, I want O to prefer size, since that is easier for me to control, and the CPU in my machine happens to identify as a Haswell, one of the CPUs that both LLVM and GCC select inc for. Your preferences may differ from mine.
Studying how LLVM and GCC select for these two optimizations led to some interesting surprises. We learned that, at least for these two optimizations, LLVM and GCC are likely prioritizing different things. Even though I chose to follow the LLVM approach for O, it would be equally valid to follow the GCC approach. We also learned that LLVM is unable or unwilling (or both) to apply these optimizations at -O0, whereas GCC happily applies them at all levels. Finally, we learned that LLVM and GCC agree that these optimizations are universal wins when optimizing for size and will always select them when using -Os and -Oz.
I think QBE is probably better with these two optimizations, even though they are not very big size reductions. If QBE is going to be doing liveness tracking of the FLAGS register for the mov $0 to xor case, then these two can easily come along for the ride. I think cproc is definitely better with these optimizations: it is a small C compiler that does not have the ability to run all the optimizations that LLVM and GCC can, so anything that reduces the compiled size of the executables and libraries it builds is valuable.
I hope you enjoyed this further look into compiler optimizations. Perhaps we will continue to explore more such compiler optimizations.