Dr. Brian Robert Callahan

academic, developer, with an eye towards a brighter techno-social life




2022-04-17
I wrote a peephole optimizer (for QBE), part 3: Studying the LLVM and GCC optimizers

All source code for this blog post can be found here.

Let's keep working on O, our peephole optimizer for cproc and QBE. Our previous posts sparked renewed attention over on the QBE mailing list. Today, I want to look at the inc and dec operations and see if it makes sense to replace adds and subs by one with incs and decs. If such replacements are sensible, does QBE already do this? If not, then that's a new set of optimizations that we can teach O.

What would LLVM and GCC do?

Let's break out our compilers. I am running LLVM 13.0.0 and GCC 12.0.1 on OpenBSD. While LLVM 14.0.0 is out as of the end of March 2022, OpenBSD has not yet upgraded.

The first thing for us to do is figure out whether LLVM and GCC prefer incs and decs over adds and subs by one, and in what contexts they prefer them. To get ourselves started, I wrote an incredibly simple program that lets the compiler make a decision between inc and add one:

#include <stdio.h>

int
main(void)
{
	for (int i = 0; i < 1000; i++)
		printf("%d\n", i);

	return 0;
}

This loop is a little too large for my compilers to want to unroll it. Running gcc -S inc.c yields the following assembly:

	.file	"inc.c"
	.text
	.section	.rodata
.LC0:
	.string	"%d\n"
	.text
	.globl	main
	.type	main, @function
main:
.LFB4:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register 6
	subq	$16, %rsp
	movl	$0, -4(%rbp)
	jmp	.L2
.L3:
	movl	-4(%rbp), %eax
	movl	%eax, %esi
	leaq	.LC0(%rip), %rax
	movq	%rax, %rdi
	movl	$0, %eax
	call	printf@PLT
	addl	$1, -4(%rbp)
.L2:
	cmpl	$999, -4(%rbp)
	jle	.L3
	movl	$0, %eax
	leave
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
.LFE4:
	.size	main, .-main
	.ident	"GCC: (GNU) 12.0.1 20220310 (experimental)"

It appears that GCC in this context prefers to add one directly to the variable's slot on the stack. That doesn't seem very good. What if we turn on the usual optimizations and try again with gcc -O2 -S inc.c:

	.file	"inc.c"
	.text
	.section	.rodata.str1.1,"aMS",@progbits,1
.LC0:
	.string	"%d\n"
	.section	.text.startup,"ax",@progbits
	.p2align 4
	.globl	main
	.type	main, @function
main:
.LFB4:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register 6
	pushq	%r12
	.cfi_offset 12, -24
	leaq	.LC0(%rip), %r12
	pushq	%rbx
	.cfi_offset 3, -32
	xorl	%ebx, %ebx
	.p2align 4,,10
	.p2align 3
.L2:
	movl	%ebx, %esi
	movq	%r12, %rdi
	xorl	%eax, %eax
	addl	$1, %ebx
	call	printf@PLT
	cmpl	$1000, %ebx
	jne	.L2
	popq	%rbx
	xorl	%eax, %eax
	popq	%r12
	popq	%rbp
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
.LFE4:
	.size	main, .-main
	.ident	"GCC: (GNU) 12.0.1 20220310 (experimental)"

Now it looks like GCC wants to add one to a register. That's a lot better. So for a generic x86_64 processor, GCC prefers to add one to a register. LLVM does the same whether you issue -O2 or not.

So problem solved, right? We should always prefer to add one over inc.

Not so fast...

Introducing -march and -mtune

You may have heard of the -march and -mtune flags. The GCC documentation spells it out better than I can, but here's the quick rundown: -march permits GCC to assume that our code will run on the specified CPU (or newer) and therefore to use instructions not found on CPUs older than the one specified. The -mtune flag instructs GCC to generate code that runs on all machines (or on the machine specified by -march, if that is also given) but, within that constraint, to generate code best optimized for the specified CPU. It's important to note that -march implies -mtune except in some limited cases that are not going to affect us today.
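
For example, this hypothetical invocation (not one we need today) tells GCC to emit code that runs on any x86-64-v2 CPU but is scheduled to run best on a Skylake:

/home/brian $ gcc -O2 -march=x86-64-v2 -mtune=skylake -S inc.c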

Yes, the GCC documentation has the full list of all the options that can be passed to -march and -mtune. But I do not want to test all of these by hand; it's too much work and would take too long. Instead, I'm going to write a shell script to iteratively test all the valid options for both LLVM and GCC.

The first task is to collect all the options. Additionally, I wonder if LLVM and GCC have the exact same set of options. If not, I have to collect the options for both. Let's start with GCC. If I use an invalid option to -march then GCC will give me an error message with a list of all valid options:

/home/brian $ gcc -march=help inc.c
cc1: error: bad value ('help') for '-march=' switch
cc1: note: valid arguments to '-march=' switch are: nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i haswell core-avx2 broadwell skylake skylake-avx512 cannonlake icelake-client rocketlake icelake-server cascadelake tigerlake cooperlake sapphirerapids alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm x86-64 x86-64-v2 x86-64-v3 x86-64-v4 eden-x2 nano nano-1000 nano-2000 nano-3000 nano-x2 eden-x4 nano-x4 k8 k8-sse3 opteron opteron-sse3 athlon64 athlon64-sse3 athlon-fx amdfam10 barcelona bdver1 bdver2 bdver3 bdver4 znver1 znver2 znver3 btver1 btver2 native

Time to break out our Unix skills to print out only the list of options. The arch list starts at the ninth field of that note line, and stopping the awk loop at i < NF drops the final field, which is the native option I don't want: native selects whatever arch your CPU reports itself to be and is therefore machine-dependent. Here's my quick and dirty one-liner:

/home/brian $ gcc -march=help inc.c 2>&1 | tail -n 1 | awk '{ for (i = 9; i < NF; i++) printf $i" "; }'
nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i haswell core-avx2 broadwell skylake skylake-avx512 cannonlake icelake-client rocketlake icelake-server cascadelake tigerlake cooperlake sapphirerapids alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm x86-64 x86-64-v2 x86-64-v3 x86-64-v4 eden-x2 nano nano-1000 nano-2000 nano-3000 nano-x2 eden-x4 nano-x4 k8 k8-sse3 opteron opteron-sse3 athlon64 athlon64-sse3 athlon-fx amdfam10 barcelona bdver1 bdver2 bdver3 bdver4 znver1 znver2 znver3 btver1 btver2

Neat. Let's redirect that into a new file that will become our shell script:

/home/brian $ gcc -march=help inc.c 2>&1 | tail -n 1 | awk '{ for (i = 9; i < NF; i++) printf $i" "; }' > script.sh

Now we need to do the same for LLVM:

/home/brian $ clang -march=help -S inc.c 2>&1 | tail -n 1 | awk '{ for (i = 7; i < NF + 1; i++) printf $i; }' | tr ',' ' '
nocona core2 penryn bonnell atom silvermont slm goldmont goldmont-plus tremont nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i haswell core-avx2 broadwell skylake skylake-avx512 skx cascadelake cooperlake cannonlake icelake-client rocketlake icelake-server tigerlake sapphirerapids alderlake knl knm k8 athlon64 athlon-fx opteron k8-sse3 athlon64-sse3 opteron-sse3 amdfam10 barcelona btver1 btver2 bdver1 bdver2 bdver3 bdver4 znver1 znver2 znver3 x86-64 x86-64-v2 x86-64-v3 x86-64-v4

I was right: LLVM and GCC have different options. Even though most of the options are the same, the two sets are not identical. Let's also add LLVM's list to our shell script:

/home/brian $ clang -march=help -S inc.c 2>&1 | tail -n 1 | awk '{ for (i = 7; i < NF + 1; i++) printf $i; }' | tr ',' ' ' >> script.sh

Now let's edit our script.sh file and turn it into something useful. Here's what I came up with:

#!/bin/sh
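# For each compiler and -O level, count how many -march targets produce
# incl versus addl $1 in the generated assembly. (set -A is ksh array
# syntax; OpenBSD's /bin/sh is ksh.)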

set -A compilers "clang gcc"
set -A opts "0 1 2 3 s z g fast"

for cc in $compilers ; do
  if [ "x$cc" = "xclang" ] ; then
    set -A archs "nocona core2 penryn bonnell atom silvermont slm goldmont goldmont-plus tremont nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i haswell core-avx2 broadwell skylake skylake-avx512 skx cascadelake cooperlake cannonlake icelake-client rocketlake icelake-server tigerlake sapphirerapids alderlake knl knm k8 athlon64 athlon-fx opteron k8-sse3 athlon64-sse3 opteron-sse3 amdfam10 barcelona btver1 btver2 bdver1 bdver2 bdver3 bdver4 znver1 znver2 znver3 x86-64 x86-64-v2 x86-64-v3 x86-64-v4"
  else
    set -A archs "nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i haswell core-avx2 broadwell skylake skylake-avx512 cannonlake icelake-client rocketlake icelake-server cascadelake tigerlake cooperlake sapphirerapids alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm x86-64 x86-64-v2 x86-64-v3 x86-64-v4 eden-x2 nano nano-1000 nano-2000 nano-3000 nano-x2 eden-x4 nano-x4 k8 k8-sse3 opteron opteron-sse3 athlon64 athlon64-sse3 athlon-fx amdfam10 barcelona bdver1 bdver2 bdver3 bdver4 znver1 znver2 znver3 btver1 btver2"
  fi

  for opt in $opts ; do
    addarchs=""
    adds=0
    incs=0
    for arch in $archs ; do
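      # Compile for this arch and check whether the assembly uses incl.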
      $cc -O$opt -march=$arch -S -o - inc.c | grep -q 'incl' -
      if [ $? -eq 0 ] ; then
        incs=$(($incs+1))
      else
        adds=$(($adds+1))
        addarchs="$addarchs $arch"
      fi
    done

    echo "$cc -O$opt:"
    echo "adds\tincs"
    echo "----\t----"
    printf "%d\t%d\n" $adds $incs
    printf "Add archs:"
    for arch in $addarchs ; do
      printf " %s" $arch
    done
    printf "\n\n"
  done
done

And here is the output from our script:

clang -O0:
adds	incs
----	----
56	0
Add archs: nocona core2 penryn bonnell atom silvermont slm goldmont goldmont-plus tremont nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i haswell core-avx2 broadwell skylake skylake-avx512 skx cascadelake cooperlake cannonlake icelake-client rocketlake icelake-server tigerlake sapphirerapids alderlake knl knm k8 athlon64 athlon-fx opteron k8-sse3 athlon64-sse3 opteron-sse3 amdfam10 barcelona btver1 btver2 bdver1 bdver2 bdver3 bdver4 znver1 znver2 znver3 x86-64 x86-64-v2 x86-64-v3 x86-64-v4

clang -O1:
adds	incs
----	----
8	48
Add archs: silvermont slm goldmont goldmont-plus tremont knl knm x86-64

clang -O2:
adds	incs
----	----
8	48
Add archs: silvermont slm goldmont goldmont-plus tremont knl knm x86-64

clang -O3:
adds	incs
----	----
8	48
Add archs: silvermont slm goldmont goldmont-plus tremont knl knm x86-64

clang -Os:
adds	incs
----	----
0	56
Add archs:

clang -Oz:
adds	incs
----	----
0	56
Add archs:

clang -Og:
adds	incs
----	----
8	48
Add archs: silvermont slm goldmont goldmont-plus tremont knl knm x86-64

clang -Ofast:
adds	incs
----	----
8	48
Add archs: silvermont slm goldmont goldmont-plus tremont knl knm x86-64

gcc -O0:
adds	incs
----	----
23	39
Add archs: nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm x86-64 x86-64-v2 x86-64-v3 x86-64-v4

gcc -O1:
adds	incs
----	----
23	39
Add archs: nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm x86-64 x86-64-v2 x86-64-v3 x86-64-v4

gcc -O2:
adds	incs
----	----
23	39
Add archs: nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm x86-64 x86-64-v2 x86-64-v3 x86-64-v4

gcc -O3:
adds	incs
----	----
23	39
Add archs: nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm x86-64 x86-64-v2 x86-64-v3 x86-64-v4

gcc -Os:
adds	incs
----	----
0	62
Add archs:

gcc -Oz:
adds	incs
----	----
0	62
Add archs:

gcc -Og:
adds	incs
----	----
23	39
Add archs: nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm x86-64 x86-64-v2 x86-64-v3 x86-64-v4

gcc -Ofast:
adds	incs
----	----
23	39
Add archs: nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm x86-64 x86-64-v2 x86-64-v3 x86-64-v4

Here we see, for each compiler and optimization level, the number of archs that use inc and the number that use add $1. We also list which archs use add $1 to get a better idea of the overall landscape.

There are some extremely interesting things for us to observe here. For one, LLVM and GCC clearly have different priorities for optimization: LLVM almost always prefers to use inc. At every level other than -O0, -Os, and -Oz, LLVM uses inc for 86% of archs (48 of 56). LLVM will always use inc when optimizing for size; this is not particularly surprising, since incrementing a register is always smaller than adding one to a register. However, LLVM will always use add $1 at -O0.

GCC on the other hand has a wider spread of when it prefers inc, choosing it about 63% of the time. Interestingly, GCC will use inc even at -O0. Like LLVM, GCC will always use inc when optimizing for size. There even appears to be some agreement on which archs should avoid inc: the Atom CPU family and its descendants are the primary targets that stick with add $1.

Questions

These numbers make me wonder why LLVM prefers add $1 over inc for the generic x86-64 arch: LLVM believes that inc is better for all CPUs except the Atom CPUs. Doubly so since we see that LLVM does prefer inc for the generic x86-64-v2, x86-64-v3, and x86-64-v4 archs. Perhaps LLVM should be changed to prefer inc when using the generic x86-64 arch. It is not the case that inc would be unavailable on any x86_64 CPU: the inc instruction goes all the way back to the original 8086/8088 instruction set. It even dates back to the pre-x86 world, as even the Intel 8080 and Intel 8008 had increment (and decrement) instructions.

However, it may be the case that the world simply has orders of magnitude more Atom family CPUs than any other type of x86_64 CPU. Perhaps the Atom family only makes up 14% of the x86_64 CPU lines released but those 14% of CPU lines make up over 90% of all x86_64 CPUs manufactured. I don't know, but if that is the case then I would say that LLVM should continue to select add $1 for the generic x86-64 arch.

On the other hand, while GCC also selects inc over add $1 more often than not, it does not do so nearly as often as LLVM. Like LLVM, GCC does not select inc for the generic x86-64 arch. This makes me think that perhaps LLVM is prioritizing size whereas GCC might be prioritizing latency and throughput, which you can learn a lot about at this website. GCC makes it a harder call whether or not to move to inc for the generic x86-64 arch. While I think it may make a lot of sense for LLVM to do so because of its apparent focus on code size, it might be the case that the throughput and latency numbers from the uops tables make for a more nuanced approach with GCC. I'd have to study the uops tables more to offer more than an educated guess.

Neither LLVM nor GCC is "wrong" in its selections. It just goes to show that there is no one right answer to optimization. You oftentimes have to prioritize one thing over another. And that's OK. And you can even select the compiler whose optimization priorities match yours!

One last question I have is why LLVM always uses add $1 at -O0. After all, GCC is smart enough to select inc at all optimization levels; why shouldn't LLVM be too? I imagine the LLVM developers might answer something like this: because inc and add affect the FLAGS register differently (specifically, inc does not touch the carry flag while add overwrites it), you must have some sort of liveness tracking for the carry flag to ensure that no conditional instruction that uses the carry flag depends on the result of the add $1, and such a liveness check is most appropriately handled in the optimization passes. I can certainly understand that answer even if I don't think it is the best answer.
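
To make the hazard concrete, here is a contrived pair of instructions (my own example, not something cproc emits today):

	addq	$1, %rax	# sets CF if %rax wraps past its maximum value
	jb	.Lwrapped	# jb tests CF, so it depends on the addq

If a peephole pass blindly rewrote the addq as incq, the carry flag would be left over from some earlier instruction and the jb could branch the wrong way.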

What should O do?

Armed with this new knowledge, we can begin to think about whether or not we should implement add $1 to inc for O. I think it makes sense for us to follow LLVM's approach of preferring smaller code size. I don't think QBE (and certainly not O) really needs lots of special casing for each individual arch that would let us select for the best throughput and latency like GCC appears to. Additionally, O cannot really have much of a liveness check, though one was mentioned on the QBE mailing list. We're just going to take advantage of the fact that cproc appears never to issue a conditional instruction without issuing a cmp immediately before it, so we need not worry about O needing any liveness checks. With that said, since I cannot fully rule out that cproc could generate code that does have this dependency, I've tweaked our patch to the cproc driver to debug any cases where O fails. What I did was make the -O flag to cproc track an Oflag variable; in the case where you also issue -S, the Oflag variable enables or disables sending the QBE-generated assembly to O. Now if there are any problems with O, I can study the code generated with and without O and figure out what needs fixing.
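
Here is a minimal, self-contained sketch of that idea. The Oflag name comes from the description above, but everything else (the ./O path, the stand-in assembly line) is my own illustration, not cproc's actual driver code:

#include <stdio.h>

static int Oflag = 1;	/* hypothetical: set by -O, cleared when debugging */

int
main(void)
{
	/* Send the assembly through O only when optimizing, so the output
	 * with and without O can be compared whenever O misbehaves. */
	FILE *out = Oflag ? popen("./O", "w") : stdout;

	if (out == NULL)
		return 1;

	fputs("\taddl $1, %ebx\n", out);	/* stand-in for QBE's output */

	if (Oflag)
		pclose(out);

	return 0;
}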

Incidentally, this same FLAGS issue could affect our previous mov $0 to xor optimization (mov leaves the flags alone while xor clobbers them), so perhaps having the ability to debug in case of failure would be nice. I can imagine some day, when cproc understands inline assembly, someone could write something that causes O and its optimizations to be incorrect, but hopefully by then QBE makes these optimizations directly with liveness checks so that O can move on to other, more challenging optimizations.

Does QBE optimize add $1 to inc already?

Let's quickly check whether QBE is already making this optimization by compiling our short test program with cproc -S:

.data
.balign 1
.Lstring.6:
	.ascii "%d\012\000"
/* end data */

.text
.globl main
main:
	pushq %rbp
	movq %rsp, %rbp
	subq $8, %rsp
	pushq %rbx
	xorl %ebx, %ebx
.Lbb22:
	cmpl $1000, %ebx
	jge .Lbb25
	movl %ebx, %esi
	leaq .Lstring.6(%rip), %rdi
	xorl %eax, %eax
	callq printf
	addl $1, %ebx
	jmp .Lbb22
.Lbb25:
	xorl %eax, %eax
	popq %rbx
	leave
	ret
.type main, @function
.size main, .-main
/* end function main */

.section .note.GNU-stack,"",@progbits

So no, QBE does not. We'll need to teach O about this optimization.

Teaching O about inc

Teaching O about this optimization is very similar to the mov $0 to xor optimization: it is a simple single-line match and replace. In code, it looks like this:

static int
incq(struct peephole *window)
{
	char buf[32], r1a, r1b;

	if (window->line1 == NULL)
		return 0;

	if (!strncmp("\taddq $1, %r", window->line1, 12)) {
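		/* After "%r", register names are one character (%r8) or
		 * two (%rax, %r10); handle both cases below. */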
		if (strlen(window->line1) < 14)
			return 0;

		r1a = window->line1[12];
		r1b = window->line1[13];

		if (r1b == '\n') {
			(void) snprintf(buf, sizeof(buf), "\tincq %%r%c\n",
			    r1a);
		} else {
			(void) snprintf(buf, sizeof(buf), "\tincq %%r%c%c\n",
			    r1a, r1b);
		}

		free(window->line1);
		window->line1 = xstrdup(buf);

		return 1;
	}

	return 0;
}

static int
incl(struct peephole *window)
{
	char buf[32], e1a, e1b;

	if (window->line1 == NULL)
		return 0;

	if (!strncmp("\taddl $1, %e", window->line1, 12)) {
		if (strlen(window->line1) != 15)
			return 0;

		e1a = window->line1[12];
		e1b = window->line1[13];

		(void) snprintf(buf, sizeof(buf), "\tincl %%e%c%c\n", e1a,
		    e1b);

		free(window->line1);
		window->line1 = xstrdup(buf);

		return 1;
	} else if (!strncmp("\taddl $1, %r", window->line1, 12)) {
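		/* These are the 32-bit forms of the numbered registers,
		 * %r8d..%r15d: one or two digits followed by 'd'. */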
		if (strlen(window->line1) < 14)
			return 0;

		e1a = window->line1[12];
		e1b = window->line1[13];

		if (e1b == 'd') {
			(void) snprintf(buf, sizeof(buf), "\tincl %%r%cd\n",
			    e1a);
		} else {
			(void) snprintf(buf, sizeof(buf), "\tincl %%r%c%cd\n",
			    e1a, e1b);
		}

		free(window->line1);
		window->line1 = xstrdup(buf);

		return 1;
	}

	return 0;
}

And that's it. It's nearly a copy and paste of the xorq and xorl functions.
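
To see the rewrite in action: a line of QBE-generated assembly such as

	addq $1, %rbx

leaves O as

	incq %rbx

and likewise for the incl cases.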

Doing it all over again for dec

Now let's see if things are the same for sub $1 to dec. Here is our slightly changed sample program:

#include <stdio.h>

int
main(void)
{
	for (int i = 1000; i > 0; i--)
		printf("%d\n", i);

	return 0;
}

I modified our shell script slightly. Here is the output:

clang -O0:
subs	decs
----	----
56	0
Sub archs: nocona core2 penryn bonnell atom silvermont slm goldmont goldmont-plus tremont nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i haswell core-avx2 broadwell skylake skylake-avx512 skx cascadelake cooperlake cannonlake icelake-client rocketlake icelake-server tigerlake sapphirerapids alderlake knl knm k8 athlon64 athlon-fx opteron k8-sse3 athlon64-sse3 opteron-sse3 amdfam10 barcelona btver1 btver2 bdver1 bdver2 bdver3 bdver4 znver1 znver2 znver3 x86-64 x86-64-v2 x86-64-v3 x86-64-v4

clang -O1:
subs	decs
----	----
8	48
Sub archs: silvermont slm goldmont goldmont-plus tremont knl knm x86-64

clang -O2:
subs	decs
----	----
8	48
Sub archs: silvermont slm goldmont goldmont-plus tremont knl knm x86-64

clang -O3:
subs	decs
----	----
8	48
Sub archs: silvermont slm goldmont goldmont-plus tremont knl knm x86-64

clang -Os:
subs	decs
----	----
0	56
Sub archs:

clang -Oz:
subs	decs
----	----
56	0
Sub archs: nocona core2 penryn bonnell atom silvermont slm goldmont goldmont-plus tremont nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i haswell core-avx2 broadwell skylake skylake-avx512 skx cascadelake cooperlake cannonlake icelake-client rocketlake icelake-server tigerlake sapphirerapids alderlake knl knm k8 athlon64 athlon-fx opteron k8-sse3 athlon64-sse3 opteron-sse3 amdfam10 barcelona btver1 btver2 bdver1 bdver2 bdver3 bdver4 znver1 znver2 znver3 x86-64 x86-64-v2 x86-64-v3 x86-64-v4

clang -Og:
subs	decs
----	----
8	48
Sub archs: silvermont slm goldmont goldmont-plus tremont knl knm x86-64

clang -Ofast:
subs	decs
----	----
8	48
Sub archs: silvermont slm goldmont goldmont-plus tremont knl knm x86-64

gcc -O0:
subs	decs
----	----
23	39
Sub archs: nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm x86-64 x86-64-v2 x86-64-v3 x86-64-v4

gcc -O1:
subs	decs
----	----
23	39
Sub archs: nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm x86-64 x86-64-v2 x86-64-v3 x86-64-v4

gcc -O2:
subs	decs
----	----
23	39
Sub archs: nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm x86-64 x86-64-v2 x86-64-v3 x86-64-v4

gcc -O3:
subs	decs
----	----
23	39
Sub archs: nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm x86-64 x86-64-v2 x86-64-v3 x86-64-v4

gcc -Os:
subs	decs
----	----
0	62
Sub archs:

gcc -Oz:
subs	decs
----	----
0	62
Sub archs:

gcc -Og:
subs	decs
----	----
23	39
Sub archs: nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm x86-64 x86-64-v2 x86-64-v3 x86-64-v4

gcc -Ofast:
subs	decs
----	----
23	39
Sub archs: nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm x86-64 x86-64-v2 x86-64-v3 x86-64-v4

Almost everything here appears exactly the same. Wildly, clang -Oz never uses dec despite the fact that clang -Os always uses dec. This is because at -Oz LLVM actually places a jb immediately after the sub $1 to save a cmp, and since dec does not set the carry flag that jb tests, the sub $1 must stay! Cool.
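
The shape of that -Oz code is roughly the following (a sketch, not clang's verbatim output; the label name is illustrative):

	subl	$1, %ebx	# unlike decl, subl sets the carry flag
	jb	.Ldone		# branch on the borrow directly: no cmp needed

Since decl leaves the carry flag untouched, rewriting the subl would break the jb.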

OK, so that probably means all the same inferences, and the same course of action, apply as for inc. Let's teach O about dec as well:

static int
decq(struct peephole *window)
{
	char buf[32], r1a, r1b;

	if (window->line1 == NULL)
		return 0;

	if (!strncmp("\tsubq $1, %r", window->line1, 12)) {
		if (strlen(window->line1) < 14)
			return 0;

		r1a = window->line1[12];
		r1b = window->line1[13];

		if (r1b == '\n') {
			(void) snprintf(buf, sizeof(buf), "\tdecq %%r%c\n",
			    r1a);
		} else {
			(void) snprintf(buf, sizeof(buf), "\tdecq %%r%c%c\n",
			    r1a, r1b);
		}

		free(window->line1);
		window->line1 = xstrdup(buf);

		return 1;
	}

	return 0;
}

static int
decl(struct peephole *window)
{
	char buf[32], e1a, e1b;

	if (window->line1 == NULL)
		return 0;

	if (!strncmp("\tsubl $1, %e", window->line1, 12)) {
		if (strlen(window->line1) != 15)
			return 0;

		e1a = window->line1[12];
		e1b = window->line1[13];

		(void) snprintf(buf, sizeof(buf), "\tdecl %%e%c%c\n", e1a,
		    e1b);

		free(window->line1);
		window->line1 = xstrdup(buf);

		return 1;
	} else if (!strncmp("\tsubl $1, %r", window->line1, 12)) {
		if (strlen(window->line1) < 14)
			return 0;

		e1a = window->line1[12];
		e1b = window->line1[13];

		if (e1b == 'd') {
			(void) snprintf(buf, sizeof(buf), "\tdecl %%r%cd\n",
			    e1a);
		} else {
			(void) snprintf(buf, sizeof(buf), "\tdecl %%r%c%cd\n",
			    e1a, e1b);
		}

		free(window->line1);
		window->line1 = xstrdup(buf);

		return 1;
	}

	return 0;
}

Lastly, we need to hook these new optimizations up. Each function returns 1 when it rewrites a line, so at most one optimization fires per call:

static void
one(struct peephole *window)
{

	if (xorq(window))
		return;

	if (xorl(window))
		return;

	if (incq(window))
		return;

	if (incl(window))
		return;

	if (decq(window))
		return;

	(void) decl(window);
}

Measuring improvements

Let's measure the size improvements before and after, as measured by the size of cproc itself. I don't anticipate the differences being as large as with the mov $0 to xor optimization, as the size difference between add $1 and inc, and between sub $1 and dec, is only one byte: from three bytes to two for the 32-bit register forms.
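
Concretely, here are the encodings for the 32-bit register forms (the 64-bit forms are each one byte longer because of the REX.W prefix):

	83 c3 01	addl	$1, %ebx	# 3 bytes
	ff c3		incl	%ebx		# 2 bytes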

Here is the before:

/home/brian $ size /usr/local/bin/cproc
text    data    bss     dec     hex
16008   2678    584     19270   4b46
/home/brian $ size /usr/local/bin/cproc-qbe
text    data    bss     dec     hex
97270   20240   964     118474  1ceca

And here is the after:

/home/brian $ size /usr/local/bin/cproc
text    data    bss     dec     hex
15980   2678    584     19242   4b2a
/home/brian $ size /usr/local/bin/cproc-qbe
text    data    bss     dec     hex
97114   20240   964     118318  1ce2e

For the cproc executable, that's a reduction of 28 bytes, or a 0.175% size reduction. For the cproc-qbe executable, we have a reduction of 156 bytes, or a 0.160% size reduction.

This may be one of those optimizations where you always "win" if size is your concern, but where, if performance is your concern, it becomes a lot more difficult to discern whether you've "won." For me, I want O to prefer size, since that is easier for me to control; besides, the CPU in my machine happens to identify as a Haswell, one of the CPUs for which both LLVM and GCC select inc. Your preferences may differ from mine.

Conclusion

Studying how LLVM and GCC selected for these two optimizations led to some interesting surprises. We learned that, at least for these two optimizations, LLVM and GCC are likely prioritizing different things. Even though I chose to follow the LLVM approach for O, it would be equally valid to follow the GCC approach. We also learned that LLVM is unable or unwilling (or both) to apply these optimizations at -O0, whereas GCC happily applies them at all levels. Finally, we learned that LLVM and GCC agree that these optimizations are universal wins when optimizing for size and nearly always select them when using -Os and -Oz, the one exception being clang -Oz, which keeps sub $1 so that it can branch on the carry flag and drop a cmp, an even bigger size win.

I think QBE is probably better with these two optimizations, even though they are not very big size reductions. If QBE is going to be doing liveness tracking for the FLAGS register for the mov $0 to xor case, then these two can easily come along for the ride. I think cproc is definitely better with these optimizations: as a small C compiler, it does not have the ability to run all the optimizations that LLVM and GCC can, so anything that reduces the compiled size of the executables and libraries it builds is valuable.

I hope you enjoyed this further look into compiler optimizations. Perhaps we will continue to explore more such compiler optimizations.
