Dr. Brian Robert Callahan
academic, developer, with an eye towards a brighter techno-social life
Today, I wanted to see if I could out-compile my C compiler. I added a hand-compiled assembly version of echo from our previous coding exercise and added a new make target, make asm, that will assemble it. Let's look at our hand-compiled assembly and compare it to our C compiler and ask whether or not it was worth it.
It's quite small.
.text
.p2align 2
.globl main
.type main,@function
main:
movq %rdi, %r15 # Get argc from %rdi, put it in %r15
movq %rsi, %r14 # Get argv from %rsi, put it in %r14
loop:
decl %r15d # 47: for (i = 0; i < argc; i++) {
jz done # Rewritten as: while (--argc) {
addq $8, %r14 # ++argv
movl $4, %eax # Set up write(2)
movl $1, %edi # First parameter is 1
movq (%r14), %rsi # Second parameter is *argv
leaq -1(%rsi), %rdx # Get *argv[0]
strlen: # Note: strlen has been inlined
cmpb $0, 1(%rdx) # 36: while (*t != '\0')
leaq 1(%rdx), %rdx # 37: t++;
jne strlen
subq %rsi, %rdx # 39: return t - s;
syscall # 48: write(1, *argv, %rdx);
cmpl $1, %r15d # 49: if (i + 1 != argc)
je done # Rewritten as: if (argc != 1)
movl $4, %eax # Set up write(2)
movl $1, %edi # First parameter is 1
movq %r14, %rsi # Set %rsi back to beginning of *argv
movb $32, (%rsi) # Second parameter is " "
movl $1, %edx # Third parameter is 1
syscall # 50: write(1, " ", 1);
jmp loop
done:
movl $4, %eax # Set up write(2)
movl $1, %edi # First parameter is 1
movq %r14, %rsi # Set %rsi back to beginning of *argv
movb $10, (%rsi) # Second parameter is "\n"
movl $1, %edx # Third parameter is 1
syscall # 52: write(1, "\n", 1);
xorl %eax, %eax # Return value is 0
retq # 54: return 0;
.size main,.-main
The whole source is annotated. Comments that begin with a number are for easy cross-referencing with our C version of echo.
The primary tricks are a manual inlining of the strlen function and not needing the write and _syscall functions, instead being able to use syscall directly. We could inline strlen in the C code if we wanted, but the compiler is very likely to do that for us anyway. We probably can't avoid the write and _syscall functions in C, as we discussed previously.
There is also some rewriting of loops to avoid the use of variables: the for (i = 0; i < argc; i++) is rewritten as while (--argc) and if (i + 1 != argc) becomes if (argc != 1), which means that we can get rid of the int i variable and increment through argv with ++argv. These changes can of course be applied to the C code as well.
Some other tricks include using *argv as scratch space for our space and newline characters. We don't actually care what the value of *argv is once we've written it out—we only use it the one time. So it becomes free memory space for us to overwrite with whatever we want (as long as things fit; our space and newline characters are a single byte each so we're safe). We could do this one in C too if we wanted.
If we made these changes, our new main function in C might look like this.
int
main(int argc, char *argv[])
{
while (--argc) {
++argv;
write(1, *argv, strlen(*argv));
if (argc != 1) {
*argv[0] = ' ';
write(1, *argv, 1);
}
}
*argv[0] = '\n';
write(1, *argv, 1);
return 0;
}
If we let our C compiler (clang 10.0.0) compile this to assembly, we get the following.
.text
.file "echo.c"
.globl main # -- Begin function main
.type main,@function
main: # @main
# %bb.0:
pushq %r15
pushq %r14
pushq %r12
movq %rsi, %r14
movl %edi, %r15d
pushq $1
popq %r12
.LBB0_1: # =>This Loop Header: Depth=1
# Child Loop BB0_3 Depth 2
decl %r15d
je .LBB0_6
# %bb.2: # in Loop: Header=BB0_1 Depth=1
movq 8(%r14), %rdi
addq $8, %r14
leaq -1(%rdi), %rsi
.LBB0_3: # Parent Loop BB0_1 Depth=1
# => This Inner Loop Header: Depth=2
cmpb $0, 1(%rsi)
leaq 1(%rsi), %rsi
jne .LBB0_3
# %bb.4: # in Loop: Header=BB0_1 Depth=1
subq %rdi, %rsi
callq write
cmpl $1, %r15d
je .LBB0_1
# %bb.5: # in Loop: Header=BB0_1 Depth=1
movq (%r14), %rax
movb $32, (%rax)
movq (%r14), %rdi
movq %r12, %rsi
callq write
jmp .LBB0_1
.LBB0_6:
movq (%r14), %rax
movb $10, (%rax)
movq (%r14), %rdi
pushq $1
popq %rsi
callq write
xorl %eax, %eax
popq %r12
popq %r14
popq %r15
retq
.Lfunc_end0:
.size main, .Lfunc_end0-main
# -- End function
.type write,@function # -- Begin function write
write: # @write
# %bb.0:
movq %rsi, %rcx
movq %rdi, %rdx
pushq $4
popq %rdi
pushq $1
popq %rsi
xorl %r8d, %r8d
xorl %r9d, %r9d
jmp _syscall # TAILCALL
.Lfunc_end1:
.size write, .Lfunc_end1-write
# -- End function
.section ".note.GNU-stack","",@progbits
.addrsig
It seems that both clang and I found very similar ways to optimize the program. According to size, I did a little bit better than clang: my hand-compiled assembly came out to a 104-byte object file whereas clang generated a very heavy 125-byte object file. When factoring in the necessary glue code (_start.s, crt,s, and _syscall.s for the C version), my hand-compiled assembly ended up 152 bytes whereas clang was a distant 197 bytes. But clang had the last laugh: running ls -l on the final binaries resulted in 848 bytes for the hand-compiled assembly and 840 bytes for clang. So there is clearly something else going on under the hood that I am losing out on with my hand-compiled assembly. Perhaps you can do better than me.
As will come as little surprise, a production-ready compiler with large teams of developers and larger pools of money is smarter than I am. But I think I did a decent job at it. I believe that learning assembly for several different processors is worthwhile but I think I will stick to higher level languages for most of my work (unless you're paying me to do otherwise...). Because there is one cost we haven't mentioned yet: time. It took me all of 30 seconds, if that, to optimize the C code whereas it took maybe half an hour to write the assembly version.