Dr. Brian Robert Callahan
academic, developer, with an eye towards a brighter techno-social life
Today, I wanted to see if I could out-compile my C compiler. I added a hand-compiled assembly version of echo from our previous coding exercise and added a new make target, make asm
, that will assemble it. Let's look at our hand-compiled assembly and compare it to our C compiler and ask whether or not it was worth it.
It's quite small.
.text .p2align 2 .globl main .type main,@function main: movq %rdi, %r15 # Get argc from %rdi, put it in %r15 movq %rsi, %r14 # Get argv from %rsi, put it in %r14 loop: decl %r15d # 47: for (i = 0; i < argc; i++) { jz done # Rewritten as: while (--argc) { addq $8, %r14 # ++argv movl $4, %eax # Set up write(2) movl $1, %edi # First parameter is 1 movq (%r14), %rsi # Second parameter is *argv leaq -1(%rsi), %rdx # Get *argv[0] strlen: # Note: strlen has been inlined cmpb $0, 1(%rdx) # 36: while (*t != '\0') leaq 1(%rdx), %rdx # 37: t++; jne strlen subq %rsi, %rdx # 39: return t - s; syscall # 48: write(1, *argv, %rdx); cmpl $1, %r15d # 49: if (i + 1 != argc) je done # Rewritten as: if (argc != 1) movl $4, %eax # Set up write(2) movl $1, %edi # First parameter is 1 movq %r14, %rsi # Set %rsi back to beginning of *argv movb $32, (%rsi) # Second parameter is " " movl $1, %edx # Third parameter is 1 syscall # 50: write(1, " ", 1); jmp loop done: movl $4, %eax # Set up write(2) movl $1, %edi # First parameter is 1 movq %r14, %rsi # Set %rsi back to beginning of *argv movb $10, (%rsi) # Second parameter is "\n" movl $1, %edx # Third parameter is 1 syscall # 52: write(1, "\n", 1); xorl %eax, %eax # Return value is 0 retq # 54: return 0; .size main,.-main
The whole source is annotated. Comments that begin with a number are for easy cross-referencing with our C version of echo.
The primary tricks are a manual inlining of the strlen function and not needing the write and _syscall functions, instead being able to use syscall directly. We could inline strlen in the C code if we wanted, but the compiler is very likely to do that for us anyway. We probably can't avoid the write and _syscall functions in C, as we discussed previously.
There is also some rewriting of loops to avoid the use of variables: the for (i = 0; i < argc; i++)
is rewritten as while (--argc)
and if (i + 1 != argc)
becomes if (argc != 1)
, which means that we can get rid of the int i
variable and increment through argv with ++argv
. These changes can of course be applied to the C code as well.
Some other tricks include using *argv as scratch space for our space and newline characters. We don't actually care what the value of *argv is once we've written it out—we only use it the one time. So it becomes free memory space for us to overwrite with whatever we want (as long as things fit; our space and newline characters are a single byte each so we're safe). We could do this one in C too if we wanted.
If we made these changes, our new main function in C might look like this.
int main(int argc, char *argv[]) { while (--argc) { ++argv; write(1, *argv, strlen(*argv)); if (argc != 1) { *argv[0] = ' '; write(1, *argv, 1); } } *argv[0] = '\n'; write(1, *argv, 1); return 0; }
If we let our C compiler (clang 10.0.0) compile this to assembly, we get the following.
.text .file "echo.c" .globl main # -- Begin function main .type main,@function main: # @main # %bb.0: pushq %r15 pushq %r14 pushq %r12 movq %rsi, %r14 movl %edi, %r15d pushq $1 popq %r12 .LBB0_1: # =>This Loop Header: Depth=1 # Child Loop BB0_3 Depth 2 decl %r15d je .LBB0_6 # %bb.2: # in Loop: Header=BB0_1 Depth=1 movq 8(%r14), %rdi addq $8, %r14 leaq -1(%rdi), %rsi .LBB0_3: # Parent Loop BB0_1 Depth=1 # => This Inner Loop Header: Depth=2 cmpb $0, 1(%rsi) leaq 1(%rsi), %rsi jne .LBB0_3 # %bb.4: # in Loop: Header=BB0_1 Depth=1 subq %rdi, %rsi callq write cmpl $1, %r15d je .LBB0_1 # %bb.5: # in Loop: Header=BB0_1 Depth=1 movq (%r14), %rax movb $32, (%rax) movq (%r14), %rdi movq %r12, %rsi callq write jmp .LBB0_1 .LBB0_6: movq (%r14), %rax movb $10, (%rax) movq (%r14), %rdi pushq $1 popq %rsi callq write xorl %eax, %eax popq %r12 popq %r14 popq %r15 retq .Lfunc_end0: .size main, .Lfunc_end0-main # -- End function .type write,@function # -- Begin function write write: # @write # %bb.0: movq %rsi, %rcx movq %rdi, %rdx pushq $4 popq %rdi pushq $1 popq %rsi xorl %r8d, %r8d xorl %r9d, %r9d jmp _syscall # TAILCALL .Lfunc_end1: .size write, .Lfunc_end1-write # -- End function .section ".note.GNU-stack","",@progbits .addrsig
It seems that both clang and I found very similar ways to optimize the program. According to size, I did a little bit better than clang: my hand-compiled assembly came out to a 104-byte object file whereas clang generated a very heavy 125-byte object file. When factoring in the necessary glue code (_start.s, crt,s, and _syscall.s for the C version), my hand-compiled assembly ended up 152 bytes whereas clang was a distant 197 bytes. But clang had the last laugh: running ls -l on the final binaries resulted in 848 bytes for the hand-compiled assembly and 840 bytes for clang. So there is clearly something else going on under the hood that I am losing out on with my hand-compiled assembly. Perhaps you can do better than me.
As will come as little surprise, a production-ready compiler with large teams of developers and larger pools of money is smarter than I am. But I think I did a decent job at it. I believe that learning assembly for several different processors is worthwhile but I think I will stick to higher level languages for most of my work (unless you're paying me to do otherwise...). Because there is one cost we haven't mentioned yet: time. It took me all of 30 seconds, if that, to optimize the C code whereas it took maybe half an hour to write the assembly version.