Dr. Brian Robert Callahan

academic, developer, with an eye towards a brighter techno-social life



[prev]
[next]

2020-08-12
Can we do better than our C compiler?

Today, I wanted to see if I could out-compile my C compiler. I added a hand-compiled assembly version of echo from our previous coding exercise and added a new make target, make asm, that will assemble it. Let's look at our hand-compiled assembly and compare it to our C compiler and ask whether or not it was worth it.

My hand-compiled assembly

It's quite small.

	.text
	.p2align 2
	.globl	main
	.type	main,@function
main:
	movq	%rdi, %r15	# Get argc from %rdi, put it in %r15
	movq	%rsi, %r14	# Get argv from %rsi, put it in %r14
loop:
	decl	%r15d		# 47: for (i = 0; i < argc; i++) {
	jz	done		# Rewritten as: while (--argc) {
	addq	$8, %r14	# ++argv
	movl	$4, %eax	# Set up write(2)
	movl	$1, %edi	# First parameter is 1
	movq	(%r14), %rsi	# Second parameter is *argv
	leaq	-1(%rsi), %rdx	# Get *argv[0]
strlen:				# Note: strlen has been inlined
	cmpb	$0, 1(%rdx)	# 36: while (*t != '\0')
	leaq	1(%rdx), %rdx	# 37: t++;
	jne	strlen
	subq	%rsi, %rdx	# 39: return t - s;
	syscall			# 48: write(1, *argv, %rdx);
	cmpl	$1, %r15d	# 49: if (i + 1 != argc)
	je	done		# Rewritten as: if (argc != 1)
	movl	$4, %eax	# Set up write(2)
	movl	$1, %edi	# First parameter is 1
	movq	%r14, %rsi	# Set %rsi back to beginning of *argv
	movb	$32, (%rsi)	# Second parameter is " "
	movl	$1, %edx	# Third parameter is 1
	syscall			# 50: write(1, " ", 1);
	jmp	loop
done:
	movl	$4, %eax	# Set up write(2)
	movl	$1, %edi	# First parameter is 1
	movq	%r14, %rsi	# Set %rsi back to beginning of *argv
	movb	$10, (%rsi)	# Second parameter is "\n"
	movl	$1, %edx	# Third parameter is 1
	syscall			# 52: write(1, "\n", 1);
	xorl	%eax, %eax	# Return value is 0
	retq			# 54: return 0;
	.size	main,.-main

The whole source is annotated. Comments that begin with a number are for easy cross-referencing with our C version of echo.

Better optimization for the C code

The primary tricks are a manual inlining of the strlen function and not needing the write and _syscall functions, instead being able to use syscall directly. We could inline strlen in the C code if we wanted, but the compiler is very likely to do that for us anyway. We probably can't avoid the write and _syscall functions in C, as we discussed previously.

There is also some rewriting of loops to avoid the use of variables: the for (i = 0; i < argc; i++) is rewritten as while (--argc) and if (i + 1 != argc) becomes if (argc != 1), which means that we can get rid of the int i variable and increment through argv with ++argv. These changes can of course be applied to the C code as well.

Some other tricks include using *argv as scratch space for our space and newline characters. We don't actually care what the value of *argv is once we've written it out—we only use it the one time. So it becomes free memory space for us to overwrite with whatever we want (as long as things fit; our space and newline characters are a single byte each so we're safe). We could do this one in C too if we wanted.

If we made these changes, our new main function in C might look like this.

int
main(int argc, char *argv[])
{

	while (--argc) {
		++argv;
		write(1, *argv, strlen(*argv));
		if (argc != 1) {
			*argv[0] = ' ';
			write(1, *argv, 1);
		}
	}
	*argv[0] = '\n';
	write(1, *argv, 1);

	return 0;
}

Even better assembly?

If we let our C compiler (clang 10.0.0) compile this to assembly, we get the following.

	.text
	.file	"echo.c"
	.globl	main                    # -- Begin function main
	.type	main,@function
main:                                   # @main
# %bb.0:
	pushq	%r15
	pushq	%r14
	pushq	%r12
	movq	%rsi, %r14
	movl	%edi, %r15d
	pushq	$1
	popq	%r12
.LBB0_1:                                # =>This Loop Header: Depth=1
                                        #     Child Loop BB0_3 Depth 2
	decl	%r15d
	je	.LBB0_6
# %bb.2:                                #   in Loop: Header=BB0_1 Depth=1
	movq	8(%r14), %rdi
	addq	$8, %r14
	leaq	-1(%rdi), %rsi
.LBB0_3:                                #   Parent Loop BB0_1 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
	cmpb	$0, 1(%rsi)
	leaq	1(%rsi), %rsi
	jne	.LBB0_3
# %bb.4:                                #   in Loop: Header=BB0_1 Depth=1
	subq	%rdi, %rsi
	callq	write
	cmpl	$1, %r15d
	je	.LBB0_1
# %bb.5:                                #   in Loop: Header=BB0_1 Depth=1
	movq	(%r14), %rax
	movb	$32, (%rax)
	movq	(%r14), %rdi
	movq	%r12, %rsi
	callq	write
	jmp	.LBB0_1
.LBB0_6:
	movq	(%r14), %rax
	movb	$10, (%rax)
	movq	(%r14), %rdi
	pushq	$1
	popq	%rsi
	callq	write
	xorl	%eax, %eax
	popq	%r12
	popq	%r14
	popq	%r15
	retq
.Lfunc_end0:
	.size	main, .Lfunc_end0-main
                                        # -- End function
	.type	write,@function         # -- Begin function write
write:                                  # @write
# %bb.0:
	movq	%rsi, %rcx
	movq	%rdi, %rdx
	pushq	$4
	popq	%rdi
	pushq	$1
	popq	%rsi
	xorl	%r8d, %r8d
	xorl	%r9d, %r9d
	jmp	_syscall                # TAILCALL
.Lfunc_end1:
	.size	write, .Lfunc_end1-write
                                        # -- End function
	.section	".note.GNU-stack","",@progbits
	.addrsig

It seems that both clang and I found very similar ways to optimize the program. According to size, I did a little bit better than clang: my hand-compiled assembly came out to a 104-byte object file whereas clang generated a very heavy 125-byte object file. When factoring in the necessary glue code (_start.s, crt,s, and _syscall.s for the C version), my hand-compiled assembly ended up 152 bytes whereas clang was a distant 197 bytes. But clang had the last laugh: running ls -l on the final binaries resulted in 848 bytes for the hand-compiled assembly and 840 bytes for clang. So there is clearly something else going on under the hood that I am losing out on with my hand-compiled assembly. Perhaps you can do better than me.

Smarter than me

As will come as little surprise, a production-ready compiler with large teams of developers and larger pools of money is smarter than I am. But I think I did a decent job at it. I believe that learning assembly for several different processors is worthwhile but I think I will stick to higher level languages for most of my work (unless you're paying me to do otherwise...). Because there is one cost we haven't mentioned yet: time. It took me all of 30 seconds, if that, to optimize the C code whereas it took maybe half an hour to write the assembly version.

Conclusion

I enjoy assembly. I teach courses that require a deep knowledge of assembly. But I think I'm going to stick to optimizing higher level code rather than cranking out assembly by hand. Or maybe not. Even our original C code was honestly good enough. I'm glad to live in a world where we have a choice of smart compilers.

Top

RSS