Dr. Brian Robert Callahan

academic, developer, with an eye towards a brighter techno-social life




2020-08-08
Where do argc and argv come from?

Continuing on from our tiny snake game, let's explore more of how binaries work on Unix. Let's figure out how our programs come to obtain their argc and argv parameters.

You can grab the finished code for this write-up here on GitHub.

The past: a no option world

One thing we exploited with our snake game was the fact that it took no options. That was an easy way to keep the binary size down. If the user did happen to write things on the command line after the command invocation, our game would simply ignore it. Such behavior can be explained in the code that invokes the main function.

main(void)

Our main function takes no arguments. But that's not the norm for Unix programs. It is more common that we see our main function take two parameters.

main(int argc, char *argv[])

An integer telling us the number of arguments and a double pointer (a pointer to an array of strings) which gives us those arguments. But how does our program get those arguments? If we type them out on the command line, what is the mechanism that passes them to our program? Let's write a program that takes command line arguments. It doesn't have to do anything special; reading them in and printing them out is enough for us. This effectively recreates the echo(1) utility, though we will leave out the -n flag.

Setting up our environment

Let's copy over our crt.s file from SnakeQR. Let's also split it up into three files: the .note.openbsd.ident section will remain in crt.s, _start will go into a new _start.s file and _syscall will go into a new _syscall.s file. Let's write a simple Makefile too while we're here. We will put our C code into an echo.c file.

We are also going to improve our _start function. As we remember from SnakeQR, the equivalent C for our _start function is

void
_start(void)
{

	main();

	_exit(0);
}

But we Unix people know that main returns an int and can return different values. And we can use those values in other places (like shell scripts). Our current _start function doesn't cut it. We need something more like

void
_start(void)
{
	int argc;
	char **argv;

	_exit(main(argc, argv));
}

In assembly, that looks something like

	.text
	.p2align 2
	.globl	_start
_start:
	callq	main
	movl	%eax, %edi
	movl	$1, %eax
	syscall
	.size	_start,.-_start

Really similar to our previous version of _start, but this time instead of clearing %edi we take the value of %eax after main returns and put it in %edi, which holds the parameter for _exit. Why %eax? Because the return value of a function is put in %eax. We have to remember to move the return value to %edi right away because we need %eax again to provide the syscall number, as we learned from SnakeQR. You'll notice that we don't handle argc or argv, mostly because we don't know yet how to handle them. And maybe we don't have to. With this, we can write a very simple main function that returns the value of argc.

Is it really that easy?

Maybe it really is this easy; let's write that simple main function.

int
main(int argc, char *argv[])
{

	return argc;
}

If we run echo one two three we should get a return value of 4. Remember that the program name ends up counting in argc and lives in argv[0]. Run it, issue echo $? after and... 0. So no, it wasn't that easy. It was worth a shot.

But we know that argc and argv have to come from somewhere and if it's somewhere, chances are good it will show up in gdb. I'll be using egdb (aka gdb from ports) since that's newer than the gdb in base. We can install it with a simple doas pkg_add gdb.

Running our program in gdb

If you've never used gdb before, there are a lot of things you can do with it. But let's keep things simple. We can load our program into gdb like so

$ egdb --args ./echo

This will allow us to put any number of command line options, though we're not using any right now. Our argc should be 1 so let's see if we can find a 1 somewhere in gdb.

The easiest thing to do in gdb is run the program. At the (gdb) prompt, type r and press Enter.

(gdb) r
Starting program: /home/brian/echo/echo
/home/brian/echo/echo
[Inferior 1 (process 38559) exited normally]

That wasn't very interesting. It ran and exited without telling us anything.

Breaking our program

In order to do anything meaningful we should insert breakpoints. This tells gdb to pause execution at that memory location. All our functions can be used as locations to break. We can also break at offsets from our functions, but we don't need that functionality for today.

We only have two functions so far: _start and main. My hunch tells me that because _start is our entry point, if our hosted environment is giving us argc and argv, it will definitely be findable in _start, whereas we might lose it by the time we get to main. Let's set a breakpoint at _start.

(gdb) b _start
Breakpoint 1 at 0x201170

OK. We're ready to go. Let's run the program now.

(gdb) r
Starting program: /home/brian/echo/echo

Breakpoint 1, 0x0000000000201170 in _start ()

We are effectively at the point where our hosted environment hands off execution to the program. Now we can explore our program.

Getting information

We can tell gdb to give us the contents of all our registers with the info reg command.

(gdb) info reg
rax            0x0      0
rbx            0x0      0
rcx            0x0      0
rdx            0x0      0
rsi            0x0      0
rdi            0x0      0
rbp            0x0      0x0
rsp            0x7f7ffffbe6e0   0x7f7ffffbe6e0
r8             0x0      0
r9             0x0      0
r10            0x0      0
r11            0x0      0
r12            0x0      0
r13            0x0      0
r14            0x0      0
r15            0x0      0
rip            0x201170 0x201170 <_start>
eflags         0x202    [ IF ]
cs             0x2b     43
ss             0x23     35
ds             0x23     35
es             0x23     35
fs             0x23     35
gs             0x23     35

We know that argc is going to be 1, so we can ignore any register that holds 0; that's not what we're looking for. The %rip register holds our instruction pointer, and gdb helpfully tells us that memory location corresponds to the _start function. The eflags register is our status register, which contains information like the carry bit and the zero bit. We can't use that directly. And cs through gs are segment registers, which we are not going to look at today.

That leaves us with %rsp as our candidate. It certainly holds something: a location on the stack (%rsp = stack pointer). We can examine the memory at that location with the x command. We will use it in the form x/<length><format>, where length is the number of objects we want displayed (one, if you leave it out) and format is our format specifier. An x for the format specifier means hexadecimal but there are others too, like f for floating point, i for instruction, and s for string. So x/x $rsp shows us one hexadecimal value from the top of the stack.

(gdb) x/x $rsp
0x7f7ffffbe6e0: 0x00000001

Hey! That's a 1! Let's quit out of gdb and rerun it as egdb --args ./echo one and see if we get a 2 next time.

(gdb) x/x $rsp
0x7f7ffffcb900: 0x00000002

We do!

Passing argc from our hosted environment to our program

I think we've discovered that our hosted environment puts argc and argv on the stack and then jumps to _start so that _start also has access to argc and argv. Let's think back to our calling convention: the first argument goes in %rdi. We should be able to pop the top value off the stack and put it in %rdi and then our main function will return argc. Let's add this to _start.

	.text
	.p2align 2
	.globl	_start
_start:
	popq	%rdi
	callq	main
	movl	%eax, %edi
	movl	$1, %eax
	syscall
	.size	_start,.-_start

All we did was add popq %rdi before the main call. Let's recompile and see what happens.

/home/brian/echo $ ./echo        
/home/brian/echo $ echo $?       
1
/home/brian/echo $ ./echo one
/home/brian/echo $ echo $?    
2
/home/brian/echo $ ./echo one two
/home/brian/echo $ echo $?        
3

I think we can say we are successfully passing argc from our hosted environment to our program.

Passing argv to our program, part 1

Now we need to find and pass argv. Maybe it's as simple as this: now that argc has been popped off, the current top of the stack is argv. Remember that %rsi holds the second parameter according to our calling convention. If it's really that easy, all we need to do is add movq %rsp, %rsi after the popq %rdi we just added.

	.text
	.p2align 2
	.globl	_start
_start:
	popq	%rdi
	movq	%rsp, %rsi
	callq	main
	movl	%eax, %edi
	movl	$1, %eax
	syscall
	.size	_start,.-_start

We should also write a proper echo program in C while we are here. The usual approach is to skip argv[0] and then, for each argument from argv[1] to argv[argc - 1], print the argument and, if there is a next argument, also print a space. Once we run out of arguments, print a newline. In C, this looks like

int
main(int argc, char *argv[])
{
	int i;

	for (i = 1; i < argc; i++) {
		write(1, argv[i], strlen(argv[i]));
		if (i + 1 != argc)
			write(1, " ", 1);
	}
	write(1, "\n", 1);

	return 0;
}

strlen

We don't actually know how long each argument will be, so we need some way to calculate that since our write function requires us to pass the number of characters to write as an argument. The strlen(3) function does that for us. But remember since we're building everything ourselves, let's take a moment to think about what strlen does and how we can recreate it.

The strlen function takes a string and returns the number of characters in that string. We can set up a second pointer that points to the first character in the string. Since we know strings in C must end with a NUL byte, we can check whether our second pointer points at the NUL byte and, if not, move it to the next character in the string. When we hit the NUL byte, subtracting the original string location from the second pointer's location will give us the length of the string. The length can never be negative, so we can return an unsigned long, which matches the unsigned size_t that the manual page says strlen returns.

static unsigned long
strlen(const char *s)
{
	char *t;

	t = (char *) s;
	while (*t != '\0')
		t++;

	return t - s;
}

Passing argv to our program, part 2

Our completed echo.c file looks like this

extern void *_syscall(void *n, void *a, void *b, void *c, void *d, void *e);

static void
write(int d, const void *buf, unsigned long nbytes)
{

	_syscall((void *) 4, (void *) d, (void *) buf, (void *) nbytes, (void *) 0, (void *) 0);
}

static unsigned long
strlen(const char *s)
{
	char *t;

	t = (char *) s;
	while (*t != '\0')
		t++;

	return t - s;
}

int
main(int argc, char *argv[])
{
	int i;

	for (i = 1; i < argc; i++) {
		write(1, argv[i], strlen(argv[i]));
		if (i + 1 != argc)
			write(1, " ", 1);
	}
	write(1, "\n", 1);

	return 0;
}

Not too bad. Let's recompile everything and see what happens.

/home/brian/echo $ ./echo

/home/brian/echo $ ./echo one
one
/home/brian/echo $ ./echo one two
one two
/home/brian/echo $ ./echo one two three
one two three
/home/brian/echo $ ./echo one two three four
one two three four
/home/brian/echo $ ./echo one two three four five
one two three four five
/home/brian/echo $ ./echo one two three four five six 
one two three four five six
/home/brian/echo $ ./echo one two three four five six seven
one two three four five six seven
/home/brian/echo $ ./echo one two three four five six seven eight
one two three four five six seven eight
/home/brian/echo $ ./echo one two three four five six seven eight nine
one two three four five six seven eight nine
/home/brian/echo $ ./echo one two three four five six seven eight nine ten
one two three four five six seven eight nine ten

It really was that easy for once. The thing at the top of the stack after popping argc off the stack is argv. We have successfully passed argc and argv to our program and recreated echo(1)!

envp

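There is a third argument to main, which we often don't see. It is envp, the environment pointer. The environment array sits on the stack right after argv's terminating NULL pointer, and its address goes in %rdx, the third parameter according to our calling convention. Because main expects the address of that array rather than its first entry, we compute the address with leaq instead of loading from it with movq: leaq 8(%rsp, %rdi, 8), %rdx. That gives us a final _start.s of

	.text
	.p2align 2
	.globl	_start
_start:
	popq	%rdi
	movq	%rsp, %rsi
	leaq	8(%rsp, %rdi, 8), %rdx
	callq	main
	movl	%eax, %edi
	movl	$1, %eax
	syscall
	.size	_start,.-_start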

You need not worry about envp for our echo program, and can in fact leave that line out. But it is included here for completeness, and you may want it in the future.

Wrapping up

I hope you learned something interesting about how argc and argv are passed from a hosted environment to our Unix programs.
