Dr. Brian Robert Callahan
academic, developer, with an eye towards a brighter techno-social life
Continuing on from our tiny snake game, let's explore more into how binaries work on Unix. Let's figure out how our programs come to obtain their argc and argv parameters.
You can grab the finished code for this write-up here on GitHub.
One thing we exploited with our snake game was the fact that it took no options. That was an easy way to keep the binary size down. If the user did happen to write things on the command line after the command invocation, our game would simply ignore it. Such behavior can be explained in the code that invokes the main function.
main(void)
Our main function takes no arguments. But that's not the norm for Unix programs. It is more common that we see our main function take two parameters.
main(int argc, char *argv[])
That is, an integer telling us the number of arguments and a double pointer that gives us those arguments. But how does our program get those arguments? If we type them out on the command line, what is the mechanism that passes things to our program? Let's write a program that takes command line options. It doesn't have to do anything special. Just reading them in and printing them out would work for us. This is effectively the echo(1) utility, which we will recreate, though we will leave out the -n flag.
Let's copy over our crt.s file from SnakeQR. Let's also split it up into three files: the .note.openbsd.ident section will remain in crt.s, _start will go into a new _start.s file and _syscall will go into a new _syscall.s file. Let's write a simple Makefile too while we're here. We will put our C code into an echo.c file.
We are also going to improve our _start function. As we remember from SnakeQR, the equivalent C for our _start function is
void
_start(void)
{
	main();
	_exit(0);
}
But we Unix people know that main returns an int and can return different values. And we can use those values in other places (like shell scripts). Our current _start function doesn't cut it. We need something more like
void
_start(void)
{
	int argc;
	char **argv;

	_exit(main(argc, argv));
}
In assembly, that looks something like
	.text
	.p2align 2
	.globl _start
_start:
	callq main
	movl %eax, %edi
	movl $1, %eax
	syscall
	.size _start,.-_start
This is really similar to our previous version of _start, but this time instead of clearing %edi we take the value of %eax after main returns and put it in %edi, which is the parameter for _exit. Why %eax? Because a function's integer return value is placed in %eax. We have to remember to move the return value to %edi right away because we need %eax again to provide the syscall number, as we learned from SnakeQR. You'll notice that we don't handle argc or argv, mostly because we don't know yet how to handle them. And maybe we don't have to. With this, we can write a very simple main function that returns the value of argc.
Maybe it really is this easy; let's write that simple main function.
int
main(int argc, char *argv[])
{
	return argc;
}
If we run ./echo one two three we should get a return value of 4. Remember that the program name ends up counting in argc and lives in argv[0]. Run it, issue echo $? after and... 0. So no, it wasn't that easy. It was worth a shot.
But we know that argc and argv have to come from somewhere and if it's somewhere, chances are good it will show up in gdb. I'll be using egdb (aka gdb from ports) since that's newer than the gdb in base. We can install it with a simple doas pkg_add gdb.
If you've never used gdb before, there are a lot of things you can do with it. But let's keep things simple. We can load our program into gdb like so
$ egdb --args ./echo
This will allow us to put any number of command line options, though we're not using any right now. Our argc should be 1 so let's see if we can find a 1 somewhere in gdb.
The easiest thing to do in gdb is run the program. At the (gdb) prompt, type r and press Enter.
(gdb) r
Starting program: /home/brian/echo/echo /home/brian/echo/echo
[Inferior 1 (process 38559) exited normally]
That wasn't very interesting. It ran and exited without telling us anything.
In order to do anything meaningful we should insert breakpoints. This tells gdb to pause execution at that memory location. All our functions can be used as locations to break. We can also break at offsets from our functions, but we don't need that functionality for today.
We only have two functions so far: _start and main. My hunch tells me that because _start is our entry point, if our hosted environment is giving us argc and argv, it will definitely be findable in _start, whereas we might lose it by the time we get to main. Let's set a breakpoint at _start.
(gdb) b _start
Breakpoint 1 at 0x201170
OK. We're ready to go. Let's run the program now.
(gdb) r
Starting program: /home/brian/echo/echo

Breakpoint 1, 0x0000000000201170 in _start ()
We are effectively at the point where our hosted environment hands off execution to the program. Now we can explore our program. Let's look at the registers with the info reg command.
(gdb) info reg
rax            0x0                 0
rbx            0x0                 0
rcx            0x0                 0
rdx            0x0                 0
rsi            0x0                 0
rdi            0x0                 0
rbp            0x0                 0x0
rsp            0x7f7ffffbe6e0      0x7f7ffffbe6e0
r8             0x0                 0
r9             0x0                 0
r10            0x0                 0
r11            0x0                 0
r12            0x0                 0
r13            0x0                 0
r14            0x0                 0
r15            0x0                 0
rip            0x201170            0x201170 <_start>
eflags         0x202               [ IF ]
cs             0x2b                43
ss             0x23                35
ds             0x23                35
es             0x23                35
fs             0x23                35
gs             0x23                35
We know that argc is going to be 1, so we can ignore any register that holds 0. The %rip register holds our instruction pointer, and gdb helpfully tells us that memory location corresponds to the _start function. The eflags register is our status register, which contains information like the carry bit and the zero bit. We can't use that directly. And cs through gs are segment registers, which we are not going to look at today. That leaves us with %rsp as our candidate. It certainly holds something. It happens to be a location on the stack (%rsp = stack pointer). We can examine it with the x/x $rsp command.
The x command examines memory. We will use it in the form x/<length><format>, where length is the number of format objects we want (if you leave it out, you get one) and format is our format specifier. An x for the format specifier means hexadecimal but there are others too, like f for floating point, i for instruction, and s for string.
(gdb) x/x $rsp
0x7f7ffffbe6e0: 0x00000001
Hey! That's a 1! Let's quit out of gdb and rerun it as egdb --args ./echo one and see if we get a 2 next time.
(gdb) x/x $rsp
0x7f7ffffcb900: 0x00000002
We do!
I think we've discovered that our hosted environment puts argc and argv on the stack and then jumps to _start so that _start also has access to argc and argv. Let's think back to our calling convention: the first argument goes in %rdi. We should be able to pop the top value off the stack and put it in %rdi and then our main function will return argc. Let's add this to _start.
	.text
	.p2align 2
	.globl _start
_start:
	popq %rdi
	callq main
	movl %eax, %edi
	movl $1, %eax
	syscall
	.size _start,.-_start
All we did was add popq %rdi before the main call. Let's recompile and see what happens.
/home/brian/echo $ ./echo
/home/brian/echo $ echo $?
1
/home/brian/echo $ ./echo one
/home/brian/echo $ echo $?
2
/home/brian/echo $ ./echo one two
/home/brian/echo $ echo $?
3
I think we can say we are successfully passing argc from our hosted environment to our program.
Now we need to find and pass argv. Maybe it's as simple as this: now that argc has been popped off, the current top of the stack is argv. Remember that %rsi is the second parameter according to our calling convention. If it's really that easy, all we will need to do is add movq %rsp, %rsi after the popq %rdi we just added.
	.text
	.p2align 2
	.globl _start
_start:
	popq %rdi
	movq %rsp, %rsi
	callq main
	movl %eax, %edi
	movl $1, %eax
	syscall
	.size _start,.-_start
We should also write a proper echo function in C while we are here. The usual approach is to skip argv[0] and then, for each argument from argv[1] to argv[argc - 1], print the argument and, if there is a next argument, also print a space. Once we run out of arguments, print a newline. In C, this looks like
int
main(int argc, char *argv[])
{
	int i;

	for (i = 1; i < argc; i++) {
		write(1, argv[i], strlen(argv[i]));
		if (i + 1 != argc)
			write(1, " ", 1);
	}
	write(1, "\n", 1);

	return 0;
}
We don't actually know how long each argument will be, so we need some way to calculate that since our write function requires us to pass the number of characters to write as an argument. The strlen(3) function does that for us. But remember since we're building everything ourselves, let's take a moment to think about what strlen does and how we can recreate it.
The strlen function reads in a string and outputs the number of characters in that string. We can set up a second pointer that points to the first character in the string. Since we know strings in C must end with a NUL byte, we can check whether our second pointer points at the NUL byte and, if not, move it to the next character in the string. When we hit the NUL byte, subtracting the original string location from the second pointer gives us the length of the string. The length can never be negative, so we return an unsigned long, which stands in for the size_t that the manual page says strlen returns.
static unsigned long
strlen(const char *s)
{
	char *t;

	t = (char *) s;
	while (*t != '\0')
		t++;

	return t - s;
}
Our completed echo.c file looks like this
extern void *_syscall(void *n, void *a, void *b, void *c, void *d, void *e);

static void
write(int d, const void *buf, unsigned long nbytes)
{
	_syscall((void *) 4, (void *) d, (void *) buf, (void *) nbytes, (void *) 0, (void *) 0);
}

static unsigned long
strlen(const char *s)
{
	char *t;

	t = (char *) s;
	while (*t != '\0')
		t++;

	return t - s;
}

int
main(int argc, char *argv[])
{
	int i;

	for (i = 1; i < argc; i++) {
		write(1, argv[i], strlen(argv[i]));
		if (i + 1 != argc)
			write(1, " ", 1);
	}
	write(1, "\n", 1);

	return 0;
}
Not too bad. Let's recompile everything and see what happens.
/home/brian/echo $ ./echo

/home/brian/echo $ ./echo one
one
/home/brian/echo $ ./echo one two
one two
/home/brian/echo $ ./echo one two three
one two three
/home/brian/echo $ ./echo one two three four
one two three four
/home/brian/echo $ ./echo one two three four five
one two three four five
/home/brian/echo $ ./echo one two three four five six
one two three four five six
/home/brian/echo $ ./echo one two three four five six seven
one two three four five six seven
/home/brian/echo $ ./echo one two three four five six seven eight
one two three four five six seven eight
/home/brian/echo $ ./echo one two three four five six seven eight nine
one two three four five six seven eight nine
/home/brian/echo $ ./echo one two three four five six seven eight nine ten
one two three four five six seven eight nine ten
It really was that easy for once. The thing at the top of the stack after popping argc off the stack is argv. We have successfully passed argc and argv to our program and recreated echo(1)!
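There is a third argument to main, which we often don't see. It is envp, the environment pointer. The environment array begins right after the NULL pointer that terminates argv, so once argc is in %rdi its address is 8(%rsp, %rdi, 8). Since envp is a char ** just like argv, we want that address itself in %rdx rather than the value stored there, which in assembly is leaq 8(%rsp, %rdi, 8), %rdx. That gives us a final _start.s of
	.text
	.p2align 2
	.globl _start
_start:
	popq %rdi
	movq %rsp, %rsi
	leaq 8(%rsp, %rdi, 8), %rdx
	callq main
	movl %eax, %edi
	movl $1, %eax
	syscall
	.size _start,.-_start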
You need not worry about envp for our echo program, and can in fact leave that line out. But it is included here for completeness, and you may want it in the future.
I hope you learned something interesting about how argc and argv are passed from a hosted environment to our Unix programs.