Dr. Brian Robert Callahan

academic, developer, with an eye towards a brighter techno-social life




2026-03-22
Semi-retirement, or, really, changing my relationship with the BSDs

January 7, 2013, at around 4:30 PM New York time. My first commit to OpenBSD. 2,404 commits later, I remain as committed to making OpenBSD the best operating system it can be as I was at my first commit. That desire has stayed true even as life has made it impossible to commit at the breakneck pace I did when I was in my mid-20s and only had to worry about grad school.

Now I am a professor, and have been for quite some time. I have my own teams of undergraduate and graduate students who are looking for research opportunities to bolster their resumes and help them achieve their personal and professional goals. And at least part of the time, I have been able to turn to the BSDs for ideas and projects for my students. It is the win-win that I did not know I would have access to all those years ago.

I want to spend this blog post talking about one of my recent publications with some of those grad students, and how it has made me rethink my relationship to the technology and, equally important, the people.

To make something the best

I would like to believe that those of us who get to be part of the development team for any of the BSDs do so because we want to make these things the best they can be. And I am sure I am not alone among the developer groups when I say what started as a technical pursuit turned into a personal one. I am still just as proud to be an OpenBSD developer. And I have made great friends along the way. I have gotten to know very wonderful people that I would not have ever met otherwise. I often think that the pursuit to make something the best has a reflexive effect that makes us grow to be our best.

Those are undoubtedly good things.

You have to examine it

In 2019, OpenBSD developer Todd Mortimer (mortimer@) published a paper at AsiaBSDCon describing a number of mitigations that you can teach your C compiler (LLVM, in OpenBSD's case) to defeat return-oriented programming (ROP) attacks. For our purposes, ROP attacks use what are called "gadgets," snippets of already existing code found in binaries and libraries on your system. You can creatively chain these gadgets together to perform arbitrary operations on the target computer. We want to try to eliminate as many of these gadgets as we can. We don't have to remove all of them: if we remove enough gadgets that an attacker cannot perform the actions they want to, that would be a win.
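To make "unique gadgets" a little more concrete, here is a minimal Python sketch. It is a deliberate simplification and an assumption on my part, not how ROPgadget works: real tools decode instructions, while this only collects raw byte sequences that end in the amd64 ret opcode (0xc3). The function name candidate_gadgets and the max_len parameter are hypothetical.

```python
RET = 0xC3  # amd64 opcode for a near `ret`


def candidate_gadgets(code: bytes, max_len: int = 4) -> set[bytes]:
    """Collect unique byte sequences of up to max_len bytes that end in ret.

    Simplification: no disassembly is done, so some of these sequences
    would not decode to meaningful instructions.
    """
    found = set()
    for end, byte in enumerate(code):
        if byte != RET:
            continue
        # Every suffix ending at this ret is a candidate gadget.
        for length in range(1, max_len + 1):
            start = end - length + 1
            if start < 0:
                break
            found.add(code[start:end + 1])
    return found


if __name__ == "__main__":
    # 90 90 c3 is nop; nop; ret
    for gadget in sorted(candidate_gadgets(bytes([0x90, 0x90, 0xC3])), key=len):
        print(gadget.hex(" "))
```

Because the result is a set, the same byte sequence found at two different addresses counts only once, which is the sense of "unique" used throughout this post.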

Focusing on the amd64 architecture for our exploration, Todd describes how he was able to remove a significant percentage of ROP gadgets through three techniques:

  1. Alternative register selection
  2. Compile-time instruction rewriting
  3. RETGUARD

Some of my grad students wanted to understand the efficacy of these techniques. Responding to the current supply chain shortages for memory, they recognized that older machines may be kept in service longer than originally intended. These older machines might lack recent hardware-based mitigations for ROP attacks, making OpenBSD's software-based mitigations potentially very attractive to port to other operating systems or to incorporate in upstream LLVM.

They quickly discovered that RETGUARD requires significant changes to the compiler, dynamic linker, and all libraries and binaries. So it was out. The other two, however, were trivial to port to FreeBSD (and likely every other operating system). So our plan was to port the alternative register selection and compile-time instruction rewriting mitigations to FreeBSD and test their effects in three ways:

  1. Reducing unique gadgets
  2. Increasing binary sizes
  3. Decreasing runtime performance

Todd left us some goalposts in his paper: the alternative register selection reduced unique kernel gadgets by about 6%, with "negligible impact on code size and performance," even going so far as to say this change is "entirely free" because it has no additional compile-time or runtime cost.

Todd claims the compile-time instruction rewriting removed an additional 5% of unique gadgets from the kernel, with a binary size increase of 0.15%.

A plain-text reading of the article suggests that we should see unique gadgets reduced by approximately 11% with both mitigations turned on, at a cost of only a 0.15% increase in binary size. We noticed rather early on that the binary size increase should be even smaller, because more recent versions of LLVM and GCC encode the rewritten instructions with an overhead of 4 bytes, as opposed to the 6 bytes Todd reported from the older version of LLVM. The instruction rewriting wraps the target instruction in a pair of xchgq instructions and flips the source and destination registers in the target instruction.

In Todd's paper, using an older version of LLVM, the xchgq instruction was 3 bytes in size:

	48 87 d8	xchgq %rbx, %rax

Modern versions of both LLVM and GCC will always output a 2-byte instruction for this:

	48 93		xchgq %rbx, %rax

In fact, whether you write xchgq %rax, %rbx or xchgq %rbx, %rax, both compilers will output the two-byte instruction, because the order of the registers makes no semantic difference. Each xchgq is therefore one byte shorter, which is where the 4-byte overhead (down from 6) comes from.
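The two encodings can be reproduced by hand. The following Python sketch is my own illustration, not code from LLVM, GCC, or the paper: it builds the general 87 /r form and the shorter accumulator form 90+rd for a register-to-register xchgq. The function names xchgq_long and xchgq_short are hypothetical.

```python
# Register numbers used by the amd64 instruction encoding.
REGS = {name: num for num, name in enumerate(
    ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
     "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"])}

REX_W = 0x48  # REX prefix with W=1, selecting 64-bit operands


def xchgq_long(src: str, dst: str) -> bytes:
    """Encode AT&T `xchgq %src, %dst` with the general 87 /r form."""
    reg, rm = REGS[src], REGS[dst]
    # REX.R extends the reg field, REX.B extends the r/m field.
    rex = REX_W | ((reg >> 3) << 2) | (rm >> 3)
    # mod=11 selects register-to-register addressing.
    modrm = 0xC0 | ((reg & 7) << 3) | (rm & 7)
    return bytes([rex, 0x87, modrm])


def xchgq_short(a: str, b: str) -> bytes:
    """Encode the accumulator form 90+rd; one operand must be %rax."""
    if "rax" not in (a, b):
        raise ValueError("the short form needs %rax as one operand")
    other = REGS[b] if a == "rax" else REGS[a]
    # REX.B extends the register number carried in the opcode byte.
    rex = REX_W | (other >> 3)
    return bytes([rex, 0x90 + (other & 7)])


if __name__ == "__main__":
    long_form = xchgq_long("rbx", "rax")
    short_form = xchgq_short("rbx", "rax")
    print(long_form.hex(" "), "->", 2 * len(long_form), "bytes per wrapped instruction")
    print(short_form.hex(" "), "->", 2 * len(short_form), "bytes per wrapped instruction")
```

Note that the short form exists only when one of the two registers is %rax, which is the case in the example above.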

We were excited: the binary size increase should be even smaller than what Todd found.

We were looking forward to a straightforward publication where we could corroborate Todd's successes and advocate that everyone adopt these basically free and powerful mitigations.

Even, or especially, when it falls short

To test, we installed FreeBSD 14.3, the latest version at the time, which conveniently used the same version of LLVM as OpenBSD 7.8: 19.1.7. After installing FreeBSD on our machine, we used ROPgadget to collect the number of unique gadgets in the FreeBSD kernel and libc. We chose ROPgadget because that is what Todd chose to use in his paper. We used GNU size to collect binary size data for the kernel and libc. We then built and installed a suite of ten open source projects, and collected unique gadget counts and binary sizes for them. Finally, we ran each project's test suite 50 times to collect runtime data.

After this, we added only the alternative register selection mitigation to LLVM, rebuilt all of FreeBSD kernel and world twice, rebuilt the ten open source projects, and redid all the data collection. We then removed the alternative register selection mitigation from LLVM and replaced it with the compile-time instruction rewriting mitigation, and repeated the entire process. Finally, we added back the alternative register selection mitigation to LLVM, combining it with the instruction rewriting mitigation, and repeated the entire process one last time. This gave us four sets of data: one with no mitigations, one with only the alternative register selection mitigation, one with only the compile-time instruction rewriting mitigation, and one with both mitigations. We could now analyze all this data to understand the true effects of these mitigations.

But our results were not so clean-cut. Instead of finding obvious wins, we found a nuanced story that suggests these mitigations are at best very modest, at worst at least partially flawed.

Facing reality

There was nothing we could do to match the claimed efficacy of these mitigations. While we did see reduction in unique gadget numbers, nothing was "free." Additionally, these two mitigations are not cumulative: they actually interact with each other and in some cases can reduce effectiveness when combined. Far from the clear and obvious wins previously reported, we see a tale of nuance and modesty.

For context and clarity: when we talk about "binary size increase" in this section, we are specifically referring to the .text ELF section, the section where executable code lives.

Starting with the alternative register selection mitigation, it performed somewhere between subpar and potentially detrimental. The FreeBSD kernel saw a 0.3% unique gadget decrease, with a 0.5% binary size increase. For a mitigation that claimed a 6% reduction of unique gadgets while being "entirely free," this is quite underwhelming. Results were no better for libc: about 0.6% unique gadget reduction with a binary size increase of 0.4%.

The compile-time instruction rewriting mitigation fared substantially better, but still well below claimed results: the kernel saw a 3.6% unique gadget reduction with a 1.8% binary size increase. This is far less than the 5% unique gadget reduction and 0.15% binary size increase originally claimed. For libc, this mitigation produced a 1.9% unique gadget reduction with a 1.3% binary size increase.

Surprisingly, with both mitigations turned on, we see a potentially paradoxical result. For the kernel, both mitigations together yielded a unique gadget reduction of 3.5%, which is 0.1 percentage points less than the compile-time instruction rewriting mitigation by itself, combined with a binary size increase of 2.2%. That is to say, for the kernel, turning both mitigations on is worse on both objectives than the compile-time instruction rewriting mitigation by itself: fewer unique gadgets removed and a larger increase in binary size. This suggests that the two mitigations interact, and that you cannot simply add the results of one to the results of the other to get the total.

For libc, turning on both mitigations results in a 2% unique gadget reduction with a binary size increase of 1.7%. That is 0.1 percentage points better than the compile-time instruction rewriting mitigation on its own, at the cost of a further 0.4 percentage points of binary size.
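The kernel and libc figures above can be lined up against the naive "just add them" expectation. This is a small Python sketch using only the rounded percentages quoted in this post; the helper names naive_sum and dominated are hypothetical, and because the inputs are already rounded, the sums are only approximate.

```python
# (unique gadget reduction %, .text size increase %) as quoted in this post.
RESULTS = {
    "kernel": {"register": (0.3, 0.5), "rewrite": (3.6, 1.8), "both": (3.5, 2.2)},
    "libc":   {"register": (0.6, 0.4), "rewrite": (1.9, 1.3), "both": (2.0, 1.7)},
}


def naive_sum(target: str) -> tuple[float, float]:
    """What 'both' would be if the two mitigations simply added up."""
    reg, rew = RESULTS[target]["register"], RESULTS[target]["rewrite"]
    return (round(reg[0] + rew[0], 1), round(reg[1] + rew[1], 1))


def dominated(target: str, candidate: str, other: str) -> bool:
    """True if candidate removes no more gadgets and costs no less size
    than other, and is strictly worse on at least one of the two."""
    c_gadgets, c_size = RESULTS[target][candidate]
    o_gadgets, o_size = RESULTS[target][other]
    no_better = c_gadgets <= o_gadgets and c_size >= o_size
    strictly_worse = c_gadgets < o_gadgets or c_size > o_size
    return no_better and strictly_worse


if __name__ == "__main__":
    for target in RESULTS:
        print(target, "naive sum:", naive_sum(target),
              "measured:", RESULTS[target]["both"],
              "both dominated by rewrite:", dominated(target, "both", "rewrite"))
```

With these rounded figures, the measured gadget reduction for both mitigations together falls short of the naive sum for the kernel and for libc, and for the kernel the combination is dominated by the rewriting mitigation alone.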

It is clear that the alternative register selection does result in an increase in binary size. In fact, a paired t-test shows that it is a statistically significant increase.
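For readers unfamiliar with the paired t-test, here is a minimal sketch of how the statistic is computed from before-and-after measurements of the same binaries. The sizes in the example are made-up placeholders, not our data, and the function name paired_t is hypothetical; turning the statistic into a p-value would need a t-distribution table or a statistics library, which this sketch leaves out.

```python
import math
import statistics


def paired_t(before: list[float], after: list[float]) -> tuple[float, int]:
    """Return the paired t statistic and its degrees of freedom."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equally sized samples of at least 2 pairs")
    # The test works on the per-pair differences, not the raw values.
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)        # sample standard deviation
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    standard_error = sd_diff / math.sqrt(n)
    return mean_diff / standard_error, n - 1


if __name__ == "__main__":
    # Placeholder .text sizes for four hypothetical binaries (not real data).
    baseline = [100, 200, 300, 400]
    mitigated = [101, 202, 302, 403]
    t, dof = paired_t(baseline, mitigated)
    print(f"t = {t:.3f} with {dof} degrees of freedom")
```

Pairing matters here because each binary is compared against itself: the large differences in size between projects cancel out, leaving only the change attributable to the mitigation.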

For the userland programs, results are also consistent but severely underwhelming. Overall, the alternative register selection mitigation produced a 0.4% increase in binary size. The compile-time instruction rewriting mitigation produced a 1.5% increase in binary size. Both mitigations together produced a 1.8% binary size increase. The ranges were pretty tight: 0.2% to 0.5% for the alternative register selection mitigation, 0.6% to 2.0% for the compile-time instruction rewriting mitigation, and 1.0% to 2.2% for both mitigations together.

Unique gadget reduction numbers were very project-dependent. In nearly all cases, the unique gadget reduction with both mitigations turned on fell somewhere in between the reductions for each mitigation by itself. The best individual result was a 5.2% reduction in unique gadgets: this was with both mitigations turned on and applied to the GNU Multiple Precision Arithmetic Library. All other combinations produced reductions of between about 1% and 4%, with two exceptions.

The GNU coreutils and libgcrypt each saw increases in the number of unique gadgets when mitigations were applied. GNU coreutils saw a 3.7% increase in unique gadgets when both mitigations were applied. Additionally, libgcrypt saw a 3.4% increase in unique gadgets both with the compile-time instruction rewriting mitigation by itself and with both mitigations turned on. We will need to do further research to figure out why this happens.

We found that runtime results were all quite tight, seemingly within the margin of error I would expect given that these tests were running on an operating system that has more to do at any given moment than just run the test suites. I am comfortable saying that the runtime impact testing is effectively a wash, though I would like to retest this some time in the future.

Taken all together, our data suggests that these mitigations are best used on a case-by-case basis, where you compare the effects of the compile-time instruction rewriting mitigation against both mitigations together. There is almost no scenario where the alternative register selection on its own will be advantageous.

In order to make it better

I really did go into this assuming that we would fully prove these mitigations, and be able to point out that it was a shame no one else ever researched the power of these two simple-to-implement mitigations, and call for everyone to adopt them posthaste. It never crossed my mind that these two mitigations would fall so far short of what was originally claimed. And there may yet be good explanations for the disparity:

We are considering completely redoing this work with the original versions of OpenBSD where these mitigations were applied to determine the possibility of these two points.

However, that would only address the unique gadget reduction numbers. I am still uneasy about the original claims of "entirely free" for the alternative register selection mitigation and only 0.15% binary size increase in the OpenBSD kernel for the compile-time instruction rewrite mitigation. I don't see a situation where we can replicate the originally claimed benefits for their originally claimed cost. I think the reality is that these mitigations just are not that effective, and have costs that are too significant to ignore.

Our anonymous reviewers provided some very helpful additional food for thought: chiefly, that the reduction in the number of unique gadgets is not actually a useful metric. It feels like a useful metric until you realize that many gadgets are not useful and don't provide a context in which they could become useful. For example, nop; nop; ret is a gadget, but it would be decidedly difficult to find a way to use such a gadget in a real attack. We could have a situation where we removed a significant number of unique gadgets, but all of the removed gadgets were already not useful. In such a situation, we have succeeded in making our binaries larger and potentially slower, but not more secure. Conversely, we could have another mitigation that removes only a small handful of gadgets, but if it removes all the useful gadgets, then the remaining gadgets are not something to worry about even if the unique gadget count is still very high. Similarly, the "unique gadgets per kilobyte" metric introduced in the original paper is also of exceedingly limited utility; even if we grant the reductions claimed, we have no way of knowing if the gadgets removed were actually useful gadgets, or if we simply removed useless gadgets.

Indeed, ROPgadget never had an issue finding ROP chains, no matter what combination of mitigations was turned on. This points to these two mitigations not being helpful on their own; they may potentially be useful when combined with other techniques. But it would probably be just as good to leave them out entirely.

You might need to step back

For our paper, we concluded by saying that overall the experiment was a success: you will reduce unique gadget numbers applying these mitigations and binary size increases can be acceptable, and we were inconclusive about runtime performance impacts. For the paper, that is the correct conclusion: we wanted to know if these techniques will reduce your attack surface, and they (mostly) will. Even if they did not live up to their originally claimed effects, the effects we found are real. It is also the kind thing to do; we were building off of someone else's work, and we should treat the people who did the work kindly, even if we end up critiquing the work itself.

However, I now find myself in the awkward position where I am openly critiquing OpenBSD's security claims while also being "on the inside" as an OpenBSD developer. I think that is at best an untenable and undesirable position, at worst one of inescapable bias. But there is an easy solution to this problem. Two, actually. One is to stop pursuing this line of research; however, that ship has sailed since the research article that spawned this blog post is going to publication and will enter the annals of electrical engineering and applied computing research. The other is to remove myself from even the perception of playing both sides by at least semi-retiring from OpenBSD. I say semi-retiring because in the future it might be the case that this line of research exhausts itself and I can feel free to return to OpenBSD as a developer. But that day is not today, and it doesn't appear to be tomorrow either. But the truly nice thing is that, with very few exceptions, a developer is a developer and can always return when they are able.

But this research project has also made me a bit uneasy. If one OpenBSD innovation is not as advertised, what about the others? Who is doing a systematic review of security features developed and championed by OpenBSD? And what about the other BSDs? Who is doing a systematic review of their security innovations?

I suppose that is where I can make my best impact nowadays. I have many undergraduate and grad students excited to do real-world research that makes a difference, and I suddenly have more projects than I can reasonably do at once. A research lab that (in part) systematically studies the BSDs sounds nice, and will provide opportunities for years to come. And who knows, maybe it will even foster some of the next generation of *BSD developers across all the projects. I think that would be a wonderful success, and one that more than compensates for my own semi-retirement.

I don't believe my responsibility to make the BSDs the best they can be has changed; only the way I go about making sure all the BSDs are the best they can be.

So that you can launch into the future

The good news is that this work will likely get me out to more *BSD conferences; I have missed being able to attend these. I have effectively stopped attending conferences that I am not speaking at, and with rare exceptions I no longer speak at conferences unless I bring students to speak with me.

I have also been removed as MAINTAINER from all my OpenBSD ports, so please don't ask me about those. Their maintenance has returned to the community, and it is the community you should engage with if you have any questions about my former ports. I might do a very occasional drive-by update for some of my high-priority targets, like DMD, but that's about it for now.

If any of the *BSD projects have ideas they would like us to work on, please get in touch with me. Our research lab efforts should be complementary, not antagonistic.
