Brian Robert Callahan

academic, developer, with an eye towards a brighter techno-social life



[prev]
[next]

2021-05-29
Profiling our assembler with the DMD compiler

If you care about performance, at some point you will need to use some tools to understand where the performance issues with your programs are. Do not assume that you can read the code and know where the performance problems lie: you will often be very wrong, spending a lot of time trying to optimize that which does not need it and fail to optimize that which does.

Let's spend some time today optimizing our assembler. I know it is already very fast, but I suspect we can make at least a couple of improvements. We can use a technique called profiling to learn about what our assembler is really doing during an invocation. Profiling is quite an old technique but it is still quite powerful. We won't dive into all the different kinds of profilers out there; we will pick one profiling tool and just use it. We're just learning about profiling today, after all.

Profiling with the Digital Mars D Compiler

The DMD compiler is the reference compiler for D. It provides a command line switch, -profile, that profiles runtime performance of the program it compiles. There is also an additional profiling switch, -profile=gc, which profiles memory. I don't care so much about profiling memory; our assembler is a very short-lived program that takes the traditional memory management approach of having no memory management. So we will only add the -profile switch to our DMD compilation. There is also a way to automagically do this work with DUB: dub build -b profile. I am going to use that dub invocation to make sure everything goes right.

Creating a profiling log

Now that we have a version of our assembler that can output profiling information, we need to actually create that information. Unfortunately, none of the programs in the a80 repository are really significant enough to get any meaningful results. Instead, I am going to use my port of the CP/M 2.2 assembler. It is identical to the original, with some syntax massaged to work with our assembler. This is a 2500 line program that uses most of the 8080 opcodes, so it is a good program to use for our profiling run.

We run our assembler just as we would any other invocation on the CP/M 2.2 assembler source code. Now, instead of only producing the assembler binary (which it does do), it also produces two new files: trace.def and trace.log. These files contain information about the invocation of our assembler: which functions are called, how many times they are called, how much time is taken by each function in total across the program's run. Those are the pieces of information that will be most useful to us today.

Finding tools to read the profiling details

If the two new files are supposed to be human-readable, they are not readable by this human. I could not find any real information about how to read these files anywhere in the D documentation, unfortunately. Maybe I missed it. If you know where it is, please let me know. However, in that searching, I did find that there is an external utility that can read the trace files and from them produce HTML pages that are easily readable and understandable by this human. After compiling this d-profile-viewer utility, you simply run it without arguments in the directory that contains the trace file and it will produce a trace.html file which you can then open in a web browser. That is what I did.

Reading the profiler output

Here is the output from our initial profiling run. Let's talk about it. Keep it up as well since we will perform subsequent profiling runs after we make some optimizations to see the improvements.

We begin with an overview table, followed by more details and pie charts for each function. The overview table contains the following information:

Remember that the F+D times and percentages may add up to well over 100%. The plain F times and percentages will add up to 100%.

Observations and optimizations

My first observation is that the a16 function seems to be taking a lot of time. In terms of F time, it is the largest function. It turns out the function was doing a lot of useless work: in both passes, if a label is being referenced, it checks for the existence of the label in the symbol table. However, that information is only necessary during the second pass. So we can save a good bit of time by moving the label check to the second pass only. I did that in this commit. It will decrease the time spent in a16 significantly. That's a pretty good optimization, and a good logic fix for our assembler.

My second observation is that whitespace is called 11152 times. That's far and away the most called function. It also seems to be taking a very long time considering how little work it does: we are spending 465 microseconds, or 3.35% of our total execution time, just parsing whitespace. Considering that the parser in total takes 1481 microseconds (F+D time), or 10.66% (F+D time %) of our total execution time, it seems unreasonable that a little over 30% of our parsing time should be on whitespace. The fact that we see this function on our list at all, when we do not see its sibling function endoftoken anywhere in our HTML file, may be telling. Perhaps DMD was able to inline endoftoken whereas it was not able to inline whitespace. In any case, we probably want to consider finding a way to inline whitespace to improve the performance of our parser.

On function inlining

The simple reality is that calling a function has some overhead. Call a function once or twice, and that overhead is negligible. Call a function 11,000 times like we are doing with whitespace, and that may well have some noticable overhead. We can save all of that overhead by inlining the function into the appropriate places. The DMD compiler does have an -inline switch that uses some heuristics to calculate the cost of inlining and inlines the function if its cost is below some threshold. I suppose that whitespace does not fall below the threshold. I attempted to convince the compiler to perform some inlining using the pragma(inline) syntax, but that was not enough to convince the compiler to perform inlining. Perhaps I used it incorrectly.

I decided to manually inline whitespace where it made sense and committed that change. While there, I added some more comments in the parser to make it more clear to future readers the different parts of the parser. It would be nice to keep the whitespace function as a subroutine of the parser and have the compiler be smart enough to inline it. Until then, I think the performance improvement of manual inlining makes too much sense not to do. If you know how to teach the compiler to inline the original subroutine, please let me know!

Profiling our new assembler

Here is the output from a profiling run made after making all our optimizations. It has some important differences from our original profiling run.

We can see that we reduced the time spent in a16 a little over 40%, from 4097 microseconds to 2402 microseconds. The a16 function is still a little expensive, but I think I know what is happening here: we are seeing the cost of traversing the symbol table. Remember that our symbol table has a rather simplistic data structure: it is an array of structs, where the struct holds a string for the symbol name and an unsigned short for the address of the label. While that is certainly quite simple and simplicity was an explicit goal of our assembler, it is definitely not the fastest data structure when we need to search through it. A hash table might be a better solution for search speed. In any event, the newly optimized a16 is not so expensive for what it does that I am concerned, so I will leave it as it is.

There is no longer a listing for whitespace. That makes a lot of sense, since we manually inlined it where needed. What I was not expecting was just how expensive whitespace truly was. In the original profiling run, the parser had an F time of 1016 microseconds and an F+D time of 1481 microseconds, taking 10.66% (F+D time %) of the total execution time. In the new profiling run, the parser had an F time of 714 microseconds and takes 6.43% (F+D time %) of the total execution time. Now that there are no calls to whitespace, F time and F+D time are the same for the parser.

A more performant assembler

Yes, our assembler was already very fast. But it does not hurt to make it even faster. Let's recap what we learned today.

First, while I would perhaps have discovered the logic bug in a16, the profiler made it obvious. That helped me find and fix it sooner than I would have without it.

Second, I never would have guessed that whitespace was performing so poorly. While you might say what is 465 microseconds really, the savings were actually quite a lot more. We more than halved the execution time of the parser due to the removal of all that overhead for each call to whitespace. In addition to wall clock time savings, we also reduced the percentage of time spent in the parser overall by nearly 40%, even after accounting for the a16 time reduction.

If you have never used a profiler before, I want to encourage you to give it a try. The DMD compiler makes it easy to get started with profiling, and as we have seen, the gains can be both unexpected and significant!

Top

RSS