Debugging complex, real-time, multi-CPU programs like HPVMTao3D or the software in the DxO ONE can be a real challenge. You need detailed information about what’s going on, but at the same time, adding instrumentation, e.g. a big bunch of ad-hoc printf statements, can change the behavior of the program sufficiently to make the problem you are chasing go away. In addition, adding instrumentation is often a time-consuming incremental process, and if you build this instrumentation only to remove it later, your investment and time are wasted.

Instrumentation should be there when you need it

So what you really need is instrumentation that always stays in your program and is always active, so that you can look at the issue later. A flight recorder is one solution to this problem. The purpose of a flight-recorder is to continuously record detailed information about what is happening in the system, and to be able to dump it in case you need it, e.g. right after you detect an error, or during a debugging session.

I will explain here the design of the flight recorder I just added to the implementation of the ELFE programming language.

The flight recorder I will present is lock-free yet works with multiple CPUs. It supports multiple simultaneous writers and readers, although there is normally only one reader at a time. It can record events from different sources in such a way that infrequent but important events are not overwritten by high-frequency events. Yet it remains extremely simple, the current implementation clocking in at a puny 618 lines of C++, including many comments. In that respect alone, it differs widely from designs such as the Linux tracing system.

This flight recorder design is, unsurprisingly, quite close to the one I designed for HPVM, which was instrumental in helping a small team bring up a new OS rapidly. At the same time, it includes several refinements I did not think of for HPVM, most notably the use of multiple ring buffers for multiple events, and the use of printf-style formatting for convenience. The implementation in ELFE is designed to help me make the ELFE language support multi-threaded execution (and, hopefully, to bring that to Tao3D later). However, I also wanted this design to be easy to transpose to the DxO ONE system software, where such a facility would be really helpful.

I felt it was interesting to share this implementation, because after discussing with colleagues at work and browsing various source code, including the Linux kernel, it looks like this kind of design is much less widely known than I assumed.

Lock-free programming increases performance

Lock-free programming is a more efficient way to write concurrent programs: instead of preventing other CPUs from executing the same code with a lock or mutex, you write your code so that multiple CPUs can safely execute it at the same time. This usually involves atomic operations, which are special processor operations designed to synchronize data across multiple CPUs. In the following discussion, I will need two atomic operations: an atomic add, and an atomic swap.

The atomic add increments a memory location correctly even if multiple CPUs execute it simultaneously. Consider the following code:

volatile int x;
x = x + 1;

You might think it is safe to use across multiple CPUs, since you used volatile, but this is not the case. Suppose x is initially 0: two CPUs might start executing the code at the same time, both read the value 0 simultaneously, both compute the new value of x as being 1, and both write 1 into x. In that case, despite having two CPUs doing the addition, the final value of x would be 1 (written twice into the same memory location by two different CPUs) and not 2.

Atomic operations ensure proper sequencing in such cases, and an atomic add will guarantee that the final value of x will be 2 in the above scenario, irrespective of the order in which the CPUs execute the instructions. In this context, atomic specifically does not mean that the CPUs execute the operations as a single instruction (although x86 generally does). ARM32 for example uses two instructions (LDREX and STREX) and a loop, the loop being re-evaluated only in the rare case of contention, meaning two CPUs bumped into each other trying to write to the same location in memory.
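To make the contrast concrete, here is a minimal sketch of an atomic add using C++11 std::atomic. The recorder itself uses its own portable primitives; the names below are illustrative:

```cpp
#include <atomic>
#include <thread>

// Global counter updated with an atomic add: safe across CPUs
std::atomic<int> counter{0};

void bump(int times)
{
    for (int i = 0; i < times; i++)
        counter.fetch_add(1);        // atomic add, no lock needed
}

int parallel_count(int times)
{
    // With a plain "counter = counter + 1", two threads could lose
    // updates; with fetch_add the final value is exact.
    counter.store(0);
    std::thread t1(bump, times), t2(bump, times);
    t1.join();
    t2.join();
    return counter.load();
}
```

With two threads each incrementing 100000 times, the result is exactly 200000, whatever interleaving the CPUs produce.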

The atomic swap operation (also known as compare-and-swap) sets a location in memory only if it still contains the value you knew was there, and tells you whether the update took place. It allows you to safely update a linked list, for example, by updating the head of the list only if no other CPU updated it first. If another CPU overwrote the head of the list, then you update the item you are about to insert with the new head, and retry. A lock-free list insertion using an atomic swap therefore looks like this:

void insert(Link **head, Link *newItem)
{
    Link *currentHead;
    do
    {
        currentHead = *head;
        newItem->next = currentHead;
    } while (!atomic_swap(head, currentHead, newItem));
}

The loop re-executes only if another CPU updated the head between the time we read currentHead and the time we try to update *head. That is a pretty short window, and the only operation we redo in that case is updating the next link of the item we want to insert.
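For readers who want to map atomic_swap onto a standard primitive, here is a hypothetical equivalent using C++11 compare_exchange_strong; the actual implementation uses its own portable wrappers, and the names here are illustrative:

```cpp
#include <atomic>

struct Link { Link *next = nullptr; };

// Hypothetical atomic_swap: store 'replacement' into 'location' only
// if it still holds 'expected', and report whether the store happened.
bool atomic_swap(std::atomic<Link *> &location,
                 Link *expected, Link *replacement)
{
    return location.compare_exchange_strong(expected, replacement);
}

// The lock-free insertion from the article, expressed with that helper
void insert(std::atomic<Link *> &head, Link *newItem)
{
    Link *currentHead;
    do
    {
        currentHead = head.load();
        newItem->next = currentHead;
    } while (!atomic_swap(head, currentHead, newItem));
}
```

Note that compare_exchange_strong takes 'expected' by value here, so the retry loop simply reloads the head and tries again.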

Lock-free programming also increases performance by not forcing you to mark all variables as volatile. There is actually some documentation in the Linux kernel explaining why volatile should be considered harmful for high-performance multi-CPU code.

Ring buffers

At its core, the flight recorder uses ring buffers to store event records. A ring buffer is something that is conceptually very simple: you address an array using a modulo operation, so that indexes “wrap around” the array size. Then, you have a writer index that moves forward as you write (called a head in many places in the Linux source code) and a reader index that moves forward as you read (called a tail in Linux source code).

It is relatively easy to make a simple ring buffer lock-free. You simply need to use atomic add operations when updating the reader index or the writer index. There is one subtlety, however: the copy of the data to or from the circular buffer may not be atomic with respect to the index update. This is why we may need a third index, the commit index, telling us where it is safe to read from.

In that configuration, you first update the writer index, which tells you where to write, and after copying the data, you then update the commit index to tell the reader that the data is available. Multiple writers may then be writing in different “slots” of the ring buffer simultaneously. That leads us to the next subtlety: variable-size writes.
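Here is a simplified sketch of that scheme using C++11 std::atomic. It is an illustration under my own assumptions (single reader, no check for the writer overtaking the reader), not the ELFE code:

```cpp
#include <atomic>
#include <cstddef>

// Indexes grow monotonically and are reduced modulo SIZE on access.
template <typename T, size_t SIZE>
struct RingBuffer
{
    T                   data[SIZE];
    std::atomic<size_t> writer{0};   // next slot to reserve
    std::atomic<size_t> commit{0};   // slots published to readers
    std::atomic<size_t> reader{0};   // next slot to read

    void write(const T &item)
    {
        size_t index = writer.fetch_add(1);   // 1. reserve a slot
        data[index % SIZE] = item;            // 2. copy the data
        size_t expected = index;              // 3. publish: wait until
        while (!commit.compare_exchange_weak(expected, index + 1))
            expected = index;                 //    earlier writers commit
    }

    bool read(T &item)
    {
        size_t index = reader.load();
        if (index >= commit.load())
            return false;                     // nothing committed yet
        item = data[index % SIZE];
        reader.store(index + 1);              // single reader assumed
        return true;
    }
};
```

The commit loop is exactly the spin discussed later: a writer only advances the commit index once every earlier writer has done so.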

Ring buffers with variable-size writes

A frequent use for a ring buffer is to hold variable-sized messages. A good example is a printk-style facility, i.e. a way to print from practically any place in an operating system kernel. You write into a ring buffer placed at a memory location that is accessible in all contexts, including for example interrupt context, and a task then reads the circular buffer back, for example to send its contents to the console.

In that case, it is useful to be able to advance the writer index by a variable amount, corresponding to the size of the message being written. The algorithm then becomes something like:

  1. Advance the writer index to reserve some space. That way, the next writer will add its own message after yours, and messages will not be interleaved. If one CPU writes “Been There.” and the other writes “Done That.”, you want to get either “Been There. Done That.” or “Done That. Been There.” in the circular buffer, but not, for example, “BDoneeen T Thheart.”
  2. Copy the data from the source into the circular buffer. This can take a variable amount of time, since the messages may have very different size.
  3. Advance the commit index to tell the reader that it’s safe to read the message.

That last step is a bit tricky. The best I could come up with is to test whether, once you have copied all your data, the commit index is where the writer index was before you updated it.

  • If it is, then it means that you are the earliest writer. You can safely advance the commit index by the size of the message, and exit.
  • Otherwise, you need to spin until all earlier writers are done with their own copying. In other words, you wait until the commit index finally reaches the value the writer index had when you advanced it.
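The three steps above can be sketched in code. This is a simplified illustration under my own assumptions (fixed buffer size, no overflow handling, invented names), not the ELFE implementation:

```cpp
#include <atomic>
#include <cstddef>
#include <string>

struct MessageBuffer
{
    static const size_t SIZE = 256;
    char                data[SIZE];
    std::atomic<size_t> writer{0};
    std::atomic<size_t> commit{0};

    void write(const char *msg, size_t len)
    {
        // Step 1: reserve space by advancing the writer index
        size_t start = writer.fetch_add(len);

        // Step 2: copy the message into the reserved slots
        for (size_t i = 0; i < len; i++)
            data[(start + i) % SIZE] = msg[i];

        // Step 3: wait until earlier writers committed, then publish
        size_t expected = start;
        while (!commit.compare_exchange_weak(expected, start + len))
            expected = start;    // spin: earlier writers still copying
    }

    std::string committed() const
    {
        // Single-reader view of everything committed so far
        std::string result;
        for (size_t i = 0; i < commit.load() && i < SIZE; i++)
            result += data[i % SIZE];
        return result;
    }
};
```

Because each writer reserves its whole message before copying, two messages written concurrently end up in distinct, contiguous slots rather than interleaved character by character.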

The need for spin

It may seem annoying to wait in a spin loop like this, and frankly, I spent a bit of time trying to find a way to avoid spinning. The alternatives I could think of were all much worse. The good news is that spinning is not that bad: we only spin while earlier writers finish copying their messages into the ring buffer, i.e. for about the duration of a memcpy of those messages.

We are therefore talking about nanoseconds in the cases that are of practical interest for a printk-style use. So the chances that we need to spin, and if we need to, the amount of time we spin are both tiny.

There is still one problem, however, which is a potential priority inversion. Imagine that some code begins writing, and is interrupted on the same CPU by higher-priority code that also begins writing. That higher-priority code will start spinning, and will never yield to the lower-priority code that must finish its copy first: we have created a deadlock. This is why it is necessary to provide hooks that detect this situation and can get us out of it, for example by yielding to the lower-priority code.

Recording events

Events, unlike printk messages, have a fixed size. We want to record a useful amount of information while keeping the performance of the system high.

I settled for an event record that stores the following data:

  1. A printf-style format string. The format will be parsed only when dumping the recorder, so that we only copy pointers during recording, which is very fast.
  2. An event index that we obtain by atomically incrementing a global counter each time we record something. This allows us to order events globally across ring buffers.
  3. A timestamp, which I configured as a number of milliseconds since the program recorded its first entry. I selected milliseconds on 32-bit systems, which wraps around in 49 days, and microseconds for 64-bit systems (half a million years).
  4. The caller of the record call, which can for example be fed to a debugger to figure out where the call came from, in case the format string is not enough.
  5. Four arguments, stored as intptr_t values, which allows me to store either integers or pointers, as well as floating-point values: float on 32-bit systems, double on 64-bit systems.

This data structure has 8 fields which all have the size of a pointer, so it always has a power-of-two size. This makes the code generated when addressing the elements of the ring buffer more efficient on most CPU architectures.
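Under those constraints, the record can be sketched as follows; the field names are my own, not necessarily those of the ELFE implementation:

```cpp
#include <cstdint>

// Eight pointer-sized fields, so the record size is a power of two
struct Entry
{
    const char *format;     // printf-style format, parsed only at dump time
    uintptr_t   order;      // global event index across all ring buffers
    uintptr_t   timestamp;  // ms (32-bit) or us (64-bit) since first record
    uintptr_t   caller;     // return address of the record call
    uintptr_t   args[4];    // four pointer-sized arguments
};

static_assert(sizeof(Entry) == 8 * sizeof(void *),
              "event record size is a power of two");
```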

Recording events uses operator() for convenience. For example, if you declare a recorder called EVENTS, you can record a particular event as follows:

EVENTS("Entering function foo(x=%d, y=%d, z=%f)", x, y, z);
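A minimal sketch of how operator() can provide that syntax. This is an illustration with invented names, not the actual ELFE class, and it does not handle the floating-point encoding discussed below:

```cpp
#include <cstdint>
#include <vector>

struct Recorder
{
    struct Entry
    {
        const char *format;
        uintptr_t   args[4];
    };

    // operator() makes recording look like a printf call. Only the
    // format pointer and pointer-sized arguments are stored; nothing
    // is formatted at record time.
    void operator()(const char *format,
                    uintptr_t a = 0, uintptr_t b = 0,
                    uintptr_t c = 0, uintptr_t d = 0)
    {
        // A real implementation appends to a lock-free ring buffer,
        // along with an order, a timestamp and the caller's address.
        entries.push_back(Entry{ format, { a, b, c, d } });
    }

    std::vector<Entry> entries;   // stand-in for the ring buffer
};
```

Usage then reads naturally: declare `Recorder EVENTS;` and call `EVENTS("Entering foo(x=%d, y=%d)", x, y);`.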

Varargs and floating-point values

For convenience and ease of use, the recorder stores a format string in classical printf format. Ideally, we could simply format the output using a function like snprintf. Unfortunately, modern architectures like x86-64 or ARM no longer use the same registers to pass integer values (including pointers) and floating-point values.

For example, if you call foo(1,2.5,42) on ARM32, the value 1 gets passed in integer register R0, the value 2.5 in floating-point register D0, and 42 in integer register R1. This means that a variadic function like snprintf will not fetch the arguments from the same locations, depending on the format string.

For that reason, the recorder dump needs to scan the format string at least enough to determine whether a given intptr_t in the event record represents a floating-point value or an integer/pointer value.
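A minimal sketch of such a scan, under my own simplifying assumptions: at most four arguments, and no handling of length modifiers like "l" (a real implementation would skip those too). The helper name is hypothetical:

```cpp
#include <cctype>
#include <cstring>

// Returns a bitmask where bit i is set if argument i of the format
// is a floating-point conversion (%f, %e, %g, %a and variants).
unsigned float_argument_mask(const char *format)
{
    unsigned mask = 0, arg = 0;
    const char *p = format;
    while (*p && arg < 4)
    {
        if (*p++ != '%')
            continue;                               // ordinary character
        if (*p == '%')
        {
            p++;                                    // literal "%%"
            continue;
        }
        while (*p && !isalpha((unsigned char) *p))
            p++;                                    // flags, width, precision
        if (!*p)
            break;                                  // malformed format
        if (strchr("feEgGaA", *p))
            mask |= 1u << arg;                      // floating-point argument
        arg++;
        p++;
    }
    return mask;
}
```

The dump can then reinterpret the stored intptr_t bits as float or double for the flagged arguments, and pass the rest through unchanged.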

Testing using the Towers of Hanoi

We can use the classical example of the Towers of Hanoi to illustrate how to use the API and what its benefits are. First, consider an implementation of the algorithm using standard output to the console:

#include <iostream>

enum post { LEFT, MIDDLE, RIGHT };
const char *postName[] = { "LEFT", "MIDDLE", "RIGHT" };

void hanoi_print(int n, post left, post right, post middle)
{
    if (n == 1)
    {
        std::cout << "Move disk from "
                  << postName[left] << " to "
                  << postName[right] << "\n";
    }
    else
    {
        hanoi_print(n-1, left, middle, right);
        hanoi_print(1, left, right, middle);
        hanoi_print(n-1, middle, right, left);
    }
}

A version that uses the flight recorder would instead look like this, using several recorders to record the moves (which interest us in general) and more detailed information about what is going on while executing the algorithm:

#include "flight_recorder.h"
ELFE::Recorder MOVE("Moves");         // Moves between Hanoi towers
ELFE::Recorder RECURSE("Recursion");  // Recursion events
ELFE::Recorder CALLS("Calls");        // Calls to the function

void hanoi_record(int n, post left, post right, post middle)
{
    CALLS("n=%d, left=%-6s, right=%-6s, middle=%-6s",
          n, postName[left], postName[right], postName[middle]);

    if (n == 1)
    {
        MOVE("Move disk from %s to %s", postName[left], postName[right]);
    }
    else
    {
        RECURSE("Recurse #1 n=%d", n);
        hanoi_record(n-1, left, middle, right);
        RECURSE("Recurse #2 n=%d", n);
        hanoi_record(1, left, right, middle);
        RECURSE("Recurse #3 n=%d", n);
        hanoi_record(n-1, middle, right, left);
    }
}

We can now compare the two implementations very simply:

#include <cstdlib>                    // for atoi
ELFE::Recorder TIMING("Timing");      // Timing

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++)
    {
        int count = atoi(argv[i]);
        TIMING("Begin printing Hanoi with %d", count);
        hanoi_print(count, LEFT, MIDDLE, RIGHT);
        TIMING("End printing Hanoi with %d", count);
        TIMING("Begin recording Hanoi with %d", count);
        hanoi_record(count, LEFT, MIDDLE, RIGHT);
        TIMING("End recording Hanoi with %d", count);
    }
    return 0;
}

Here is the Timing output of this program on my laptop for a problem size of 6:

0 [0.000000:0x10cd52060] Timing: Begin printing Hanoi with 6
1 [0.027581:0x10cd520c4] Timing: End printing Hanoi with 6
2 [0.027582:0x10cd52114] Timing: Begin recording Hanoi with 6
253 [0.027638:0x10cd5218a] Timing: End recording Hanoi with 6

It’s interesting to see that, despite having much more information about what is going on, the flight-recorder solution takes only 56us, whereas the solution that just prints the moves to the standard output takes 27500us, almost 500 times slower.

And then, of course, you have all the records for what happened during the execution of that small program:

FLIGHT RECORDER DUMP  AT Fri Dec  9 09:49:22 2016
14 [0.027585:0x10cd51c5a] Moves: Move disk from LEFT to RIGHT
17 [0.027586:0x10cd51c5a] Moves: Move disk from LEFT to MIDDLE
20 [0.027587:0x10cd51c5a] Moves: Move disk from RIGHT to MIDDLE
23 [0.027587:0x10cd51c5a] Moves: Move disk from LEFT to RIGHT
28 [0.027588:0x10cd51c5a] Moves: Move disk from MIDDLE to LEFT
DUMPING Recursion SIZE 128, 93 ENTRIES
4 [0.027583:0x10cd51caf] Recursion: Recurse #1 n=6
6 [0.027584:0x10cd51caf] Recursion: Recurse #1 n=5
8 [0.027584:0x10cd51caf] Recursion: Recurse #1 n=4
10 [0.027585:0x10cd51caf] Recursion: Recurse #1 n=3
12 [0.027585:0x10cd51caf] Recursion: Recurse #1 n=2
3 [0.027583:0x10cd51be5] Calls: n=6, left=LEFT  , right=MIDDLE, middle=RIGHT 
5 [0.027584:0x10cd51be5] Calls: n=5, left=LEFT  , right=RIGHT , middle=MIDDLE
7 [0.027584:0x10cd51be5] Calls: n=4, left=LEFT  , right=MIDDLE, middle=RIGHT 
9 [0.027584:0x10cd51be5] Calls: n=3, left=LEFT  , right=RIGHT , middle=MIDDLE
11 [0.027585:0x10cd51be5] Calls: n=2, left=LEFT  , right=MIDDLE, middle=RIGHT 
13 [0.027585:0x10cd51be5] Calls: n=1, left=LEFT  , right=RIGHT , middle=MIDDLE
16 [0.027586:0x10cd51be5] Calls: n=1, left=LEFT  , right=MIDDLE, middle=RIGHT 
19 [0.027587:0x10cd51be5] Calls: n=1, left=RIGHT , right=MIDDLE, middle=LEFT  
22 [0.027587:0x10cd51be5] Calls: n=1, left=LEFT  , right=RIGHT , middle=MIDDLE
25 [0.027588:0x10cd51be5] Calls: n=2, left=MIDDLE, right=RIGHT , middle=LEFT  
27 [0.027588:0x10cd51be5] Calls: n=1, left=MIDDLE, right=LEFT  , middle=RIGHT 

Performance and scalability

This performance gain is one of the primary reasons to use a flight recorder instead of standard printouts. The recording performance of the proposed implementation was measured on a variety of systems.

Platform                                 Time to record
8-core 2.8GHz Core i7 (1 thread)         ~55ns
8-core 2.8GHz Core i7 (256 threads)      ~60ns
500MHz 32-bit Cortex A9                  450ns

This means that except under the most stringent conditions, it is possible to record practically anything that is happening in the system as it is happening, and replay it back later, if and when you need it.


A flight recorder allows you to instrument your code very precisely without being intrusive. It collects detailed information about the execution of your program, including precise timing. By recording data in multiple ring buffers, it lets you preserve old, but relevant data while continuously recording the latest high-frequency events.

Later updates

Some experience integrating the flight recorder into other projects prompted me to rewrite it in straight C, using macros instead of templates. So you write RECORD(TIMING, "Duration %u", duration) instead of TIMING("Duration %u", duration), but otherwise it's quite similar.

The C implementation initially was not type-safe for floating-point values. You had to manually wrap them in D2I and F2I calls. I later found a way to fix that problem by using a C11 feature, _Generic. This lets the recorder code automatically dispatch to the correct function.

I’m currently attempting to leverage flight-recorder instrumentation for other uses, in order to avoid the need to multiply instrumentation probes. Among the candidate targets for the collected data: tracing systems such as dtrace or systemtap, the ability to immediately print selected messages, and a more difficult one, continuous data logging (e.g. for performance data).

7 thoughts on “A real-time, lock-free, multi-CPU flight recorder”

  1. Hi foo, No I did not know about Dapper. It seems to be Java based, and the links to the source code seem to be dead. I like the idea of recording the beginning and end of operations. Regarding the hierarchy, I am not sure it would help me in any scenario I can think of. I’ll dig deeper.

  2. Also, they mention they use thread local storage. Then, I can understand they get 40ns per record. But one of the key features of the proposed implementation is that it provides a global order across CPUs. Something that is easy to do on a SMP system, harder on a truly distributed system.

  3. I suspect Dapper at Google is written in C++, though indeed the folks at Twitter replicated the concept in Java and called that project Zipkin. There is now a small community around the project to further the use of this concept.

    Dapper as a concept isn’t limited to Java. I know of a C implementation that is used in production to generate performance data that gives insight like no other tool, and not just across distributed processes, but also inside complex processes, with microsecond precision on timestamps.

    Dapper clearly is different from a FDR (flight data recorder), if only because in general it relies on sampling and thus is incomplete (though 100% sampling is often possible and useful). Also, no global order, but instead a different kind of order: it records causality of operations, that is “operation X starts while current thread is accomplishing operation Y”, which can be shown as a tree of boxes in a timeline view.

    This timeline view is extremely useful in investigating root causes of latency.

    Given that the number of CPU cores in a computer keeps increasing, just like the number of computers running a single application, you may well want to structure modern applications into agents that make good use of each core and communicate when needed: there you have a distributed system again! 🙂

    Now, I do think that both FDR-like and Dapper-like tracing can benefit from using the same low-overhead tracing infrastructure, with variants for either case. And while your blog doesn’t mention it, there is also great benefit for both cases in spending a bit of time in visualization tools: in both cases, we generate huge amounts of data, and finding the needle in the haystack becomes itself an interesting challenge!
