Full-system dynamic tracing on Linux using eBPF and bpftrace

Linux has two well-known tracing tools:

strace allows you to see what system calls are being made.
ltrace allows you to see what dynamic library calls are being made.

Though useful, these tools are limited. What if you want to trace what happens inside a system call or library call? What if you want to do more than just logging calls, e.g. you want to compile statistics on certain behavior? What if you want to trace multiple processes and correlate data from multiple sources?

In 2019, there's finally a decent answer to that on Linux: bpftrace, based on eBPF technology. Bpftrace allows you to write small programs that execute whenever an event occurs.

This article shows you how to setup bpftrace and teaches you its basic usage. I'll also give an overview of how the tracing ecosystem looks like (e.g. "what's eBPF?") and how it came to be what it is today.

Table of contents

What is tracing?
DTrace: the father of tracing
The Linux tracing ecosystem
Installing bpftrace
- Installing prerequisites
- Main bpftrace installation
One-liner examples
Script syntax & I/O timing example
USDT probe example
Glibc ptmalloc memory arena allocation example
What trace points are available? What should I trace?
- Tip: kernel tracepoint structure format
Conclusion

What is tracing?

As stated before, bpftrace allows you to write small programs that execute whenever an event occurs.

What's an event? It could be a system call, a function call, or even something that happens inside such a call. It could also be a timer or hardware event, e.g. "50ms has elapsed since last time", "a page fault occurred", "a context switch occurred" or "a CPU cache miss occurred".
What can you do in response to an event? You can log something, compile statistics and execute arbitrary shell commands. You will have access to a variety of context information such as the current PID, stack trace, time, call arguments, return value, etc.
What are the use cases? Lots. You can figure out why an app is slow by compiling a list of slowest calls. You can figure out whether (and why) there is a memory leak. I use it to figure out why Ruby is using so much memory.

The great thing about bpftrace is that you don't need to recompile apps. No need to manually add print calls or other debugging code in the target app's source code. There isn't even a need to restart apps. All of this, with low overhead to boot. This makes bpftrace especially useful for debugging production systems, or in other scenarios where compiling is a problem.

DTrace: the father of tracing

For a long time, the best answer to tracing is DTrace, a full-system dynamic tracing framework originally developed by Sun Microsystems (the inventors of Java). Just like bpftrace, DTrace allows one to write small programs that execute in response to events. In fact, bpftrace and many key pieces of the ecosystem are heavily contributed to by Brendan Gregg, a prominent expert on DTrace who now works at Netflix. Which explains the similarities between DTrace and bpftrace.

DTrace presentation — Solaris DTrace introduction (2009) by S. Tripathi, Sun Microsystems

At some point, Sun open sourced DTrace. Today, DTrace is available in Solaris, FreeBSD, and macOS (though the macOS version is as good as defunct because System Integrity Protection broke many aspects of DTrace).

You saw that right… Linux is absent from that list. It's not an engineering issue but a licensing issue. DTrace was licensed under the CDDL instead of GPL. A port of DTrace to Linux has been available since 2011, but it has never been adopted by the upstream Linux developers. Early 2018, Oracle relicensed DTrace under the GPL, but by then it was already far too late.

The Linux tracing ecosystem

There's no doubt that tracing is very useful, and so the Linux community sought to develop their own solutions. But unlike Solaris, Linux is not governed by a single vendor, and so there was no concentrated effort to develop a full-fledged DTrace alternative. Instead, the Linux tracing ecosystem grew slowly and organically, solving problems piecemeal. Only recently has the ecosystem grown to be a serious competitor to DTrace.

Because of the organic growth, the ecosystem can feel a bit chaotic, consisting of many different components. Luckily, Julia Evans wrote an overview of the ecosystem (note: 2017, bpftrace did not exist).

The Linux tracing ecosystem as described by Julia Evans

Not everything is equally important. Let me summarize what I think are the most important pieces.

Event sources

Event data can come from the kernel or from userspace (apps and libraries). Some of them are automatically available without further upstream developer effort, others require manual annotations.

On the kernel side, kprobes is the mechanism that allows tracing any function call inside the kernel. This way you can trace system calls, but also what happens inside system calls (because system call entrypoints call other internal functions). You can also use kprobes to trace non-syscall kernel events, e.g. "buffered data is being flushed to disk", "a TCP packet is being sent over the network" or "a context switch is now occurring".

Kernel tracepoints allows tracing custom events that the kernel developers have defined. These events are not on the level of function calls. Kernel developers manually place TRACE_EVENT macros in the kernel code to define such tracepoints.

There are pros and cons to both. Kprobes work "automatically" because they do not require kernel developers to manually annotate their code, but kprobe events can change arbitrarily from kernel version to version because functions can be added, removed or renamed at any time.

Kernel tracepoints tend to stay more consistent over time, and can provide useful context information that may not be available in kprobes. With kprobes, one can access function call arguments. But with kernel tracepoints one can access any information the kernel developer decided to manually annotate.
On the userspace side, uprobes is like kprobes, but works on the userspace level. It's for tracing userspace function calls.

USDT probes ("Userland Statically Defined Tracing") are like kernel tracepoints for userspace. App developers have to manually annotate their code with USDT probes.

An interesting trivia: DTrace has long provided a C API for defining the DTrace-equivalent of USDT probes (through the DTRACE_PROBE macro). The Linux tracing ecosystem developers decided to stay source-compatible with that API, so any DTRACE_PROBE macros are automatically converted to USDT probes!

So in theory, strace can be implemented by using kprobes, and ltrace can be implemented using uprobes. I am unsure whether they already do that or not.

Frontends

Frontends are apps that allow users to easily make use of the event sources.

Let's consider how the event sources work. They roughly operate like this:

The kernel exposes a mechanism – typically some /proc or /sys file that you can write to – to register an intent to trace an event and what should happen when an event occurs.
Once registered, the kernel looks up the location in memory of the kernel/userspace function/tracepoint/USDT-probe, and modifies its code so that something else happens.
The result of that "something else" can be collected later through some mechanism.

You don't want to do all this by hand! So that's where frontends come in: they do all that stuff for you.

Frontends come in many flavors. In the area of eBPF-based frontends, some are very low-level and require you to intimately understand how to interface with the event sources and how eBPF bytecode works. Others are high-level and easy-to-use, but historically such frontends have had limited flexibility.

And that's why bpftrace – the newest frontend – is my favorite. It is easy and flexible, like DTrace. But it's quite new and so rough around the edges.

eBPF

eBPF is the new hotness in Linux tracing land and is what powers bpftrace. So when you trace an event, you want "something" to happen in the kernel. How do you define what "something" is in a flexible way? Why, with a programming language (or, computer-science-equivalently, with some sort of machine code) of course.

But what language, or what machine code? Native machine code is very dangerous, because when run by the kernel you can easily corrupt the kernel, bypass security, etc. Wouldn't it be nice if the language or machine code in question is limited in power (can only access safe portions of memory), and non-turing-complete (guaranteed to halt)?

An attacker tells kernel to run some x86 code whenever a TCP packet is sent. The code then emails kernel secrets to the attacker. — We don't want this to happen

Enter eBPF, short for Extended Berkeley Packet Filter. It's a high-performance virtual machine that runs in the kernel with the following properties/limitations:

All interactions with userspace happen through eBPF "maps" which are key-value stores.
There are no loops, so every eBPF program will finish within some bounded execution time.

But wait, "packet filter"? Yep: this was originally developed for filtering network packets. It's a surprisingly similar problem: when a packet is received (an event occurred), you want to perform some sort of administrator-defined action (reject, drop, forward, log, etc). This has to be fast, and so a virtual machine (with optional JIT compilation) was invented. The "extended" part refers to the fact eBPF is an improvement on the original Berkeley Packet Filter, in the sense that it's usable outside a networking context.

So there you go. With bpftrace you define what events to trace, and what should happen in response. Bpftrace compiles your high-level-bpftrace-language program to eBPF bytecode, listens on events and uploads the bytecode to the kernel.

The dark days before eBPF

Before eBPF entered the scene, the solutions were awkward, to say the least. SystemTap – which is sort-of the "most serious" Linux-land predecessor to bpftrace – compiles SystemTap scripts into C which in turn is compiled into a kernel module. That kernel module is then loaded.

This approach was very fragile and was not well-supported outside of RHEL. I never got it to work well on Ubuntu, which had a tendency to break SystemTap on every kernel upgrade due to changing kernel data structures. It was also slow because on every invocation of SystemTap it had to compile a kernel module. It is also said that SystemTap in the early days easily caused kernel panics.

Installing bpftrace

Let's get our hands dirty! This tutorial teaches you to use bpftrace on Ubuntu 18.04. Newer Ubuntu versions are not recommended because we need to install some software for which packages have not been built for newer Ubuntu versions.

Installing prerequisites

First, install Clang 7, libclang 7 and LLVM 7, including all development headers. We'll use packages provided by llvm.org because the ones in Ubuntu have an issue.

wget -O - https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add -
cat <<EOF | sudo tee -a /etc/apt/sources.list
deb http://apt.llvm.org/bionic/ llvm-toolchain-bionic-7 main
deb-src http://apt.llvm.org/bionic/ llvm-toolchain-bionic-7 main
EOF
sudo apt update
sudo apt install clang-7 libclang-7-dev llvm-7 llvm-7-dev

Next, install these:

sudo apt install bison cmake flex g++ git libelf-dev zlib1g-dev libfl-dev

Finally, install libbpfcc-dev from upstream packages instead of from the Ubuntu repository. The one in Ubuntu is missing a header. This issue has not been fixed even in Ubuntu 18.10.

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 4052245BD4284CDD
echo "deb https://repo.iovisor.org/apt/$(lsb_release -cs) $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/iovisor.list
sudo apt update
sudo apt install bcc-tools libbcc-examples linux-headers-$(uname -r)

Main bpftrace installation

Time to install bpftrace itself from source! Let's git clone it, compile it, and install it to /usr/local:

git clone https://github.com/iovisor/bpftrace
cd bpftrace
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=DEBUG ..
make -j4
sudo make install

And you're done! The bpftrace binary will be in installed in /usr/local/bin/bpftrace. You can change the install location using an argument to cmake, where the default is -DCMAKE_INSTALL_PREFIX=/usr/local.

One-liner examples

Let's try running a few bpftrace one-liner programs to get a feel of what you can do. These are from Brendan Gregg's One-Liners Tutorial, which has more details on each.

 # 1. Listing probes
bpftrace -l 'tracepoint:syscalls:sys_enter_*'

 # 2. Hello world
bpftrace -e 'BEGIN { printf("hello world\n"); }'

 # 3. File opens
bpftrace -e 'tracepoint:syscalls:sys_enter_open { printf("%s %s\n", comm, str(args->filename)); }'

 # 4. Syscall counts by process
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

 # 5. Distribution of read() bytes
bpftrace -e 'tracepoint:syscalls:sys_exit_read /pid == 18644/ { @bytes = hist(args->retval); }'

 # 6. Kernel dynamic tracing of read() bytes
bpftrace -e 'kretprobe:vfs_read { @bytes = lhist(retval, 0, 2000, 200); }'

 # 7. Timing read()s
bpftrace -e 'kprobe:vfs_read { @start[tid] = nsecs; }
    kretprobe:vfs_read /@start[tid]/ { @ns[comm] = hist(nsecs - @start[tid]); delete(@start[tid]); }'

 # 8. Count process-level events
bpftrace -e 'tracepoint:sched:sched* { @[name] = count(); } interval:s:5 { exit(); }'

 # 9. Profile on-CPU kernel stacks
bpftrace -e 'profile:hz:99 { @[stack] = count(); }'

 # 10. Scheduler tracing
bpftrace -e 'tracepoint:sched:sched_switch { @[stack] = count(); }'

 # 11. Block I/O tracing
bpftrace -e 'tracepoint:block:block_rq_complete { @ = hist(args->nr_sector * 512); }'

See Brendan Gregg's website to learn what output the above commands can generate.

Script syntax & I/O timing example

The string passed to '-e' is the bpftrace script's contents. The syntax is roughly a collection of:

<event source> /<optional filter>/ { <program body> }

Let's deconstruct example #7, the one about timing filesystem I/O read operations:

kprobe:vfs_read { @start[tid] = nsecs; }
<- 1 -><-- 2 -> <---------- 3 --------->

Trace an event from the kprobe mechanism, i.e. we're tracing the beginning of a kernel function.
The kernel function to trace is vfs_read, which is a function called when the kernel performs a read operation on a filesystem (VFS stands for "Virtual FileSystem", the filesystem abstraction framework in the kernel).
When vfs_read begins (i.e. before the function has done any work), the bpftrace program runs. It stores the current timestamp (in nanoseconds) into a global associative array variable named @start. The key used is tid, which refers to the current thread's ID.

kretprobe:vfs_read /@start[tid]/ { @ns[comm] = hist(nsecs - @start[tid]); delete(@start[tid]); }
<-- 1 --> <-- 2 -> <---- 3 ----> <----------------------------- 4 ----------------------------->

Trace an event from the kretprobe mechanism, which is like kprobe except it fires when the kernel function returns.
The kernel function to trace is vfs_read.
This is an optional filter. This checks whether the start time was previously recorded. Without this filter, this program may be run during a read and only catch the end, resulting in a time calculation of nsecs - 0, instead of nsecs - start.
The program body.
- nsecs - @start[tid] calculates how much time was spent since the beginning of the vfs_read function.
- @ns[comm] = hist(...) adds the given data to a power-of-two histogram, which is stored in @ns. The key comm refers to the current app's command name. So we'll have a histogram on a per-command basis.
- delete(...) removes the start time from the associative array because we no longer need it.

This is the final output. Note that all histograms are automatically printed. There is no need for an explicit 'print histogram' command. @ns is not a special variable so it is not the cause of the histogram being printed.

@ns[snmp-pass]:
[0, 1]                 0 |                                                    |
[2, 4)                 0 |                                                    |
[4, 8)                 0 |                                                    |
[8, 16)                0 |                                                    |
[16, 32)               0 |                                                    |
[32, 64)               0 |                                                    |
[64, 128)              0 |                                                    |
[128, 256)             0 |                                                    |
[256, 512)            27 |@@@@@@@@@                                           |
[512, 1k)            125 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
[1k, 2k)              22 |@@@@@@@                                             |
[2k, 4k)               1 |                                                    |
[4k, 8k)              10 |@@@                                                 |
[8k, 16k)              1 |                                                    |
[16k, 32k)             3 |@                                                   |
[32k, 64k)           144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64k, 128k)            7 |@@                                                  |
[128k, 256k)          28 |@@@@@@@@@@                                          |
[256k, 512k)           2 |                                                    |
[512k, 1M)             3 |@                                                   |
[1M, 2M)               1 |                                                    |

USDT probe example

Let's take this sample C code and store it in tracetest.c:

 #include <sys/sdt.h>
 #include <sys/time.h>
 #include <unistd.h>
 #include <stdio.h>

static long
myclock() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    DTRACE_PROBE1(tracetest, testprobe, tv.tv_sec);
    return tv.tv_sec;
}

int
main(int argc, char **argv) {
    while (1) {
        myclock();
        sleep(1);
    }
    return 0;
}

This program loops infinitely, calling myclock() once a second. myclock() queries the current time and returns the number of seconds since epoch.

The DTRACE_PROBE1 call here defines a static USDT tracing point.

The DTRACE_PROBE1 macro comes from sys/sdt.h. The official USDT macro that does the same thing is called STAP_PROBE1 (STAP is for SystemTap, which is the first Linux mechanism that supported USDT). But since USDT is source-compatible with DTrace userland probes, DTRACE_PROBE1 is simply an alias for STAP_PROBE1.
The 1st parameter is the provider name. I believe this is a leftover thing from DTrace, because bpftrace doesn't seem to do anything useful with it. However there is a caveat (which I discovered while debugging issue 328): the provider name must be equal to the app binary's name, otherwise bpftrace cannot find this tracepoint.
The 2nd parameter is the tracepoint's own name.
Any additional parameters are developer-supplied context. The 1 in DTRACE_PROBE1 means that we want to pass 1 additional parameter.

Let's ensure sys/sdt.h is available, and compile the program:

sudo apt install systemtap-sdt-dev
gcc tracetest.c -o tracetest -Wall -g

We instruct bpftrace to print the PID and "time is [number]" whenever testprobe is reached:

sudo bpftrace -e 'usdt:/full-path-to/tracetest:testprobe { printf("%d: time is %d\n", pid, arg0); }'

Bpftrace keeps running until we press Ctrl-C. So open a new terminal and run tracetest in there:

 # In a new terminal
./tracetest

Go back to the first terminal with bpftrace, and you should see something like:

Attaching 1 probe...
30909: time is 1549023215
30909: time is 1549023216
30909: time is 1549023217
...
^C

Glibc ptmalloc memory arena allocation example

I use bpftrace to research why Ruby uses so much memory. As part of my research, I need to understand how the glibc memory allocator uses memory arenas.

In order to optimize multicore performance, the glibc memory allocator allocates multiple "arenas" from the OS. When the app asks the memory allocator to allocate memory, then the memory allocator picks an arena which is not in use, and marks a piece of that arena as "in use". By having threads use different arenas, lock contention is reduced, resulting in better multithreaded performance.

But this approach can also result in a lot of waste, and it appears that this waste is what causes Ruby's memory usage to be so high. In order to better understand the nature of this waste, I asked myself: what does "picking an arena which is not in use" mean? It could mean one of:

On every malloc() call, the allocator iterates through all arenas, and picks one which is currently not locked. Only if everything is locked, will the allocator attempt to create a new arena.
On the first malloc() call of a specific thread (or when a thread starts), the allocator will pick a not-currently-locked arena. If everything is locked, the allocator attempts to create a new arena.
On the first malloc() call of a specific thread (or when a thread starts), the allocator attempts to create a new arena regardless of whether there is already an unlocked one. Only if the arena cannot be created (e.g. because the limit is reached) will it reuse an existing one.
There are probably more possibilities I haven't thought of.

The literature isn't really clear which one of these possibilities it is. I dived into the glibc source code, which suggested that the behavior is 3. But I wanted to experimentally verify whether I interpreted the source code right, without having to insert debugging statements into glibc.

This is the glibc memory allocator function that creates a new arena. It is only called after passing the limit checks.

static mstate
_int_new_arena(size_t size)
{
    mstate arena;

    size = calculate_how_much_memory_to_ask_from_os(size);
    arena = do_some_stuff_to_allocate_memory_from_os();

    LIBC_PROBE(memory_arena_new, 2, arena, size);

    do_more_stuff();
    return arena;
}

Can we use uprobes to trace the _int_new_arena function? Unfortunately no. For whatever reason, this symbol is not accessible in Ubuntu 18.04's glibc. Not even after installing the debugging symbols.

Luckily there is a USDT probe in that function. LIBC_PROBE is a macro that's an alias for STAP_PROBE.

The provider name is libc.
The probe name is memory_arena_new.
The 2 means that there are 2 additional developer-provided arguments.
arena is the address of the arena that was allocated from the OS, and size is how big the arena is.

Before we can use this probe, we need to work around issue 328. We need to symlink glibc to a path whose basename is exactly libc, because bpftrace expects the library name (which would otherwise be libc-2.27.so) to be equal to the provider name (libc).

ln -s /lib/x86_64-linux-gnu/libc-2.27.so /tmp/libc

Now we instruct bpftrace to attach to the USDT probe memory_arena_new, and whose provider name is libc:

sudo bpftrace -e 'usdt:/tmp/libc:memory_arena_new { printf("PID %d: created new arena at %p, size %d\n", pid, arg0, arg1); }'

In another terminal, we run Ruby, telling it to create 3 threads that do nothing, then exit after 1 second. Due to Ruby's global interpreter lock, malloc() should not be called concurrently by different threads.

ruby -e '3.times { Thread.new { } }; sleep 1'

Going back to the bpftrace terminal, we see:

Attaching 1 probe...
PID 431: created new arena at 0x7f40e8000020, size 576
PID 431: created new arena at 0x7f40e0000020, size 576
PID 431: created new arena at 0x7f40e4000020, size 576

So this answers our question! Every time a new Ruby thread is created, glibc allocates a new arena regardless of contention.

What trace points are available? What should I trace?

You can list all the hardware, timer, kprobe and kernel static tracepoints by running:

sudo bpftrace -l

You can list all the uprobe tracepoints (function symbols) of an app or library by running:

nm /path-to-binary

You can list all the USDT tracepoints of an app or library by running:

/usr/share/bcc/tools/tplist -l /path-to/binary

As for which tracepoints you should use: that requires some knowledge about the source code of the thing you want to trace. I recommend you to study the source code.

Tip: kernel tracepoint structure format

Here's a handy tip when tracing kernel tracepoints. You can inspect what argument fields are available by reading a file in /sys/kernel/debug/tracing/events!

For example, suppose that you want to trace madvise(..., MADV_DONTNEED) calls:

sudo bpftrace -l | grep madvise tells us that we can use the tracepoint:syscalls:sys_enter_madvise.

sudo cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_madvise/format gives us this:

name: sys_enter_madvise
ID: 569
format:
    field:unsigned short common_type;   offset:0;   size:2; signed:0;
    field:unsigned char common_flags;   offset:2;   size:1; signed:0;
    field:unsigned char common_preempt_count;   offset:3;   size:1; signed:0;
    field:int common_pid;   offset:4;   size:4; signed:1;

    field:int __syscall_nr; offset:8;   size:4; signed:1;
    field:unsigned long start;  offset:16;  size:8; signed:0;
    field:size_t len_in;    offset:24;  size:8; signed:0;
    field:int behavior; offset:32;  size:8; signed:0;

print fmt: "start: 0x%08lx, len_in: 0x%08lx, behavior: 0x%08lx", ((unsigned long)(REC->start)), ((unsigned long)(REC->len_in)), ((unsigned long)(REC->behavior))

Madvise's signature according to the manual page is (void *addr, size_t length, int advice). The last three fields in this struct correspond to those parameters!

What's MADV_DONTNEED's value? grep MADV_DONTNEED /usr/include tells us that it's 4:

/usr/include/x86_64-linux-gnu/bits/mman-linux.h:80:# define MADV_DONTNEED     4 /* Don't need these pages.  */

So, our bpftrace command becomes:

sudo bpftrace -e 'tracepoint:syscalls:sys_enter_madvise /args->behavior == 4/ { printf("madvise DONTNEED called\n"); }'

Conclusion

Bpftrace is wonderful! Bpftrace is the future!

If you want to learn more about it, then I recommend that you read its reference guide, as well as Brendan Gregg's 2019 introductory blog post.

Happy debugging.