User-space distributed tracing with BPF

October 3, 2021

Introduction

Tracepoints provide a way to instrument running programs dynamically in order to aid debugging. When no probe is attached, they are low cost (a single no-op instruction in the instruction stream). When a probe is attached, the no-op is replaced with a debug trap, and the probe provides a small BPF program which runs at that point.

This means that these tracepoints can be left on production systems, and used to help diagnose issues without having to use a different build. They seem particularly useful for analysing timings of events to isolate latency bottlenecks. Of course a debug trap entails a context switch, so an attached probe will incur a performance penalty.

An example

The following is a pretty minimal example of a traceable program. It uses the systemtap tracepoint library (only the header is needed) to add a probe. On Debian this header is provided by the systemtap-sdt-dev package.

/*
 * probe-test.c
 *
 * Build with
 *   gcc -g probe-test.c -o probe-test
 * Debug symbols are useful for aiding dissasembly in gdb
 */

#include <sys/sdt.h>

#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

int
main(int argc, char* argv[])
{
    uint32_t ctr = 0;

    while(true)
    {
        ctr++;
        DTRACE_PROBE1(probe-test, increment, ctr);
        sleep(1);
    }

    return 0;
}

DTRACE_PROBE1 defines a probe with one argument, named increment, namespaced by probe-test.

bpfcc allows us to write small pythons scripts which compile small BPF programs, and to attach these to the probe points.

#
# trace-probe-test.py
#
import sys
from bcc import BPF, USDT

pid = sys.argv[1]

bpf_text = """
#include <uapi/linux/ptrace.h>

int
do_trace(struct pt_regs* ctx)
{
    uint32_t ctr;
    bpf_usdt_readarg(1, ctx, &ctr);
    bpf_trace_printk("%u\\n", ctr);
    return 0;
}
"""

proc = USDT(pid=int(pid))
proc.enable_probe(probe="increment", fn_name="do_trace")

bpf = BPF(text=bpf_text, usdt_contexts=[proc])

print("Attaching to increment probe")
while True:
    try:
        task, pid, cpu, flags, ts, ctr = bpf.trace_fields()
    except ValueError:
        print("value error")
    except KeyboardInterrupt:
        exit()
    print("CTR:\t{}".format(ctr.decode('utf=8')))

We can then launch the binary, and attach the probe:

$ ./probe-test &
[1] 841361

$ sudo python trace-probe-test.py 841361
# ...Compiler output elided...
Attaching to increment probe
CTR:    39
CTR:    40
CTR:    41
CTR:    42
CTR:    43
CTR:    44

Under the covers

It's possible to see how the probe is implemented by attaching gdb to the running process and dissassembling main.

Here is the instruction stream of main when no probe is attached:

(gdb) disassemble main
Dump of assembler code for function main:
   0x000056532a057135 <+0>:     push   %rbp
   0x000056532a057136 <+1>:     mov    %rsp,%rbp
   0x000056532a057139 <+4>:     sub    $0x20,%rsp
   0x000056532a05713d <+8>:     mov    %edi,-0x14(%rbp)
   0x000056532a057140 <+11>:    mov    %rsi,-0x20(%rbp)
   0x000056532a057144 <+15>:    movl   $0x0,-0x4(%rbp)
   0x000056532a05714b <+22>:    addl   $0x1,-0x4(%rbp)
   0x000056532a05714f <+26>:    nop
   0x000056532a057150 <+27>:    mov    $0x1,%edi
   0x000056532a057155 <+32>:    call   0x56532a057030 <sleep@plt>
   0x000056532a05715a <+37>:    jmp    0x56532a05714b <main+22>

Notice the nop instruction at offset 26. When a probe is attached this is replaced by an int3 instruction, which is typically used as a trap for a debugger.

(gdb) disassemble main
Dump of assembler code for function main:
   0x000056532a057135 <+0>:     push   %rbp
   0x000056532a057136 <+1>:     mov    %rsp,%rbp
   0x000056532a057139 <+4>:     sub    $0x20,%rsp
   0x000056532a05713d <+8>:     mov    %edi,-0x14(%rbp)
   0x000056532a057140 <+11>:    mov    %rsi,-0x20(%rbp)
   0x000056532a057144 <+15>:    movl   $0x0,-0x4(%rbp)
   0x000056532a05714b <+22>:    addl   $0x1,-0x4(%rbp)
   0x000056532a05714f <+26>:    int3
   0x000056532a057150 <+27>:    mov    $0x1,%edi
   0x000056532a057155 <+32>:    call   0x56532a057030 <sleep@plt>
   0x000056532a05715a <+37>:    jmp    0x56532a05714b <main+22>

How do we know where to locate the probe point, and if we have many probes, how do we know which is which? The DTRACE_PROBE macros add notes to the ELF file detailing these locations. First, from objdump we can see the offset of the probe is 114f in the .text section of the ELF file:

$ objdump -S probe-test
...
Disassembly of section .text:
...
int
main(int argc, char* argv[])
{
    1135:       55                      push   %rbp
    1136:       48 89 e5                mov    %rsp,%rbp
    1139:       48 83 ec 20             sub    $0x20,%rsp
    113d:       89 7d ec                mov    %edi,-0x14(%rbp)
    1140:       48 89 75 e0             mov    %rsi,-0x20(%rbp)
    uint32_t ctr = 0;
    1144:       c7 45 fc 00 00 00 00    movl   $0x0,-0x4(%rbp)

    while(true)
    {
        ctr++;
    114b:       83 45 fc 01             addl   $0x1,-0x4(%rbp)
        DTRACE_PROBE1(probe-test, increment, ctr);
    114f:       90                      nop
        sleep(1);
    1150:       bf 01 00 00 00          mov    $0x1,%edi
    1155:       e8 d6 fe ff ff          callq  1030 <sleep@plt>
        ctr++;
    115a:       eb ef                   jmp    114b <main+0x16>
    115c:       0f 1f 40 00             nopl   0x0(%rax)

...

And that the file has a note providing the location of this probe:

$ readelf -n probe-test
...
Displaying notes found in: .note.stapsdt
  Owner                Data size        Description
  stapsdt              0x00000038       NT_STAPSDT (SystemTap probe descriptors)
    Provider: probe-test
    Name: increment
    Location: 0x000000000000114f, Base: 0x0000000000002004, Semaphore: 0x0000000000000000
    Arguments: 4@-4(%rbp)

References