User-space distributed tracing with BPF
October 3, 2021
Introduction
Tracepoints provide a way to instrument running programs dynamically in order to aid debugging. When no probe is attached, they are low cost (a single no-op instruction in the instruction stream). When a probe is attached, the no-op is replaced with a debug trap, and the probe provides a small BPF program which runs at that point.
This means that these tracepoints can be left on production systems, and used to help diagnose issues without having to use a different build. They seem particularly useful for analysing timings of events to isolate latency bottlenecks. Of course a debug trap entails a context switch, so an attached probe will incur a performance penalty.
An example
The following is a pretty minimal example of a traceable program. It uses the systemtap
tracepoint library (only the header is needed) to add a probe. On Debian this header is provided by the systemtap-sdt-dev
package.
/*
* probe-test.c
*
* Build with
* gcc -g probe-test.c -o probe-test
* Debug symbols are useful for aiding dissasembly in gdb
*/
#include <sys/sdt.h>
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>
int
main(int argc, char* argv[])
{
uint32_t ctr = 0;
while(true)
{
ctr++;
DTRACE_PROBE1(probe-test, increment, ctr);
sleep(1);
}
return 0;
}
DTRACE_PROBE1
defines a probe with one argument, named increment
, namespaced by probe-test
.
bpfcc
allows us to write small pythons scripts which compile small BPF
programs, and to attach these to the probe points.
#
# trace-probe-test.py
#
import sys
from bcc import BPF, USDT
pid = sys.argv[1]
bpf_text = """
#include <uapi/linux/ptrace.h>
int
do_trace(struct pt_regs* ctx)
{
uint32_t ctr;
bpf_usdt_readarg(1, ctx, &ctr);
bpf_trace_printk("%u\\n", ctr);
return 0;
}
"""
proc = USDT(pid=int(pid))
proc.enable_probe(probe="increment", fn_name="do_trace")
bpf = BPF(text=bpf_text, usdt_contexts=[proc])
print("Attaching to increment probe")
while True:
try:
task, pid, cpu, flags, ts, ctr = bpf.trace_fields()
except ValueError:
print("value error")
except KeyboardInterrupt:
exit()
print("CTR:\t{}".format(ctr.decode('utf=8')))
We can then launch the binary, and attach the probe:
$ ./probe-test &
[1] 841361
$ sudo python trace-probe-test.py 841361
# ...Compiler output elided...
Attaching to increment probe
CTR: 39
CTR: 40
CTR: 41
CTR: 42
CTR: 43
CTR: 44
Under the covers
It's possible to see how the probe is implemented by attaching gdb
to the running process and dissassembling main
.
Here is the instruction stream of main
when no probe is attached:
(gdb) disassemble main
Dump of assembler code for function main:
0x000056532a057135 <+0>: push %rbp
0x000056532a057136 <+1>: mov %rsp,%rbp
0x000056532a057139 <+4>: sub $0x20,%rsp
0x000056532a05713d <+8>: mov %edi,-0x14(%rbp)
0x000056532a057140 <+11>: mov %rsi,-0x20(%rbp)
0x000056532a057144 <+15>: movl $0x0,-0x4(%rbp)
0x000056532a05714b <+22>: addl $0x1,-0x4(%rbp)
0x000056532a05714f <+26>: nop
0x000056532a057150 <+27>: mov $0x1,%edi
0x000056532a057155 <+32>: call 0x56532a057030 <sleep@plt>
0x000056532a05715a <+37>: jmp 0x56532a05714b <main+22>
Notice the nop
instruction at offset 26. When a probe is attached this is replaced by an int3
instruction, which is typically used as a trap for a debugger.
(gdb) disassemble main
Dump of assembler code for function main:
0x000056532a057135 <+0>: push %rbp
0x000056532a057136 <+1>: mov %rsp,%rbp
0x000056532a057139 <+4>: sub $0x20,%rsp
0x000056532a05713d <+8>: mov %edi,-0x14(%rbp)
0x000056532a057140 <+11>: mov %rsi,-0x20(%rbp)
0x000056532a057144 <+15>: movl $0x0,-0x4(%rbp)
0x000056532a05714b <+22>: addl $0x1,-0x4(%rbp)
0x000056532a05714f <+26>: int3
0x000056532a057150 <+27>: mov $0x1,%edi
0x000056532a057155 <+32>: call 0x56532a057030 <sleep@plt>
0x000056532a05715a <+37>: jmp 0x56532a05714b <main+22>
How do we know where to locate the probe point, and if we have many probes, how do we know which is which? The DTRACE_PROBE
macros add notes to the ELF file detailing these locations. First, from objdump
we can see the offset of the probe is 114f
in the .text
section of the ELF file:
$ objdump -S probe-test
...
Disassembly of section .text:
...
int
main(int argc, char* argv[])
{
1135: 55 push %rbp
1136: 48 89 e5 mov %rsp,%rbp
1139: 48 83 ec 20 sub $0x20,%rsp
113d: 89 7d ec mov %edi,-0x14(%rbp)
1140: 48 89 75 e0 mov %rsi,-0x20(%rbp)
uint32_t ctr = 0;
1144: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%rbp)
while(true)
{
ctr++;
114b: 83 45 fc 01 addl $0x1,-0x4(%rbp)
DTRACE_PROBE1(probe-test, increment, ctr);
114f: 90 nop
sleep(1);
1150: bf 01 00 00 00 mov $0x1,%edi
1155: e8 d6 fe ff ff callq 1030 <sleep@plt>
ctr++;
115a: eb ef jmp 114b <main+0x16>
115c: 0f 1f 40 00 nopl 0x0(%rax)
...
And that the file has a note providing the location of this probe:
$ readelf -n probe-test
...
Displaying notes found in: .note.stapsdt
Owner Data size Description
stapsdt 0x00000038 NT_STAPSDT (SystemTap probe descriptors)
Provider: probe-test
Name: increment
Location: 0x000000000000114f, Base: 0x0000000000002004, Semaphore: 0x0000000000000000
Arguments: 4@-4(%rbp)