I was asked by a client to investigate an issue they observed where the performance of some hardware became dramatically degraded after a kernel update on the server this hardware was connected to. This device is an external unit that communicates with the server using an FTDI USB-serial converter. The program running on the server that talks to this device is a proprietary binary blob I'll call the frobnicator. This blob came from some vendor; the client did not have access to sources, nor any real idea about how this frobnicator communicates or what it expects.
For years they were using a stock Linux 3.5 kernel shipped by Ubuntu with no issues. But upgrading to a stock 3.13 kernel was apparently causing the device to exhibit glitches, and generally misbehave.
Initial investigation
When asked to look into this, I first manipulated the server in various ways to see how the issue manifests, and what triggers it. I could see that:

- The frobnicator reports communication integrity (CI) values, and those were noticeably reduced on the newer kernel, which would explain the degraded performance
- Raising the system CPU load correlated with higher CI degradation, so a loose theory could be resource contention, where other tasks are taking resources away from the frobnicator
- Periodically, errors would appear in the frobnicator log file /var/log/frobnicator.log, and these corresponded with periods of poor performance
Now that there was a concrete event that coincided with the poor performance, this could be traced with a tool such as sysdig. The hope was to see what happened leading up to the reported error, and investigate that. So first I recorded a log:
$ sudo sysdig -w frobnicator.scap proc.name = frobnicator
Then found instances where it writes complaints into its log:
$ sysdig -r frobnicator.scap fd.name contains frobnicator.log

371858 12:01:43.587694261 0 frobnicator (21343) < open fd=10(<f>/var/log/frobnicator.log) name=frobnicator.log(/var/log/frobnicator.log) flags=15(O_APPEND|O_CREAT|O_RDWR) mode=0
371859 12:01:43.587713415 0 frobnicator (21343) > fstat fd=10(<f>/var/log/frobnicator.log)
371860 12:01:43.587714435 0 frobnicator (21343) < fstat res=0
371863 12:01:43.587748893 0 frobnicator (21343) > write fd=10(<f>/var/log/frobnicator.log) size=50
371864 12:01:43.587779337 0 frobnicator (21343) < write res=51 data=[254542.588] ERROR! OH NO! COULDN'T COMMUNICATE!!!
371865 12:01:43.587780740 0 frobnicator (21343) > close fd=10(<f>/var/log/frobnicator.log)
371866 12:01:43.587781852 0 frobnicator (21343) < close res=0
371872 12:01:43.587824754 0 frobnicator (21343) < open fd=10(<f>/var/log/frobnicator.log) name=frobnicator.log(/var/log/frobnicator.log) flags=15(O_APPEND|O_CREAT|O_RDWR) mode=0
371873 12:01:43.587831903 0 frobnicator (21343) > fstat fd=10(<f>/var/log/frobnicator.log)
371874 12:01:43.587832417 0 frobnicator (21343) < fstat res=0
371877 12:01:43.587838779 0 frobnicator (21343) > write fd=10(<f>/var/log/frobnicator.log) size=48
371878 12:01:43.587843915 0 frobnicator (21343) < write res=40 data=[254542.588] NOOO!! SOMETHING ELSE WENT WRONG!!!
371879 12:01:43.587844635 0 frobnicator (21343) > close fd=10(<f>/var/log/frobnicator.log)
371880 12:01:43.587845018 0 frobnicator (21343) < close res=0
And then looked at what happened leading up to it, searching for anything noteworthy:
$ sysdig -r frobnicator.scap evt.num \>= $((371858-10000)) and evt.num \<= 371858

...
365555 12:01:43.584116007 0 frobnicator (21343) > write fd=7(<f>/dev/ttyUSB2) size=7
365556 12:01:43.584120946 0 frobnicator (21343) < write res=7 data=.......
365557 12:01:43.584121497 0 frobnicator (21343) > fcntl fd=7(<f>/dev/ttyUSB2) cmd=5(F_SETFL)
365558 12:01:43.584121792 0 frobnicator (21343) < fcntl res=0(<f>/dev/pts/3)
365559 12:01:43.584123089 0 frobnicator (21343) > read fd=7(<f>/dev/ttyUSB2) size=32767
365560 12:01:43.584124530 0 frobnicator (21343) < read res=0 data=
365561 12:01:43.584125440 0 frobnicator (21343) > read fd=7(<f>/dev/ttyUSB2) size=32767
365562 12:01:43.584125967 0 frobnicator (21343) < read res=0 data=
365563 12:01:43.584126441 0 frobnicator (21343) > read fd=7(<f>/dev/ttyUSB2) size=32767
365564 12:01:43.584126830 0 frobnicator (21343) < read res=0 data=

..... lots more hammering read() that returns 0

371853 12:01:43.587643084 0 frobnicator (21343) > read fd=7(<f>/dev/ttyUSB2) size=32767
371854 12:01:43.587643466 0 frobnicator (21343) < read res=0 data=
371855 12:01:43.587651900 0 frobnicator (21343) > stat
371856 12:01:43.587660887 0 frobnicator (21343) < stat res=0 path=frobnicator.log(/var/log/frobnicator.log)
371857 12:01:43.587665751 0 frobnicator (21343) > open
371858 12:01:43.587694261 0 frobnicator (21343) < open fd=10(<f>/var/log/frobnicator.log) name=frobnicator.log(/var/log/frobnicator.log) flags=15(O_APPEND|O_CREAT|O_RDWR) mode=0
So we wrote something to the device at /dev/ttyUSB2, then hammered the port waiting for a reply, then gave up reading the port without getting any data, and complained into the log. (This hammering is embarrassing. Apparently the authors of frobnicator have never heard of select or poll. This yet again supports my theory that most proprietary software is proprietary not to protect some brilliant IP from theft, but rather to avoid embarrassment.) So presumably some timeout expired; the frobnicator likely expected to get a reply from the hardware before a certain amount of time elapsed, and this took too long. How long did it take?
$ echo $((.587651900 - .584120946))
0.0035309540000000306
So after about 3.5ms, we gave up. The serial device is set at 115200 baud. Let's say you need 10 bits' worth of time to transmit a single byte (this is about right, given the start and stop bits). To write 7 bytes and get at least 1 byte back you need
$ echo $(( 8.*10/115200 ))
0.00069444444444444447
0.7ms just for the communication. So I guess it's not crazy to want a reply within 3.5ms. Presumably it doesn't fail this way every time? Do we ever read anything from USB2? Under what circumstances?
$ sysdig -r frobnicator.scap evt.type = read and fd.name contains USB2 and evt.rawres \!= 0 | head -n 10

10190 12:01:41.803108311 1 frobnicator (21343) < read res=1 data=.
16753 12:01:41.834119153 1 frobnicator (21343) < read res=1 data=.
23252 12:01:41.865108212 1 frobnicator (21343) < read res=1 data=.
29817 12:01:41.896112925 1 frobnicator (21343) < read res=1 data=.
42142 12:01:41.959126061 1 frobnicator (21343) < read res=1 data=.
46319 12:01:41.989105762 1 frobnicator (21343) < read res=1 data=.
52241 12:01:42.020106289 3 frobnicator (21343) < read res=1 data=.
58206 12:01:42.051112845 3 frobnicator (21343) < read res=1 data=.
64759 12:01:42.082126350 3 frobnicator (21343) < read res=1 data=.
71562 12:01:42.113106478 3 frobnicator (21343) < read res=1 data=.

$ sysdig -r frobnicator.scap evt.type = write and fd.name contains USB2 and evt.num \< 42142 | tail -n1

37288 12:01:41.957115614 1 frobnicator (21343) < write res=7 data=.......

$ sysdig -r frobnicator.scap evt.num \>= 37288 and evt.num \<= 42142

37288 12:01:41.957115614 1 frobnicator (21343) < write res=7 data=.......
37289 12:01:41.957116233 1 frobnicator (21343) > fcntl fd=7(<f>/dev/ttyUSB2) cmd=5(F_SETFL)
37290 12:01:41.957116440 1 frobnicator (21343) < fcntl res=0(<f>/dev/pts/3)
37291 12:01:41.957117828 1 frobnicator (21343) > read fd=7(<f>/dev/ttyUSB2) size=32767
37292 12:01:41.957118640 1 frobnicator (21343) < read res=0 data=
37293 12:01:41.957119433 1 frobnicator (21343) > read fd=7(<f>/dev/ttyUSB2) size=32767
37294 12:01:41.957119972 1 frobnicator (21343) < read res=0 data=
37295 12:01:41.957120373 1 frobnicator (21343) > read fd=7(<f>/dev/ttyUSB2) size=32767
37296 12:01:41.957120901 1 frobnicator (21343) < read res=0 data=

... again. we hammer the read()

42133 12:01:41.959120974 1 frobnicator (21343) > read fd=7(<f>/dev/ttyUSB2) size=32767
42134 12:01:41.959121368 1 frobnicator (21343) < read res=0 data=
42135 12:01:41.959121769 1 frobnicator (21343) > read fd=7(<f>/dev/ttyUSB2) size=32767
42136 12:01:41.959122160 1 frobnicator (21343) < read res=0 data=
42137 12:01:41.959122719 1 frobnicator (21343) > read fd=7(<f>/dev/ttyUSB2) size=32767
42138 12:01:41.959123119 1 frobnicator (21343) < read res=0 data=
42139 12:01:41.959123690 1 frobnicator (21343) > read fd=7(<f>/dev/ttyUSB2) size=32767
42140 12:01:41.959124311 1 frobnicator (21343) < read res=0 data=
42141 12:01:41.959124846 1 frobnicator (21343) > read fd=7(<f>/dev/ttyUSB2) size=32767
42142 12:01:41.959126061 1 frobnicator (21343) < read res=1 data=.
So that time, we wrote to /dev/ttyUSB2 and did receive a response of exactly one byte. The delay there was
$ echo $((.959126061 - .957115614 ))
0.0020104469999999708
about 2ms. OK. That's reasonable too. So it very well could be that we complain if the response comes after 3.5ms. What's the distribution of these response times?
$ sysdig -t r -r frobnicator.scap fd.name contains frobnicator.log or \
    \( fd.name contains ttyUSB2 and \
       \( \( evt.type = read and evt.dir = '<' and evt.rawres \!= 0\) or \
          \( evt.type = write and evt.dir = '>' \) \) \) | \
  perl -anE 'if(/write.*USB2/) { $t0 = $F[1]; } elsif(defined $t0 && /read|frobnicator.log/ ) { $d = $F[1] - $t0; say $d*1000; $t0 = undef; }' | \
  feedgnuplot --histo 0 --binwidth 0.1 --with boxes --set 'style fill solid border -1' \
    --title "Histogram of /dev/ttyUSB2 reply times" --xlabel "Reply (ms)" -ylabel Count
OK, so we look at the time interval between a write() to /dev/ttyUSB2 and either when we read() something from it, or touch frobnicator.log in any way. Looks like the usual time is 1.5ms or so, and we give up at 3.5ms, as expected:
I did the same thing on a working machine running the known-good 3.5 kernel: no frobnicator.log error messages were observed, and a response histogram made this way is clean, never breaking 3.5ms. So this is consistent with the theory that these 3.5ms timeouts are involved in the poor performance we are seeing.
I should say that we are running a vanilla Linux kernel from Ubuntu. It appears that the frobnicator is expecting realtime performance from this kernel, but such realtime performance was never guaranteed. So it's unlikely that the kernel is doing anything more "wrong" now than it was before; the frobnicator just has unreasonable expectations. Anyway…
Initial kernel tracing
We see that an expected response from the USB device does not arrive, or arrives too late. Many things could be causing this. I was hoping this is purely a software issue, so I dug into the kernel. This isn't something that sysdig can do, so I switched to perf here. I placed lots of probes into the path from the time we write() to /dev/ttyUSB2 to the time we either receive a response or give up. Eventually I ended up with this recordall script:
#!/bin/bash
set -e

VMLINUX=/usr/lib/debug/boot/vmlinux-3.13.0-58-generic
USBSERIALDIR=/usr/lib/debug/lib/modules/3.13.0-58-generic/kernel/drivers/usb/serial
USBSERIAL=$USBSERIALDIR/usbserial.ko
FTDI=$USBSERIALDIR/ftdi_sio.ko
SOURCE=/chattel/linux

sudo perf probe --del '*' || true # let this fail

sudo perf probe -k $VMLINUX -s $SOURCE -m $FTDI --add 'ftdi_process_read_urb urb->actual_length urb->status urb->transfer_buffer_length'
sudo perf probe -k $VMLINUX -s $SOURCE -m $FTDI --add 'ftdi_process_packet len'
sudo perf probe -k $VMLINUX -s $SOURCE --add 'tty_insert_flip_string_fixed_flag flag size'
sudo perf probe -k $VMLINUX -s $SOURCE --add 'tty_schedule_flip'
sudo perf probe -k $VMLINUX -s $SOURCE --add 'tty_schedule_flip_ret=tty_schedule_flip:6 %ax'
sudo perf probe -k $VMLINUX -s $SOURCE --add 'flush_to_ldisc'
sudo perf probe -k $VMLINUX -s $SOURCE --add 'receive_buf'
sudo perf probe -k $VMLINUX -s $SOURCE --add 'n_tty_receive_buf2'
sudo perf probe -k $VMLINUX -s $SOURCE --add 'n_tty_receive_buf_common'
sudo perf probe -k $VMLINUX -s $SOURCE --add 'copy_from_read_buf'
sudo perf probe -k $VMLINUX -s $SOURCE --add tty_flush_to_ldisc

sudo perf record \
    -esched:sched_{kthread_stop,kthread_stop_ret,wakeup,wakeup_new,switch,migrate_task,process_free,process_exit,wait_task,process_wait,process_fork,process_exec,stat_wait,stat_sleep,stat_iowait,stat_blocked,stat_runtime,pi_setprio,process_hang} \
    -eirq:irq_handler_{entry,exit} \
    -eirq:softirq_{entry,exit} \
    -e cs \
    -e syscalls:sys_enter_write --filter "fd==7" \
    -e syscalls:sys_enter_read --filter "fd==7" \
    -e syscalls:sys_enter_ppoll \
    -e syscalls:sys_exit_ppoll \
    -e syscalls:sys_exit_read --filter "ret==0 || ret==1" \
    -eprobe:ftdi_process_read_urb \
    -eprobe:ftdi_process_packet \
    -eprobe:tty_insert_flip_string_fixed_flag \
    -eprobe:tty_schedule_flip{,_ret} \
    -eprobe:copy_from_read_buf \
    -eprobe:{flush_to_ldisc,receive_buf{,_1}} \
    -eprobe:n_tty_receive_buf{2,_common} \
    -eprobe:tty_flush_to_ldisc \
    -eworkqueue:workqueue_{queue_work,activate_work,execute_start,execute_end} \
    -Ra sleep 1;

sudo chown user:user perf.data
perf script | ./script_filter.pl > /tmp/`uname -n`.fg
This assumes that kernel debug symbols are available, which is one area where Ubuntu is actually better than Debian: https://wiki.ubuntu.com/Debug%20Symbol%20Packages. Note that the probes look at read() and write(), some USB stuff, some TTY stuff, and some process scheduling stuff. I was adding probes one at a time as I was chasing the failure, and this is what was there at the end; I didn't just decide to probe this set of things from the start. After we set up the dynamic tracepoints with perf probe, we record 1 second of trace data with perf record; this writes a binary perf.data file that can be read by perf report (to get statistics) and perf script (to look at each individual trace).
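As a quick sanity check that the capture contains what we think it does, the standard perf subcommands can be pointed at the resulting perf.data (these particular invocations are just a suggestion, not part of the original workflow):

$ perf evlist -i perf.data | head            # which tracepoints were actually recorded
$ perf report -i perf.data --stdio | head    # per-event sample counts
$ perf script -i perf.data | head            # the raw trace that script_filter.pl consumes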
Once we have a trace, a visualization script is used to convert it into something that feedgnuplot can visualize. Modulo small visualization tweaks, script_filter.pl looks like this:
#!/usr/bin/perl
use warnings;
use feature ':5.10';

$i = 0;
my $machine = `uname -n`;
chomp $machine;

# emit a self-plotting feedgnuplot data file
say "#!/usr/bin/feedgnuplot --dom --datai --auto --with 'points ps 2' --style cpu 'with labels' --rangesize cpu 2 --set 'key Left'";

my $arg = $ARGV[0];
$print_orig_if = qr/$arg/ if defined $arg;

while(<STDIN>)
{
    chomp;
    next unless /^ /;   # only look at the indented perf-script event lines

    my @F = split;
    ($proc,$cpu,$t,$id, $arg) = @F[0,2,3,4,5];

    $cpu =~ s/\[(.*?)\]/$1/g;
    $cpu = 0 + $cpu;

    # strip the tracepoint namespace prefixes so the event ids are short
    $id =~ s/probe//;
    $id =~ s/syscalls://;
    $id =~ s/workqueue://;
    $id =~ s/irq://;
    $id =~ s/sched://;
    $id =~ s/://g;
    $t  =~ s/://g;

    next if $id =~ /^sys_/ && $proc ne 'frobnicator';

    # id mangling
    if ( $id eq 'sys_exit_read' )
    {
        $id .= ";ret=$arg";
    }

    my ($post) = /$id(.*)/;
    if ($post)
    {
        my @args = $post =~ /[0-9a-zA-Z_]+ *= *[0-9a-zx\.]+/g;
        for my $arg (@args)
        {
            $arg =~ s/[ \.]//g;
            $id .= ";$arg";
        }
    }

    if ( $id eq 'sys_exit_read;ret=0x0' && $proc eq 'frobnicator')
    {
        if ($last_exit_read_0) { $t0 = undef;           }
        else                   { $last_exit_read_0 = 1; }
    }
    elsif ( $id !~ /n_tty_read|r49|tty_put_user|read_/ and $proc eq 'frobnicator')
    {
        $last_exit_read_0 = 0;
    }

    # a write() of 7 bytes to the port starts a new request cycle
    if (/sys_enter_write.*count: [0x]*7$/)
    {
        $t0 = $t;
        $i++;
    }

    if ( defined $t0 )
    {
        # time since the start of the cycle, in ms
        $d = ($t-$t0) / 1e-3;

        if( $id !~ /^ftdi|tty|workqueue_queue_work|workqueue_activate_work|irq/ &&
            $proc !~ /frobnicator|swapper/ )
        {
            $id .= ";proc=$proc";
        }

        say "$i $id $d";

        my $icpu = $i + 0.1 + $cpu/50;
        $cpu .= ";$proc" if $id =~/actual_length=3/;
        say "$icpu cpu $d $cpu";

        say STDERR "processed: '$i(t0=$t0) $id $d', orig line: $_"
            if (defined $print_orig_if && $id =~ $print_orig_if);
    }

    if( $id eq 'sys_exit_read;ret=0x1' && $proc eq 'frobnicator')
    {
        $t0 = undef;
    }
}
This isn't nicely written at all; it's just a one-off made for this investigation, a one-liner that got out of hand. The tracing procedure was to run ./recordall on a machine while it's exhibiting the troublesome behavior, then execute the self-plotting data file it creates: /tmp/MACHINE.fg. Initially, the results look like this:
(raster plot because of a very high point count). The x-axis is each attempt to write to /dev/ttyUSB2 and to get a response from it. The y-axis is time, in ms. The initial write(/dev/ttyUSB2) is at t=0, at the bottom of each stack. Each type of event is plotted with a different symbol. Here I omit the legend because there are way too many different symbols, and too much stuff going on. Each sequence ends when we either received data from /dev/ttyUSB2 or gave up. Giving up means we stop calling read() on the port. Even with this super busy plot we see that most cycles complete after 1-2ms, with a few ending a bit after 3.5ms, as we have seen previously. These cycles are the problem. The plot looks like a solid wall of events primarily because of the hammering read() that's happening: we constantly try to read() the data, and get 0 bytes back. Much more can be inferred if we focus on various sections of this data.
If we make the same plot, but focus only on the FTDI stuff, we get this:

OK. Much less going on. FTDI USB packets are supposed to come in every 1ms, and we can see this: ftdi_process_read_urb runs in response to the USB interrupt, and we see stuff happen every 1ms or so. Each FTDI USB packet is supposed to contain 2 bytes of status, followed by a payload. Thus to get our 1-byte payload into read(), we must get a USB packet that's 3 bytes long. The great news here is that we consistently see 3-byte FTDI packets come in when they're supposed to, even on cycles where the 3.5ms deadline was missed. For instance, the plot above showed that cycles 11, 18 and 19 each missed the 3.5ms deadline, but in this plot we see a 3-byte FTDI packet (blue star, orange square) arrive in time (at about 2.1ms, 1.95ms and 1.9ms respectively). This means the reply from the hardware is arriving in time, so there's no need to check the write() latencies, think about buffering issues, or trace stuff in hardware. Whew. Let's see what success looks like. If we zoom in on the end of the first stack, we get this:
Here I show the events and which CPU they were executing on. Again, it took a while to settle on this particular data to plot. As before, we see that the application repeatedly calls read(), which returns 0: it's asking for data we do not yet have. In the meantime, we see the 3-byte FTDI packet come in (ftdi_process_read_urb with actual_length=3). This results in calls to various functions, eventually making it to the TTY subsystem: tty_insert_flip_string_fixed_flag. We tty_schedule_flip for later processing; this calls workqueue_queue_work and workqueue_activate_work. Then after a bit of time we see this deferred task run: workqueue_execute_start runs with the payload function flush_to_ldisc. This does more stuff, and eventually we see read() return 1. To clarify, the corresponding raw events coming out of perf script are:
swapper      0     [000] 93115.013218: probe:ftdi_process_read_urb: (ffffffffa05868c0) actual_length=3 status=0 transfer_buffer_length=200
swapper      0     [000] 93115.013219: probe:ftdi_process_packet: (ffffffffa05865a0) len=3
swapper      0     [000] 93115.013220: probe:tty_insert_flip_string_fixed_flag: (ffffffff814723a0) flag=0 size=1
swapper      0     [000] 93115.013221: probe:tty_schedule_flip: (ffffffff81472090)
swapper      0     [000] 93115.013222: workqueue:workqueue_queue_work: work struct=0xffff88081e452810 function=flush_to_ldisc workqueue=0xffff88082ec0ac00 req_cpu=256 cpu=0
swapper      0     [000] 93115.013223: workqueue:workqueue_activate_work: work struct 0xffff88081e452810
swapper      0     [000] 93115.013227: probe:tty_schedule_flip_ret: (ffffffff814720bb) arg1=1
kworker/0:1  142   [000] 93115.013234: workqueue:workqueue_execute_start: work struct 0xffff88081e452810: function flush_to_ldisc
kworker/0:1  142   [000] 93115.013235: probe:flush_to_ldisc: (ffffffff81472500)
kworker/0:1  142   [000] 93115.013236: probe:receive_buf: (ffffffff814725b0)
kworker/0:1  142   [000] 93115.013237: probe:n_tty_receive_buf2: (ffffffff8146efa0)
kworker/0:1  142   [000] 93115.013238: probe:n_tty_receive_buf_common: (ffffffff8146edc0)
kworker/0:1  142   [000] 93115.013239: workqueue:workqueue_execute_end: work struct 0xffff88081e452810
frobnicator  24483 [001] 93115.013238: syscalls:sys_enter_read: fd: 0x00000007, buf: 0x01f417c4, count: 0x00007fff
frobnicator  24483 [001] 93115.013240: probe:copy_from_read_buf: (ffffffff8146c1f0)
frobnicator  24483 [001] 93115.013241: probe:copy_from_read_buf: (ffffffff8146c1f0)
frobnicator  24483 [001] 93115.013242: syscalls:sys_exit_read: 0x1
That makes sense. What about a failing case? Previously we saw cycle 11 time out at 3.5ms. We do see a 3-byte USB packet arrive, so what's different about the sequence in cycle 11 that makes things break, compared to cycle 1 where things work? The events look like this:
This generally looks similar to what we saw before. The difference is that while the deferred task is scheduled by workqueue_queue_work and workqueue_activate_work, it does not run in time. This is very consistent in all the traces I've been looking at, and we have found our failure mechanism.
CPU allocation and the scheduler
Now that we know that process scheduling is somehow responsible, let's look at the success and failure plots again, and pay attention to what the scheduler is doing and what the various CPUs are doing.
In the success shown above, we see that on CPU 1 frobnicator is doing its hammering read(), again and again. In the meantime, CPU 0 is servicing an interrupt:

- IRQ 16 fires. It's apparently handling both USB (ehci) and ethernet (eth0).
- Both of these are handled.
- Then the USB traffic is further processed by the soft IRQ handler, which (eventually) calls ftdi_process_read_urb. The events indicate that the swapper process is what CPU 0 was doing before it was interrupted. This is an internal kernel idle process, i.e. the CPU was not doing anything.
- As before, we see the scheduling of the deferred task: workqueue_activate_work.
- When this task is scheduled we see some scheduler events: sched_stat_sleep and sched_wakeup.
- Then the soft IRQ handler exits, and the scheduler runs to decide what CPU 0 should do next; we see more scheduler events: sched_stat_wait and sched_switch.
- The scheduler makes the deferred task run: workqueue_execute_start in the kworker/0:1 kernel process. The 0 in the process name means CPU 0, so the CPU handling the interrupt is the same as the CPU handling the deferred task. I see this consistently, even as I change the CPU servicing the interrupt.
The raw events for CPU 0 corresponding to the successful task switch are:
swapper      0     [000] 93115.013222: workqueue:workqueue_queue_work: work struct=0xffff88081e452810 function=flush_to_ldisc workqueue=0xffff88082ec0ac00 req_cpu=256 cpu=0
swapper      0     [000] 93115.013223: workqueue:workqueue_activate_work: work struct 0xffff88081e452810
swapper      0     [000] 93115.013224: sched:sched_stat_sleep: comm=kworker/0:1 pid=142 delay=865313 [ns]
swapper      0     [000] 93115.013225: sched:sched_wakeup: comm=kworker/0:1 pid=142 prio=120 success=1 target_cpu=000
swapper      0     [000] 93115.013227: probe:tty_schedule_flip_ret: (ffffffff814720bb) arg1=1
swapper      0     [000] 93115.013229: irq:softirq_exit: vec=6 [action=TASKLET]
swapper      0     [000] 93115.013231: sched:sched_stat_wait: comm=kworker/0:1 pid=142 delay=0 [ns]
swapper      0     [000] 93115.013232: sched:sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/0:1 next_pid=142 next_prio=120
kworker/0:1  142   [000] 93115.013234: workqueue:workqueue_execute_start: work struct 0xffff88081e452810: function flush_to_ldisc
sched_stat_wait says delay=0, meaning that the kworker/0:1 task did not wait at all to be scheduled. As soon as it was ready to run, it was allowed to. Great.
OK. What about the failure case? The raw events look like this:
frobnicator  24483 [002] 93115.322925: syscalls:sys_enter_write: fd: 0x00000007, buf: 0x01f15cf4, count: 0x00000007
...
perf         27484 [000] 93115.324019: workqueue:workqueue_queue_work: work struct=0xffff88081e452810 function=flush_to_ldisc workqueue=0xffff88082ec0ac00 req_cpu=256 cpu=0
perf         27484 [000] 93115.324020: workqueue:workqueue_activate_work: work struct 0xffff88081e452810
perf         27484 [000] 93115.324022: sched:sched_stat_sleep: comm=kworker/0:1 pid=142 delay=992381 [ns]
perf         27484 [000] 93115.324023: sched:sched_wakeup: comm=kworker/0:1 pid=142 prio=120 success=1 target_cpu=000
perf         27484 [000] 93115.324024: probe:tty_schedule_flip_ret: (ffffffff814720bb) arg1=1
perf         27484 [000] 93115.324027: irq:softirq_exit: vec=6 [action=TASKLET]

... CPU 0 services the USB interrupt many more times. None of the scheduled tasks run until...

perf         27484 [000] 93115.326706: sched:sched_stat_runtime: comm=perf pid=27484 runtime=2856449 [ns] vruntime=482196990 [ns]
perf         27484 [000] 93115.326708: sched:sched_stat_wait: comm=kworker/0:1 pid=142 delay=2683863 [ns]
perf         27484 [000] 93115.326709: sched:sched_switch: prev_comm=perf prev_pid=27484 prev_prio=120 prev_state=S ==> next_comm=kworker/0:1 next_pid=142 next_prio=120
perf         27484 [000] 93115.326710: cs: ffffffff8176300a [unknown] ([kernel.kallsyms])
kworker/0:1  142   [000] 93115.326712: workqueue:workqueue_execute_start: work struct 0xffff88081e452810: function flush_to_ldisc
That time the hammering read() is on CPU 2, and once again the interrupt is serviced by CPU 0. When it was interrupted, CPU 0 was not idle, however: it was running the perf process (used to gather the data we're looking at). Here, the deferred scheduling happens as in the successful case; we see sched_stat_sleep and sched_wakeup, but the scheduled process doesn't run, at least not immediately.

It is significant that the CPU being interrupted was idle in the successful case: it had nothing else to do other than to service our interrupt. Here it did have something to do: when the IRQ handler exited, the CPU went back to perf. Eventually perf releases the CPU and our task can run. Here the cycle started at t=93115.322925, the reply was received and the task was scheduled at t=93115.324019 (1.1ms later). However, the CPU wasn't released until t=93115.326712, which is 3.8ms after the start, and past our 3.5ms deadline. sched_stat_wait even tells us that the task was sitting there for 2.68ms waiting for the CPU.
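Double-checking that number with the same shell arithmetic as before:

$ echo $((93115.326712 - 93115.322925))    # comes out to roughly 0.0038, i.e. 3.8ms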
This scenario is consistent in all the traces I collected. The interrupt and the deferred tasks are handled by one particular CPU, while userspace applications are free to use any CPU they want. If our USB data happens to be delivered at a time when some user process is using the interrupt-servicing CPU, then contention and delays can result. If we increase the system CPU load, then the machine is spending more time crunching its userspace work, and interrupts are more likely to fire at an inopportune time. I.e. this is all very consistent with the observed behavior.
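If one wanted to provoke the failure on demand (an experiment I'm suggesting here, not something from the original investigation), pinning a pure CPU hog onto the interrupt-servicing CPU should do it:

$ taskset -c 0 sh -c 'while :; do :; done' &   # busy-loop pinned to the CPU that services the USB IRQ
$ sleep 60                                     # let the frobnicator run under contention for a while
$ kill $!                                      # stop the hog; /var/log/frobnicator.log should show a burst of errors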
This feels like a fairly run-of-the-mill issue that shouldn't have been exposed by a kernel upgrade. Why did this work on the older Linux 3.5 boxes at all? It appears that the machines the client is running Linux 3.13 on service the USB interrupt on CPU 0, and this CPU may also be used by any user process. For whatever reason, when these machines boot into Linux 3.5 instead, the USB interrupt is serviced on CPU 12, and in practice this core is only used by the frobnicator process. This mapping isn't enforced by any setting anywhere, but it is what I'm observing. Perhaps there's some heuristic to put user processes on the same CPU core that services most of the interrupts delivering that process's data. While waiting for data, the frobnicator process spends all its time calling read(), which in turn goes directly into the kernel. This hammering read() likely doesn't take priority over the workqueue in the kernel, so our data can still be delivered in a timely fashion. So it appears that on the older kernel there was some sort of de-facto CPU segregation going on that allowed the workqueue tasks to run on an uncontested CPU. I.e. we were getting lucky.
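Incidentally, observing this mapping doesn't require any tracing; the standard /proc interfaces show where an IRQ is allowed to go and where it actually fires (IRQ 16 being the number from the box above):

$ cat /proc/irq/16/smp_affinity_list    # the CPUs this IRQ may be delivered to
$ grep '^ *16:' /proc/interrupts        # per-CPU counts show which CPU actually services it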
Mitigation
We have a CPU contention issue. Potential fixes:
- Make sure the CPU servicing the interrupt is never busy when an interrupt fires
- Set up process priorities so that the workqueue is prioritized above other user processes
Segregation
The older kernel effectively implements the first option above: the CPU servicing the interrupt is never doing anything intensive, and can turn our data around quickly. We can mimic this by using the taskset command to explicitly set the CPU affinities of every process on the machine so that they stay off the CPU servicing the USB interrupt. Which CPU that is, is controlled by the /proc/irq/N/smp_affinity_list file, where N is the IRQ number (16 on the box the traces came from). As expected, this resolves the frobnicator performance issues. It is quite a big hammer, however: a whole CPU core becomes unavailable for any other system work. So it's a good validation of the debugging so far, but I want a better solution.
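A rough sketch of that big hammer, assuming an 8-core box and IRQ 16 (both assumptions specific to this machine; per-CPU kernel threads will simply refuse the new affinity, which is fine):

#!/bin/bash
# run as root: keep every existing task off the CPU that services the USB interrupt
IRQCPU=$(cat /proc/irq/16/smp_affinity_list)             # e.g. "0"
OTHERS=$(seq 0 7 | grep -vx "$IRQCPU" | paste -sd, -)    # every CPU except that one
for pid in $(ps -eo pid=); do
    taskset -a -p -c "$OTHERS" "$pid" 2> /dev/null || true   # -a: all threads of the process
done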
Prioritization
As seen before, during a failure we have:

- the CPU is running process x
- the USB interrupt comes in
- the interrupt handler schedules the deferred task
- the interrupt handler exits, the scheduler runs, and the CPU goes back to x
We want the scheduler to make a different choice: instead of going back to the process that was running previously, it should service our workqueue first. This can be done by bumping the priority of the process servicing the workqueue. The tricky thing is that this process is not the frobnicator, but a kworker/a:b process. As seen from the traces, on that box this process was kworker/0:1. Setting a very high realtime priority on that process also makes the problem go away:
$ chrt -p -r 99 $(pgrep kworker/0:1)
Something like this is a much better solution, since it doesn't sacrifice any resources, and simply prioritizes the work that has tighter timing requirements than the other tasks.
This exact command isn't a perfect final solution either, however. I don't like touching the priority of an internal kernel process, since a later kernel change could break the way this works. This change will have the desired effect only if the kworker process does not change. It very well could change if the smp_affinity_list contains multiple CPUs, or if something touches that file. And the kworker doesn't necessarily have to run on the same CPU as the interrupt; it just happens that way, and a later kernel could change it. I don't see any kernel options to control this stuff in a better way, however. Linux 3.9 introduced higher-priority kworkers (kworker/0:1H for instance), but as far as I can tell I'd need to patch the kernel to make the FTDI stack use them, which is an ongoing maintenance burden. So in the end the recommendation was to
- disable the irqbalance daemon, which was setting the smp_affinity_list as the machine was running
- set the smp_affinity_list to an arbitrary, single CPU
- raise the priority of the corresponding kworker process (a rough sketch of these steps follows below)
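The specific values here (IRQ 16, CPU 2, the kworker name) are assumptions for this particular box, not anything universal:

# run as root
service irqbalance stop                     # 1. stop irqbalance from rewriting the affinity (and disable it at boot)
echo 2 > /proc/irq/16/smp_affinity_list     # 2. pin the USB IRQ to a single, arbitrary CPU (CPU 2 here)
for pid in $(pgrep 'kworker/2:'); do        # 3. boost the kworkers bound to that CPU
    chrt -p -r 99 "$pid"
done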
This seems to work. More system testing is required in case irqbalance was actually doing something useful that people weren't aware of (it comes by default on Ubuntu, and it's not obvious whether it's a net positive or not). Any kernel update could break something, and more testing would be required, but that was probably assumed from the start anyway. OK. Done.