Follow-up posts
I'm continuing to look for and plug memory leaks. The previous effort successfully plugged a leak, but there's much more to find here. I found some more stuff, which is documented here. The corresponding bug report is
http://debbugs.gnu.org/cgi/bugreport.cgi?bug=21556
Preliminaries
I'm instrumenting all allocation/deallocation functions, and analyzing the data
with some simple (and not-yet-nice) tools to look for anomalies. I need to
generate backtraces, so I use perf
as the main instrumentation tool. Generally
backtraces are computed from frame pointers, but all -O
optimization levels in
gcc
strip those out, so I build my own emacs with -fno-omit-frame-pointer
to
keep them in place. It is possible to make backtraces with perf
even without
frame pointers (using DWARF debug information), but this results in much larger
trace files. This extra load generally overwhelms my machine, and important data
ends up culled at data-collection time.
I'm using the lucid X11 widgets for my emacs.
I'm on Debian/sid, so debug symbols are generally split off into separate
packages. It appears that perf
has a bug, and needs a patch to find these
properly. I did that and posted on LKML:
http://lkml.iu.edu/hypermail/linux/kernel/1509.0/04006.html
Test configuration
I set up emacs to continually leak memory. Here I do that by running
emacs -Q --daemon while true; do timeout 1 emacsclient -a '' -c; sleep 1; done
This is similar to the setup in the previous post, but I create one client frame at a time instead of 10 at a time. The sequence is 1-second-with-frame followed by 1-second-without-frame; a new frame pops up every 2 seconds. This appears to be sufficient as the memory leaks fairly quickly. Note that there are at least two different leaks here. One of the leaks results from an X error:
http://debbugs.gnu.org/cgi/bugreport.cgi?bug=21509
Here I do not chase that leak, but look for the non-X-error-caused leaks. I sample the memory consumption with this zsh snippet:
while (true) { ps -h -p `pidof emacs` -O rss; sleep 1 } | awk '{print $2;}'
and it looks like this:
This leak is about 90kB/s or 180kB/frame! As this runs, I make a record of all
memory allocation operations. I use perf
to do this:
perf probe --del '*' perf probe -x /lib/x86_64-linux-gnu/libc-2.19.so --add 'malloc=malloc bytes' perf probe -x /lib/x86_64-linux-gnu/libc-2.19.so --add 'malloc_ret=malloc%return $retval' perf probe -x /lib/x86_64-linux-gnu/libc-2.19.so --add 'calloc=calloc elem_size n' perf probe -x /lib/x86_64-linux-gnu/libc-2.19.so --add 'calloc_ret=calloc%return $retval' perf probe -x /lib/x86_64-linux-gnu/libc-2.19.so --add 'realloc=realloc oldmem bytes' perf probe -x /lib/x86_64-linux-gnu/libc-2.19.so --add 'realloc_ret=realloc%return $retval' perf probe -x /lib/x86_64-linux-gnu/libc-2.19.so --add 'aligned_alloc alignment bytes' perf probe -x /lib/x86_64-linux-gnu/libc-2.19.so --add 'aligned_alloc_ret=aligned_alloc%return $retval' perf probe -x /lib/x86_64-linux-gnu/libc-2.19.so --add 'posix_memalign alignment size' perf probe -x /lib/x86_64-linux-gnu/libc-2.19.so --add 'posix_memalign_ret=posix_memalign%return $retval' perf probe -x /lib/x86_64-linux-gnu/libc-2.19.so --add 'free mem' timeout 40 perf record -m512 -r50 -g --call-graph=fp -p `pidof emacs` \ -eprobe_libc:{free,{malloc,calloc,realloc}{,_ret}} \ -eprobe_libc:aligned_alloc{,_1,_ret} \ -eprobe_libc:posix_memalign{,_ret}
So I probe all the memory allocation/deallocation operations and log all
inputs/outputs. This creates a (very large) perf.data
file. Backtraces are
available if frame pointers are available, so a malloc()
called directly from
emacs has a backtrace (because I built with -fno-omit-frame-pointer
). However
malloc()
that goes through something like libgtk
first does not. perf
can
get overloaded and not write all the data, and I make sure to only keep full
data sets.
I perf record
for 40 seconds. Since I make a new frame every 2 seconds, about
20 new clients were created in that time.
I now convert the log file to a human-readable one:
perf script > script
This makes a human-readable, but even larger file.
Analysis
I now run the script
file through a parse_script.pl
tool to follow all
allocations, and report any that have not been freed:
#!/usr/bin/perl use strict; use warnings; use feature 'say'; my $Nbytes_allocated = 0; my %allocated; my ($prev_addr, $prev_ret, $prev_type, $prev_realloc0, $prev_realloc0_addr, $prev_realloc0_bytes); my $allocating; while(<>) { next unless /probe_libc:([^:]+)/; my $type = $1; my $ret = $type =~ /_ret$/; $type =~ s/_ret$//; if ( $ret && !( !$prev_ret && $type eq $prev_type) && !($prev_realloc0 && $prev_type eq 'malloc' && $prev_ret && $type eq 'realloc') ) { die "$type ret, but prev wasn't a corresponding !ret"; } elsif ( !$ret && !$prev_ret && !($prev_realloc0 && $prev_type eq 'realloc' && !$prev_ret && $type eq 'malloc') && $. > 1) { die "$type !ret following another !ret"; } elsif ( $prev_realloc0 && !($type eq 'malloc' || $type eq 'realloc')) { die "realloc(0, N) must be followed by malloc(N)"; } elsif ( !$ret ) { if ($type eq 'malloc' && /bytes=([0-9a-z]+)/) { $allocating = hex $1; if ( $prev_realloc0 && $allocating != $prev_realloc0_bytes ) { die "realloc(0, N) must be followed by malloc(N)"; } } elsif ($type eq 'calloc' && /elem_size=([0-9a-z]+).*n=([0-9a-z]+)/) { $allocating = (hex $1) * (hex $2); } elsif ($type eq 'aligned_alloc' && /bytes=([0-9a-z]+)/) { $allocating = hex $1; } elsif ($type eq 'realloc' && /oldmem=([0-9a-z]+).*bytes=([0-9a-z]+)/) { if ( hex($1) == 0 ) { # realloc(0, xxx) is always mapped to a malloc apparently. I treat # this specially $prev_realloc0 = 1; $prev_realloc0_bytes = hex $2; } else { $allocating = hex $2; $prev_addr = $1; } } elsif ($type eq 'free' && /mem=([0-9a-z]+)/) { if ( hex($1) != 0) # free(0) does nothing { if (!defined $allocated{$1}) { say "Unallocated free at $1. Line $."; } else { $Nbytes_allocated -= $allocated{$1}{bytes}; delete $allocated{$1}; } } $ret = 1; # free has no free-ret so I set that now } else { say "Unknown !ret line: '$_'"; exit; } } elsif ( $ret ) { if ( !/arg1=([0-9a-z]+)/ ) { die "Ret didn't get arg1"; } my $addr = $1; if ( hex($addr) == 0 ) { say "$type returned NULL. Giving up"; exit; } elsif ( $type =~ /^(?:[cm]alloc|aligned_alloc)$/ ) { if (defined $allocated{$addr}) { say "Double alloc at $addr. Line $."; } else { $allocated{$addr}{bytes} = $allocating; $allocated{$addr}{line} = $.; $Nbytes_allocated += $allocating; } if ( $prev_realloc0 && $type eq 'malloc') { $prev_realloc0_addr = $addr; } } elsif ( $type eq 'realloc' ) { if ( $prev_realloc0 ) { if ( $addr ne $prev_realloc0_addr ) { die "realloc(0, N) must be followed by malloc(N); differing addr"; } $prev_realloc0 = undef; $prev_realloc0_addr = undef; $prev_realloc0_bytes = undef; } else { my $prev0 = (hex($prev_addr) == 0); if (!$prev0 && !defined $allocated{$prev_addr}) { say "realloc not alloced at $prev_addr. Line $."; $prev0 = 1; } if ($addr ne $prev_addr && defined $allocated{$addr}) { say "Double realloc at $addr. Line $."; } if ( !$prev0 ) { $Nbytes_allocated -= $allocated{$prev_addr}{bytes}; delete $allocated{$prev_addr}; } $allocated{$addr}{bytes} = $allocating; $allocated{$addr}{line} = $.; $Nbytes_allocated += $allocating; } } else { say "Unknown ret line: '$_'"; exit; } $allocating = undef; } $prev_type = $type; $prev_ret = $ret; } $Nbytes_allocated /= 1e6; say "Total allocated: $Nbytes_allocated MB"; say ''; for my $addr ( sort { $allocated{$a}{line} <=> $allocated{$b}{line}} keys %allocated ) { my ($bytes,$line) = ($allocated{$addr}{bytes}, $allocated{$addr}{line}); say "Leaked " . sprintf('%5d', $bytes) . " bytes at line $line ($addr)"; }
Allocations reported as leaky by this tool aren't necessarily leaky; I sampled for a finite amount of time, and maybe the memory hasn't been freed yet. However anything that has been allocated near the beginning of the log file, and hasn't been freed by the end is potentially a leak, and should be investigated. I make a plot of leak size vs line number:
<leaks awk '$1=="Leaked" {print $6,$2}' | feedgnuplot --domain --points --xlabel 'script line number' --ylabel 'size leaked'
and the result looks like this:
Note that it looks like we leak exactly 60984 bytes (can see that when zoom in) at regular intervals, about 20 times. There are 20 frames created in this log, so this is a strong indication that whatever happens with those allocations happens every time we create (or destroy) a client frame. Looking at the backtrace for those allocations, we see that they're all in
b62c XftFontOpenInfo (/usr/lib/x86_64-linux-gnu/libXft.so.2.3.2)
So we're leaking a font! Those are large, and plugging this leak would be very good. This was called through a library so no backtrace was available. I trace this again looking just at these allocations, with full DWARF backtrace tracking:
perf probe -x /usr/lib/x86_64-linux-gnu/libXft.so.2 --add XftFontOpenInfo timeout 40 perf record -m512 -r50 -g --call-graph=dwarf -p `pidof emacs` -eprobe_libc:malloc --filter bytes==60984
Backtrace:
emacs-snapshot- 17249 [001] 681364.216285: probe_libc:malloc: (7fbff4f01020) bytes=0xee38 7fbff4f01020 malloc (/lib/x86_64-linux-gnu/libc-2.19.so) 7fbff9b9d62c XftFontOpenInfo (/usr/lib/x86_64-linux-gnu/libXft.so.2.3.2) 7fbff9b9e4da XftFontOpenPattern (/usr/lib/x86_64-linux-gnu/libXft.so.2.3.2) 5c46b5 xftfont_open (/usr/bin/emacs-snapshot-lucid) 5767e8 font_open_entity (/usr/bin/emacs-snapshot-lucid) 576e5c font_load_for_lface (/usr/bin/emacs-snapshot-lucid) 57703a font_open_by_spec (/usr/bin/emacs-snapshot-lucid) 577095 font_open_by_name (/usr/bin/emacs-snapshot-lucid) 4d0390 x_default_font_parameter (/usr/bin/emacs-snapshot-lucid) 4d7f98 Fx_create_frame (/usr/bin/emacs-snapshot-lucid) 55f933 Ffuncall (/usr/bin/emacs-snapshot-lucid) 595823 exec_byte_code (/usr/bin/emacs-snapshot-lucid) 55f350 funcall_lambda (/usr/bin/emacs-snapshot-lucid) 55f72b Ffuncall (/usr/bin/emacs-snapshot-lucid) 595823 exec_byte_code (/usr/bin/emacs-snapshot-lucid) 55f72b Ffuncall (/usr/bin/emacs-snapshot-lucid) 560d13 Fapply (/usr/bin/emacs-snapshot-lucid) 55f831 Ffuncall (/usr/bin/emacs-snapshot-lucid) 595823 exec_byte_code (/usr/bin/emacs-snapshot-lucid) 55f72b Ffuncall (/usr/bin/emacs-snapshot-lucid) 595823 exec_byte_code (/usr/bin/emacs-snapshot-lucid) 55f72b Ffuncall (/usr/bin/emacs-snapshot-lucid) 595823 exec_byte_code (/usr/bin/emacs-snapshot-lucid) 55f72b Ffuncall (/usr/bin/emacs-snapshot-lucid) 595823 exec_byte_code (/usr/bin/emacs-snapshot-lucid) 55f72b Ffuncall (/usr/bin/emacs-snapshot-lucid) 595823 exec_byte_code (/usr/bin/emacs-snapshot-lucid) 55f72b Ffuncall (/usr/bin/emacs-snapshot-lucid) 560bb0 Fapply (/usr/bin/emacs-snapshot-lucid) 560d8a apply1 (/usr/bin/emacs-snapshot-lucid) 55df4a internal_condition_case_1 (/usr/bin/emacs-snapshot-lucid) 5992e0 read_process_output (/usr/bin/emacs-snapshot-lucid)
So this is the leaky allocation path. What's the "normal" free
path that's not
happening? At this point I've played with this enough to notice that these
issues only come up for the first frame. If I create a frame, leave it open, and
then use my loop to create/destroy the second frame, then this leak does not
occur. Probing that situation I know what the proper free
path is:
7fbd19a85660 free (/lib/x86_64-linux-gnu/libc-2.19.so) 7fbd1e730af4 XftFontManageMemory (/usr/lib/x86_64-linux-gnu/libXft.so.2.3.2) 7fbd1e72cab7 _XftCloseDisplay (/usr/lib/x86_64-linux-gnu/libXft.so.2.3.2) 7fbd1eb64e22 XCloseDisplay (/usr/lib/x86_64-linux-gnu/libX11.so.6.3.0) 7fbd1f4dd3d5 [unknown] (/usr/lib/x86_64-linux-gnu/libXt.so.6.0.0) 7fbd1f4ddffa XtCloseDisplay (/usr/lib/x86_64-linux-gnu/libXt.so.6.0.0) 4c44ce x_delete_terminal (/usr/bin/emacs-snapshot-lucid) 4b604b Fdelete_terminal (/usr/bin/emacs-snapshot-lucid) 423397 delete_frame (/usr/bin/emacs-snapshot-lucid) 55f928 Ffuncall (/usr/bin/emacs-snapshot-lucid) 595823 exec_byte_code (/usr/bin/emacs-snapshot-lucid) 55f72b Ffuncall (/usr/bin/emacs-snapshot-lucid) 595823 exec_byte_code (/usr/bin/emacs-snapshot-lucid) 55f72b Ffuncall (/usr/bin/emacs-snapshot-lucid) 560bb0 Fapply (/usr/bin/emacs-snapshot-lucid) 560d8a apply1 (/usr/bin/emacs-snapshot-lucid) 55df4a internal_condition_case_1 (/usr/bin/emacs-snapshot-lucid) 5995cf exec_sentinel (/usr/bin/emacs-snapshot-lucid) 59b1d9 status_notify (/usr/bin/emacs-snapshot-lucid) 5a1cbf wait_reading_process_output (/usr/bin/emacs-snapshot-lucid)
Fine. Looking at the Xft sources, it looks like actual allocation of fonts
happens in XftFontOpenInfo()
, past where we check for the font being
already-open. The actual free()
happens in XftFontDestroy()
. I put a probe
in each of these two places, then create/destroy a single client frame, and I
see this:
$ perf probe -x /usr/lib/x86_64-linux-gnu/libXft.so --add 'XftFontOpenInfo_alloc=XftFontOpenInfo:121' $ perf probe -x /usr/lib/x86_64-linux-gnu/libXft.so --add XftFontDestroy $ sudo perf record -i -p `pidof emacs` -eprobe_libXft:XftFont{OpenInfo_alloc,Destroy} [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.097 MB perf.data (7 samples) ] $ perf script emacs-snapshot- 16337 [000] 12141.927687: probe_libXft:XftFontOpenInfo_alloc: (7f7906641620) emacs-snapshot- 16337 [000] 12141.961706: probe_libXft:XftFontOpenInfo_alloc: (7f7906641620) emacs-snapshot- 16337 [000] 12141.968348: probe_libXft:XftFontOpenInfo_alloc: (7f7906641620) emacs-snapshot- 16337 [000] 12142.010935: probe_libXft:XftFontOpenInfo_alloc: (7f7906641620) emacs-snapshot- 16337 [001] 12142.531279: probe_libXft:XftFontDestroy: (7f7906640e40) emacs-snapshot- 16337 [001] 12142.531471: probe_libXft:XftFontDestroy: (7f7906640e40) emacs-snapshot- 16337 [001] 12142.531553: probe_libXft:XftFontDestroy: (7f7906640e40)
Aha. So with each frame we allocate 4 fonts, but destroy only 3. Looking at the
Xft sources, it looks like there are two different reference counters:
info->num_unref_fonts
and font->ref
. When a window is closed,
XftFontManageMemory()
is called, and fonts are killed, based on
info->num_unref_fonts
. However, this is updated in XftFontClose()
based on
font->ref
. So XftFontClose()
must be called by emacs and we can't assume
that a window closing will do that for us.
OK. Good to know. It looks like all the free
calls are all coming from
delete_frame
, which clears out everything in the font cache. How does this
font cache work? Presumably the leaky font is somehow not being added to the
cache, or somehow not being read from there properly. Let me look at the cache.
Most of the allocations come from font_load_for_lface
. I use gdb
to place a
breakpoint, and to look at the cache each time. When I create a frame, I should
see a new cache entry with each allocated font. The emacs/src/.gdbinit
gdb
script has some useful macros, so in font_load_for_lface
I can look at the
font cache with
pp f->output_data.x->display_info->name_list_element
The output of this goes to stderr, which the daemon helpfully redirects to
/dev/null
. I can still see it with sysdig
however. It has a buffering
problem that I get around with using unbuffer
:
$ sudo unbuffer sysdig -s 60000 proc.pid=`pidof emacs` -c stderr | perl -ne 's/\r/\n/g; s/\n$//; print STDERR $_' (":0.0" (x 1) (xft 1)) (":0.0" (x 1) (xft 1 (#<font-spec xft nil Monospace nil iso8859-1 nil nil nil nil nil nil nil ((:name . "monospace-10"))> . [#<font-entity xft unknown DejaVu\ Sans\ Mono nil iso10646-1 bold oblique normal 0 nil 100 0 ((:font-entity "/usr/share/fonts/truetype/dejavu/DejaVuSansMono-BoldOblique.ttf" . 0) (:name . "monospace-10"))> #<font-entity xft unknown DejaVu\ Sans\ Mono nil iso10646-1 bold normal normal 0 nil 100 0 ((:font-entity "/usr/share/fonts/truetype/dejavu/DejaVuSansMono-Bold.ttf" . 0) (:name . "monospace-10"))> #<font-entity xft unknown DejaVu\ Sans\ Mono nil iso10646-1 normal normal normal 0 nil 100 0 ((:font-entity "/usr/share/fonts/truetype/dejavu/DejaVuSansMono.ttf" . 0) (:name . "monospace-10"))> #<font-entity xft unknown DejaVu\ Sans\ Mono nil iso10646-1 normal oblique normal 0 nil 100 0 ((:font-entity "/usr/share/fonts/truetype/dejavu/DejaVuSansMono-Oblique.ttf" . 0) (:name . "monospace-10"))>]))) (":0.0" (x 1) (xft 1)) (":0.0" (x 1) (xft 1 (#<font-spec xft unknown DejaVu\ Sans\ Mono nil iso8859-1 nil nil nil nil nil nil nil nil> . [#<font-entity xft unknown DejaVu\ Sans\ Mono nil iso10646-1 bold oblique normal 0 nil 100 0 ((:font-entity "/usr/share/fonts/truetype/dejavu/DejaVuSansMono-BoldOblique.ttf" . 0))> #<font-entity xft unknown DejaVu\ Sans\ Mono nil iso10646-1 bold normal normal 0 nil 100 0 ((:font-entity "/usr/share/fonts/truetype/dejavu/DejaVuSansMono-Bold.ttf" . 0))> #<font-entity xft unknown DejaVu\ Sans\ Mono nil iso10646-1 normal normal normal 0 nil 100 0 ((:font-entity "/usr/share/fonts/truetype/dejavu/DejaVuSansMono.ttf" . 0))> #<font-entity xft unknown DejaVu\ Sans\ Mono nil iso10646-1 normal oblique normal 0 nil 100 0 ((:font-entity "/usr/share/fonts/truetype/dejavu/DejaVuSansMono-Oblique.ttf" . 0))>])))
Aha! So I created a frame, 4 fonts were opened as usual, but for some reason between the 2nd and 3rd font open, the cache was cleared! Since the cache stores the fonts that eventually get deallocated, these fonts end up leaking! Where does this happen? Placing breakpoints into the code where it looks like the clear could be happening doesn't appear to find the culprit, and I want a watchpoint so that gdb can tell me when the cache is touched improperly. I need a watchpoint on an emacs lisp object, which isn't something you can do explicitly, but I can semi-manually find the raw pointer, and set a watchpoint on it. gdb session:
[ breakpoint on font_load_for_lface as before] [ opened frame. breakpoint hit the first time] (gdb) p f->output_data.x->display_info->name_list_element $228 = 24121555 (gdb) xlist $229 = 0x10cc824 Lisp_String $230 = (struct Lisp_String *) 0x10cc820 ":0.0" --- $231 = 0x17011d3 Lisp_Cons $232 = (struct Lisp_Cons *) 0x17011d0 { car = 0xc4e0, u = { cdr = 0x17011c3, chain = 0x17011c3 } } --- $233 = 0x17011a3 Lisp_Cons $234 = (struct Lisp_Cons *) 0x17011a0 { car = 0xc750, u = { cdr = 0x1701193, chain = 0x1701193 } } --- nil (gdb) p $->u.cdr $235 = 24121747 (gdb) xpr Lisp_Cons $236 = (struct Lisp_Cons *) 0x1701190 { car = 0x6, u = { cdr = 0x0, chain = 0x0 } } (gdb) watch ((struct Lisp_Cons *) 0x1701190)->u.cdr Hardware watchpoint 24: ((struct Lisp_Cons *) 0x1701190)->u.cdr [ disable font_load_for_lface breakpoint ] (gdb) disa 6 (gdb) commands 24 Type commands for breakpoint(s) 24, one per line. End with a line saying just "end". >bt 5 >c >end (gdb) c Continuing. Hardware watchpoint 24: ((struct Lisp_Cons *) 0x1701190)->u.cdr Old value = 0 New value = 24122275 XSETCDR (c=24121747, n=24122275) at lisp.h:1194 #0 0x000000000054cdad in XSETCDR (c=24121747, n=24122275) at lisp.h:1194 #1 0x000000000060eb2a in font_list_entities (f=0x19b7d20, spec=21168293) at font.c:2786 #2 0x00000000006107f7 in font_find_for_lface (f=0x19b7d20, attrs=0x7ffcdb167570, spec=21052973, c=-1) at font.c:3261 #3 0x0000000000610b80 in font_load_for_lface (f=0x19b7d20, attrs=0x7ffcdb167570, spec=21052973) at font.c:3334 #4 0x0000000000610f72 in font_open_by_spec (f=0x19b7d20, spec=21052973) at font.c:3428 #5 0x0000000000610fe5 in font_open_by_name (f=0x19b7d20, name=17336596) at font.c:3439 Lisp Backtrace: "x-create-frame" (0xdb1678a0) "x-create-frame-with-faces" (0xdb167dd8) 0x12b9d80 PVEC_COMPILED "apply" (0xdb168450) "frame-creation-function" (0xdb1689f0) "make-frame" (0xdb168f40) "make-frame-on-display" (0xdb1694a8) "server-create-window-system-frame" (0xdb169a78) "server-process-filter" (0xdb169ff8) Hardware watchpoint 24: ((struct Lisp_Cons *) 0x1701190)->u.cdr Old value = 24122275 New value = 0 0x00000000005d2172 in compact_font_cache_entry (entry=24121763) at alloc.c:5313 #0 0x00000000005d2172 in compact_font_cache_entry (entry=24121763) at alloc.c:5313 #1 0x00000000005d221b in compact_font_caches () at alloc.c:5339 #2 0x00000000005d2742 in garbage_collect_1 (end=0x7ffcdb166830) at alloc.c:5515 #3 0x00000000005d2e1d in Fgarbage_collect () at alloc.c:5720 #4 0x000000000054eb21 in maybe_gc () at lisp.h:4515 #5 0x00000000005f638c in Ffuncall (nargs=3, args=0x7ffcdb166988) at eval.c:2584 Lisp Backtrace: "Automatic GC" (0x0) "map-keymap" (0xdb166990) "keymap-canonicalize" (0xdb166f38) "x-create-frame" (0xdb1678a0) "x-create-frame-with-faces" (0xdb167dd8) 0x12b9d80 PVEC_COMPILED "apply" (0xdb168450) "frame-creation-function" (0xdb1689f0) "make-frame" (0xdb168f40) "make-frame-on-display" (0xdb1694a8) "server-create-window-system-frame" (0xdb169a78) "server-process-filter" (0xdb169ff8) Hardware watchpoint 24: ((struct Lisp_Cons *) 0x1701190)->u.cdr
Bam! The font cache was unexpectedly (to me) touched in
compact_font_cache_entry
. Looking at the font cache before and after this call
makes it clear that this is the call that's erroneously clearing the font cache.
Apparently it's part of the mark/sweep GC that runs in the sweep phase. For
whatever reason the leaked font isn't marked, so the GC is clearing it out from
the cache. This seems wrong, since simply taking it out of the cache will result
in it not being freed later, but that's irrelevant: this font is not unused, so
it should have been marked. Looking at the font marking business, the issue
becomes clear: each font entity object contains a list of fonts. In
compact_font_cache_entry
we look for marks on the entity and not in any
individual font. Apparently it is possible for an unmarked entity to contain a
marked font, and this case wasn't being checked. And adding this check to
compact_font_cache_entry
makes this leak go away entirely.
A plot showing the memory usage whilt creating/destroying the first frame before and after the patch, shifted to start at 0:
The steps are bug http://debbugs.gnu.org/cgi/bugreport.cgi?bug=21509, which now dominates the memory leaking during this particular test. Between the steps the memory consumption is now more or less flat, which indicates that this patch is effective. Furthermore, during my actual use of emacs (as opposed to the testing use here) I always have multiple frames open, so in real life bug 21509 does not apply for me at all. There could be other smaller leaks that happen here, but until 21509 is fixed, they're difficult to see. Done!
Update
I discovered that bug 21509 can be worked around by turning off the scroll bar. Here's the same memory usage plot as above, but without bug 21509 confounding things:
Looks like the patch drops the leak from 57kB/s (114 kB/frame) to 12kB/s (24kB/frame). Still work to be done, but this is very good.