More demos of vnlog and feedgnuplot usage! This is pretty pointless, but should be a decent demo of the tools at least. This is a demo, not documentation; so for usage details consult the normal docs.
Each Wednesday night I join a group bike ride. This is an organized affair, and each week an email precedes the ride, very roughly describing the route. The two organizers alternate leading the ride each week, and consequently the emails alternate also. I was getting the feeling that some of the announcements show up in my mailbox more punctually than others, and after a recent 20-minutes-before-the-ride email, I decided this just had to be quantified.
The emails all go to a google-group email. The google-groups people are a wheel-reinventing bunch, so talking to the archive can't be done with normal tools (NNTP? mbox files? No?). A brief search revealed somebody's home-grown tool to programmatically grab the archive:
https://github.com/icy/google-group-crawler.git
The docs look funny, but are actually correct: you really do run the script to download stuff and generate another script; and then run that script to download the rest of the stuff.
Anyway, I used that tool to grab all the emails that are available. Then I wrote a quick/dirty script to parse out the data I care about and dump everything into a vnlog:
#!/usr/bin/perl
use strict;
use warnings;

use feature ':5.10';

my %daysofweek = ('Mon' => 0, 'Tue' => 1, 'Wed' => 2, 'Thu' => 3,
                  'Fri' => 4, 'Sat' => 5, 'Sun' => 6);
my %months     = ('Jan' => 1, 'Feb' => 2,  'Mar' => 3,  'Apr' => 4,
                  'May' => 5, 'Jun' => 6,  'Jul' => 7,  'Aug' => 8,
                  'Sep' => 9, 'Oct' => 10, 'Nov' => 11, 'Dec' => 12);

say '# path ridenum who whenwedh date wordcount subject';

for my $path (<mbox/m.*>) {
    my ($ridenum,$who,$date,$whenwedh,$subject);

    my $wordcount = 0;
    my $inbody    = undef;

    open FD, '<', $path;
    while(<FD>) {
        if( !$inbody && /^From: *(.*?)\s*$/ ) {
            $who = $1;
            if( $who =~ /sean/i) {
                $who = 'sean';
            }
            elsif($who =~ /nathan/i) {
                $who = 'nathan';
            }
            else {
                $who = 'other';
            }
        }
        if( !$inbody && /^Subject: \s* (?:=\?UTF-8\?Q\?)? (.*?) \s* $/x ) {
            $subject = $1;
            ($ridenum) = $subject =~ /^(?: \# | (?:=\?ISO-8859-1\?Q\?=23) ) ([0-9]+)/x;
            $subject =~ s/[\s#]//g;
        }
        if( !$inbody && /^Date: *(.*?)\s*$/ ) {
            $date = $1;

            my ($zone) = $date =~ / (\(.+\) | -0700 | -0800) /x;
            if( !defined $zone) {
                die "No timezone in: '$date'";
            }
            if( $zone !~ /PST|PDT|-0700|-0800/) {
                die "Unexpected timezone: '$zone'";
            }

            my ($Dayofweek,$D,$M,$Y,$h,$m,$s) = $date =~ /^(...),? +(\d+) +([a-zA-Z]+) +(20\d\d) +(\d\d):(\d\d):(\d\d)/;
            if( !(defined $Dayofweek && defined $h && defined $m && defined $s) ) {
                die "Unparseable date '$date'";
            }

            my $dayofweek = $daysofweek{$Dayofweek} // die "Unparseable day-of-week '$Dayofweek'";

            my $t     = $dayofweek*24 + $h + ($m + $s/60)/60;
            my $twed0 = 2*24; # start of wed

            $M = $months{$M} // die "Unknown month '$M'. Line: '$_'";
            $date = sprintf('%04d%02d%02d', $Y,$M,$D);

            $whenwedh = $t - $twed0;
        }

        if( !$inbody && /^[\r\n]*$/ ) {
            $inbody = 1;
        }
        if( $inbody ) {
            if( /------=_Part/ || /Content-Type:/) {
                last if $wordcount > 0;
                $inbody = undef;
                next;
            }
            my @words = /(\w+)/g;
            $wordcount += @words;
        }
    }
    close FD;

    $who      //= '-';
    $subject  //= '-';
    $ridenum  //= '-';
    $date     //= '-';
    $whenwedh //= '-';

    say "$path $ridenum $who $whenwedh $date $wordcount $subject";
}
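The only mildly interesting arithmetic in that script is the whenwedh column: the number of hours between the start of Wednesday and the moment the email was sent, so Tuesday emails come out negative. Here's a rough Python sketch of just that step. The function name and the sample header strings are made up for illustration, and unlike the Perl it doesn't check the timezone.

```python
import re

# Day-of-week offsets, matching the Perl script: Mon=0 ... Sun=6
DAYS = {"Mon": 0, "Tue": 1, "Wed": 2, "Thu": 3, "Fri": 4, "Sat": 5, "Sun": 6}

# Same shape as the Perl regex: day name, day, month, year, then hh:mm:ss
DATE_RE = re.compile(r"^(\w{3}),? +\d+ +[A-Za-z]+ +20\d\d +(\d\d):(\d\d):(\d\d)")

def hours_since_wed(date_header):
    """Hours from the start of Wednesday to the time in a Date: header value."""
    m = DATE_RE.match(date_header)
    if m is None:
        raise ValueError(f"Unparseable date {date_header!r}")
    day = m.group(1)
    h, mi, s = int(m.group(2)), int(m.group(3)), int(m.group(4))
    t = DAYS[day] * 24 + h + (mi + s / 60) / 60   # hours since start of Monday
    return t - 2 * 24                             # Wednesday is day 2

# Hypothetical header values, not taken from the real archive
print(hours_since_wed("Wed, 3 Sep 2014 21:00:00 -0700"))
print(hours_since_wed("Tue, 2 Sep 2014 12:30:00 -0700"))
```

An email sent right at the 21:00 ride start lands at 21, and anything from Tuesday lands somewhere between -24 and 0.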
The script isn't important, and the resulting data is here. Now that I have a log on disk, I can do stuff with it. The first few lines of the log look like this:
dima@scrawny:~/projects/passagemining/google-group-crawler/the-passage-announcements$ < rides.vnl head
# path ridenum who whenwedh date wordcount subject
mbox/m.-EF1u5bbw5A.SywitKQ3y1sJ 265 sean 1.40722222222222 20140903 190 265-Coasting
mbox/m.-JdiiTIvyYs.Jgy_rCiwAGAJ 151 sean 18.6441666666667 20120606 199 151-FinalsWeek
mbox/m.-l6z9-1WC78.SgP3ytLsDAAJ 312 nathan 19.5394444444444 20150812 189 312-SpaceFilling
mbox/m.-vfVuoUxJ0w.FwpRRWC7EgAJ 367 nathan 18.1766666666667 20160831 164 367-Dislocation
mbox/m.-YHTEvmbIyU.HHWjbs_xpesJ 110 sean 10.9108333333333 20110810 407 110-SouslesParcs,laPoubelle
mbox/m.0__GMaUD_O8.Pjupq0AwBAAJ 404 sean 13.5255555555556 20170524 560 404-Bumped
mbox/m.0CT9ybx3uIU.sdZGwo8rSQUJ 53 sean -23.1402777777778 20100629 223 53WeInventedtheRemix
mbox/m.0FtQxCkxVHA.AjhGJ7mgAwAJ 413 nathan 20.4155555555556 20170726 178 413-GradientAssent
mbox/m.0haCNC_N2fY.bJ-93LQSFQAJ 337 nathan 57.3708333333333 20160205 479 337-TheCronutRide
I can align the columns to make it more human-readable:
dima@scrawny:~/projects/passagemining/google-group-crawler/the-passage-announcements$ < rides.vnl head | vnl-align
# path                          ridenum who    whenwedh          date     wordcount subject
mbox/m.-EF1u5bbw5A.SywitKQ3y1sJ 265     sean   1.40722222222222  20140903 190       265-Coasting
mbox/m.-JdiiTIvyYs.Jgy_rCiwAGAJ 151     sean   18.6441666666667  20120606 199       151-FinalsWeek
mbox/m.-l6z9-1WC78.SgP3ytLsDAAJ 312     nathan 19.5394444444444  20150812 189       312-SpaceFilling
mbox/m.-vfVuoUxJ0w.FwpRRWC7EgAJ 367     nathan 18.1766666666667  20160831 164       367-Dislocation
mbox/m.-YHTEvmbIyU.HHWjbs_xpesJ 110     sean   10.9108333333333  20110810 407       110-SouslesParcs,laPoubelle
mbox/m.0__GMaUD_O8.Pjupq0AwBAAJ 404     sean   13.5255555555556  20170524 560       404-Bumped
mbox/m.0CT9ybx3uIU.sdZGwo8rSQUJ 53      sean   -23.1402777777778 20100629 223       53WeInventedtheRemix
mbox/m.0FtQxCkxVHA.AjhGJ7mgAwAJ 413     nathan 20.4155555555556  20170726 178       413-GradientAssent
mbox/m.0haCNC_N2fY.bJ-93LQSFQAJ 337     nathan 57.3708333333333  20160205 479       337-TheCronutRide
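The format itself is about as simple as it looks: whitespace-separated columns, a legend line starting with # that names them, and - for a missing value. If you ever want to pull such a log into something other than the vnl-* tools, a simplified reader is a few lines of Python. This is only a sketch: read_vnlog is a made-up name, and it ignores corner cases such as trailing comments on data lines.

```python
def read_vnlog(lines):
    """Parse vnlog text into a list of dicts; "-" fields become None."""
    fields = None
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("##") or line.startswith("#!"):
            continue                      # blank line or hard comment
        if line.startswith("#"):
            if fields is None:
                fields = line[1:].split() # first "#" line is the legend
            continue                      # later "#" lines are comments
        if fields is None:
            raise ValueError("data line seen before the legend")
        values = line.split()
        records.append({k: (None if v == "-" else v)
                        for k, v in zip(fields, values)})
    return records

# Two rows copied from the log in this post
sample = """\
# path ridenum who whenwedh date wordcount subject
mbox/m.-EF1u5bbw5A.SywitKQ3y1sJ 265 sean 1.40722222222222 20140903 190 265-Coasting
mbox/m.2gywN9pxMI4.40UBrDjnAwAJ - nathan 17.6572222222222 20171206 95 Noridetonight;daytimeridethisSaturday!
"""

records = read_vnlog(sample.splitlines())
for r in records:
    print(r["ridenum"], r["who"], r["wordcount"])
```

All values stay strings here; converting ridenum or wordcount to numbers is left to whoever consumes the records.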
If memory serves, we're at around ride 450 right now. Is that right?
$ < rides.vnl vnl-sort -nr -k ridenum | head -n2 | vnl-filter -p ridenum
# ridenum
452
Cool. This command was longer than it needed to be in order to produce nicer output. If I was exploring the dataset, I'd save keystrokes and do this instead:
$ < rides.vnl vnl-sort -nrk ridenum | head
# path ridenum who whenwedh date wordcount subject
mbox/m.7TnUbcShAz8.67KgwBGhAAAJ 452 nathan 20.7694444444444 20180502 175 452-CastingtoType
mbox/m.ej7Oz6sDzgc.bEnN04VEAQAJ 451 sean 0.780833333333334 20180425 258 451-Recovery
mbox/m.LWfydBtpd_s.35SgEJEqAgAJ 450 nathan 67.9608333333333 20180420 659 450-AnotherGreenWorld
mbox/m.3mv-Cm0EzkM.oAm3MkNYCAAJ 449 sean 17.5875 20180411 290 449-DoYouHaveRockNRoll?
mbox/m.AEV4ukSjO5U.IPlUabfEBgAJ 448 nathan 20.6138888888889 20180404 175 448-TheThirdString
mbox/m.bYTM6kgxtJs.5iHcVQKPBAAJ 447 sean 15.8355555555556 20180328 196 447-PassParticiple
mbox/m.tHMqRWp9o_Y.FQ8hFvnqCQAJ 446 nathan 20.5213888888889 20180321 139 446-Chiaroscuro
mbox/m.jr0SBsDBzgk.UHrbCv4VBQAJ 445 sean 15.3280555555556 20180314 111 445-85%
mbox/m.K2Yg_FRXuAo.SyViTwXXAQAJ 444 nathan 19.6180555555556 20180307 171 444-BackintheLoop
OK, how far back does the archive go? I do the same thing as before, but sort in the opposite order to find the earliest rides:
$ < rides.vnl vnl-sort -n -k ridenum | head -n2 | vnl-filter -p ridenum
# ridenum
Nothing. That's odd. Let me look at whole records, and at more than just the first two lines:
$ < rides.vnl vnl-sort -n -k ridenum | head | vnl-align
# path                          ridenum who    whenwedh          date     wordcount subject
mbox/m.2gywN9pxMI4.40UBrDjnAwAJ -       nathan 17.6572222222222  20171206 95        Noridetonight;daytimeridethisSaturday!
mbox/m.49fZsvZac_U.a0CazPinCAAJ -       sean   -34.495           20170320 463       Extraridethisweekend+Passage400save-the-date
mbox/m.5gJd21W24vo.ICDEHrnQJvcJ -       nathan 12.1063888888889  20130619 172       NoPassageRideTonight;GalleryOpeningTomorrowNight
mbox/m.7qEbhBWSN1U.Cx6cxYTECgAJ -       nathan 17.7891666666667  20180418 134       Noridetonight;Passage450onSaturday!
mbox/m.DVssP4Th__4.jXzzu9clZLQJ -       sean   20.9138888888889  20101222 209       TheWrathofTlaloc
mbox/m.E6etBSqEQIc.C35-SkBllHoJ -       sean   50.7575           20131220 292       Noridenextweek;seeyounextyear
mbox/m.GyJ16HiK8Ds.z6yNC4W5SeUJ -       sean   -11.5666666666667 20120529 228       NoRideThisWeek!...AIDS/Lifecycle...ThirdAnniversary
mbox/m.H3QGBvjeTfM.CS-xRn1WDQAJ -       sean   17.0180555555555  20171227 257       Noridetonight;nextride1/6
mbox/m.K2P6D_BGfYU.ve6a_8l6AAAJ -       sean   37.8166666666667  20170223 150       RemainingPassageRouteMapShirtsAvailableforPurchase
Aha. A bunch of emails aren't announcing a ride, but are announcing that there's no ride that week. Let's ignore those:
$ < rides.vnl vnl-filter -p +ridenum | vnl-sort -n -k ridenum | head -n2
# ridenum
52
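What happened with the earlier empty result is that the null rides sorted first: a numeric sort treats a non-numeric field such as - as 0 (GNU sort does, at least), and every real ride number here is larger than that. A small Python sketch of the same effect, using a handful of rows from the log, with None standing in for the null. The numeric_key helper is hypothetical and only mimics that sorting behavior.

```python
# (ridenum, subject) pairs from the log; None stands for the vnlog null "-"
rides = [
    (265, "265-Coasting"),
    (None, "TheWrathofTlaloc"),
    (53, "53WeInventedtheRemix"),
    (None, "Noridetonight;nextride1/6"),
    (151, "151-FinalsWeek"),
]

def numeric_key(ridenum):
    # Mimic a numeric sort: a field with no digits compares as 0
    return 0 if ridenum is None else ridenum

# Plain numeric sort: the null rows float to the top
by_number = sorted(rides, key=lambda r: numeric_key(r[0]))

# Drop the nulls, keeping the sorted order: the earliest real ride comes first
announced = [r for r in by_number if r[0] is not None]

print(by_number[0])
print(announced[0])
```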
Bam. So we have emails going back to ride 52. Good enough. All right. I'm aiming to create a time histogram for Sean's emails and another for Nathan's emails. What about emails that came from neither one? In theory there shouldn't be any of those, but there could be a parsing error, or who knows what.
$ < rides.vnl vnl-filter 'who == "other"'
# path ridenum who whenwedh date wordcount subject
mbox/m.A-I0_i9-YOs.QRX1P99_uiUJ 65 other 65.1413888888889 20100917 330 65-LosAngelesRidesItself+specialscreening
mbox/m.pHpzsjH7H68.O7CP_v6bcEoJ 67 other 16.5663888888889 20101006 50 67Sortition,NotSaturation
OK. Exactly 2 emails out of hundreds. That's not bad, and I'll just ignore those. Out of curiosity, what happened? Is this a parsing error?
$ grep From: $(< rides.vnl vnl-filter 'who == "other"' --eval '{print path}')
mbox/m.A-I0_i9-YOs.QRX1P99_uiUJ:From: The Passage Announcements <the-passage-...@googlegroups.com>
mbox/m.pHpzsjH7H68.O7CP_v6bcEoJ:From: The Passage Announcements <the-passage-...@googlegroups.com>
So on rides 65 and 67 "The Passage Announcements" emailed themselves. Oops. Since the ride leaders alternate, I can infer who actually sent these by looking at the few rides around them:
$ < rides.vnl vnl-filter 'ridenum > 60 && ridenum < 70' -p ridenum,who | vnl-sort -n -k ridenum
# ridenum who
61 sean
62 nathan
63 sean
64 nathan
65 other
66 nathan
67 other
68 nathan
69 sean
That's pretty conclusive: clearly these emails came from Sean. I'm still going to ignore them, though.
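If I did want to fill those in automatically, the alternation makes it easy: rides two apart should share a leader, so a missing sender can be guessed from nearby rides of the same parity. A quick Python sketch, assuming the alternation holds locally. The infer_sender helper and its window size are made up for illustration.

```python
from collections import Counter

# ridenum -> sender, copied from the listing above
senders = {61: "sean", 62: "nathan", 63: "sean", 64: "nathan", 65: "other",
           66: "nathan", 67: "other", 68: "nathan", 69: "sean"}

def infer_sender(ridenum, senders, window=4):
    """Guess a sender by majority vote among nearby same-parity rides."""
    votes = Counter(
        who
        for n, who in senders.items()
        if n != ridenum
        and abs(n - ridenum) <= window
        and (n - ridenum) % 2 == 0      # same parity: same leader if they alternate
        and who != "other"
    )
    return votes.most_common(1)[0][0] if votes else None

for n in (65, 67):
    print(n, infer_sender(n, senders))
```

Both rides come out as Sean's, matching the eyeball verdict.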
The ride is on Wed evening, and the emails generally come in the day or two before then. Does my data set contain any data outside this reasonable range? Hopefully very little, just like the "other" author emails.
$ < rides.vnl vnl-filter --has ridenum -p whenwedh | feedgnuplot --histo 0 --binwidth 1 --xlabel 'Hour (on Wed)' --ylabel 'Email frequency'
The ride starts at 21:00 on Wed, and we see a nice spike immediately before. The smaller cluster prior to that is the emails that go out the night before. There's a tiny number of stragglers going out the previous day (that I'm simply going to ignore). And there're a number of emails going out after Wed. These likely announce an occasional weekend ride that I will also ignore. But let's do check. How many are there?
$ < rides.vnl vnl-filter --has ridenum 'whenwedh > 22' | wc -l
16
Looking at these manually, most are indeed weekend rides, with a small number of actual extra-early announcements for Wed. I can parse the email text more fancily to pull those out, but that's really not worth my time.
OK. I'm now ready for the main thing.
$ < rides.vnl vnl-filter --has ridenum 'who != "other"' -p who,whenwedh |
  feedgnuplot --dataid --autolegend \
              --histo sean,nathan --binwidth 0.5 \
              --style sean   'with boxes fill transparent solid 0.3 border lt -1' \
              --style nathan 'with boxes fill transparent pattern 1 border lt -1' \
              --xmin -12 --xmax 24 \
              --xlabel "Time (hour)" --ylabel 'Email frequency' \
              --set 'xtics ("12\n(Tue)" -12,"16\n(Tue)" -8,"20\n(Tue)" -4,"0\n(Wed)" 0,"4\n(Wed)" 4,"8\n(Wed)" 8,"12\n(Wed)" 12,"16\n(Wed)" 16,"21\n(Wed)" 21,"0\n(Thu)" 24)' \
              --set 'arrow from 21, graph 0 to 21, graph 1 nohead lw 3 lc "red"' \
              --title "Passage email timing distribution"
This looks verbose, but most of the plotting command is there to make things look nice. When analyzing stuff, I'd omit most of that. Anyway, I can now see what I suspected: Nathan is a procrastinator! His emails almost always come in on Wed, usually an hour or two before the deadline. Sean's emails are bimodal: one set comes in on Wed afternoon, and another in the extreme early morning on Wed. Presumably he sleeps in-between.
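For the curious, the binning behind such a per-sender histogram is nothing fancy. Here's a Python sketch that rounds each send time to the nearest multiple of the bin width and counts per sender. That rounding rule is one common convention; whether feedgnuplot bins in exactly this way is an assumption on my part. The rows are a small sample copied from the listings above, so the counts mean nothing statistically.

```python
import math
from collections import Counter, defaultdict

def histogram(values, binwidth):
    """Count values per bin; each bin is labeled by its center."""
    counts = Counter(binwidth * math.floor(x / binwidth + 0.5) for x in values)
    return dict(sorted(counts.items()))

# (who, whenwedh) pairs copied from the listings in this post
rows = [
    ("sean", 1.40722222222222), ("sean", 18.6441666666667),
    ("sean", 10.9108333333333), ("sean", 13.5255555555556),
    ("nathan", 19.5394444444444), ("nathan", 18.1766666666667),
    ("nathan", 20.4155555555556), ("nathan", 57.3708333333333),
    ("nathan", 20.7694444444444), ("nathan", 20.6138888888889),
    ("nathan", 20.5213888888889), ("nathan", 19.6180555555556),
]

# Group by sender, keeping only the plotted range (like --xmin -12 --xmax 24)
times = defaultdict(list)
for who, t in rows:
    if -12 <= t <= 24:
        times[who].append(t)

hist = {who: histogram(ts, 0.5) for who, ts in times.items()}
for who, bins in hist.items():
    print(who, bins)
```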
We have more data, so we can make more pointless plots. For instance, what does the verbosity of the emails look like? Is one sender more verbose than another?
$ < rides.vnl vnl-sort -n -k ridenum |
  vnl-filter 'who != "other"' -p +ridenum,who,wordcount |
  feedgnuplot --lines --domain --dataid --autolegend \
              --xlabel 'Ride number' --ylabel 'Words per email'
$ < rides.vnl vnl-filter 'who != "other"' --has ridenum -p who,wordcount |
  feedgnuplot --dataid --autolegend \
              --histo sean,nathan --binwidth 20 \
              --style sean   'with boxes fill transparent solid 0.3 border lt -1' \
              --style nathan 'with boxes fill transparent pattern 1 border lt -1' \
              --xlabel "Words per email" --ylabel 'frequency' \
              --title "Passage verbosity distribution"
The time series doesn't obviously say anything, but from the histogram, it looks like Sean is a bit more verbose, maybe? What's the average?
$ < rides.vnl vnl-filter --eval 'ridenum != "-" { if(who == "sean")   { Ns++; Ws+=wordcount; }
                                                  if(who == "nathan") { Nn++; Wn+=wordcount; } }
                                 END { print "Mean verbosity sean,nathan: "Ws/Ns, Wn/Nn }'
Mean verbosity sean,nathan: 304.955 250.425
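That awk snippet is just a pair of running sums per sender, with the no-ride emails skipped. The same accumulation in Python looks like this, as a sketch only. The rows are a small sample copied from the listings above, so the means it prints are not the 304.955 and 250.425 computed from the full log.

```python
from collections import defaultdict

# (ridenum, who, wordcount) rows from the listings in this post;
# None stands for the vnlog null "-"
rows = [
    (265, "sean", 190), (151, "sean", 199), (110, "sean", 407),
    (404, "sean", 560), (53, "sean", 223),
    (312, "nathan", 189), (367, "nathan", 164),
    (413, "nathan", 178), (337, "nathan", 479),
    (None, "nathan", 95),   # a "no ride" email; skipped below
]

totals = defaultdict(lambda: [0, 0])   # who -> [email count, word count]
for ridenum, who, wordcount in rows:
    if ridenum is None:
        continue                       # same as the ridenum != "-" condition
    totals[who][0] += 1
    totals[who][1] += wordcount

means = {who: words / n for who, (n, words) in totals.items()}
print(means)
```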
Indeed. Is the verbosity time-dependent? Is anybody getting more or less verbose over the years? The time-series plot above is pretty noisy, so it's not clear. Let's filter it to reduce the noise. We're getting into an area that's too complicated for these tools, and moving to something more substantial at this point would be warranted. But I'll do one more thing with these tools, and then stop. I can implement a half-assed filter by time-shifting the verbosity series, re-joining the shifted series, and computing the mean. I do this separately for the two email authors, and then re-combine the series. I could join these two, but simply catting the two data sets together is sufficient here.
$ < rides.vnl vnl-sort -n -k ridenum | vnl-filter 'who == "nathan"' --has ridenum | vnl-filter -p ridenum,idx=NR,wordcount   > nathanrp0
$ < rides.vnl vnl-sort -n -k ridenum | vnl-filter 'who == "nathan"' --has ridenum | vnl-filter -p ridenum,idx=NR-1,wordcount > nathanrp-1
$ < rides.vnl vnl-sort -n -k ridenum | vnl-filter 'who == "nathan"' --has ridenum | vnl-filter -p ridenum,idx=NR+1,wordcount > nathanrp+1
$ ... same for Sean ...
$ cat <(vnl-join --vnl-suffix2 after --vnl-sort n -j idx \
          <(vnl-join --vnl-suffix2 before --vnl-sort n -j idx nathanrp{0,-1}) \
          nathanrp+1 |
        vnl-filter -p ridenum,who='"nathan"','wordcountfiltered=(wordcount+wordcountbefore+wordcountafter)/3') \
      <(vnl-join --vnl-suffix2 after --vnl-sort n -j idx \
          <(vnl-join --vnl-suffix2 before --vnl-sort n -j idx seanrp{0,-1}) \
          seanrp+1 |
        vnl-filter -p ridenum,who='"sean"','wordcountfiltered=(wordcount+wordcountbefore+wordcountafter)/3') |
  feedgnuplot --lines --domain --dataid --autolegend \
              --xlabel 'Ride number' --ylabel 'Words per email'
Whew. Clearly this was doable, but that's a one-liner that has clearly gotten out of hand, and pushing it further would be unwise. Looking at the data, there isn't any obvious time dependence. But what you can clearly see is the extra verbiage around the round-number rides 100, 200, 300, 350, 400, etc. These were often a special weekend ride, with the email containing lots of extra instructions and such.
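For comparison, the shift-and-join trick amounts to a centered 3-point moving average, which is a couple of lines in a language with real arrays. A Python sketch, with moving_average3 being a made-up name. The sample is Nathan's five most recent rides from the listing near the top, and as with the join, the two endpoints have no full window and get dropped.

```python
def moving_average3(series):
    """Centered 3-point mean over (ridenum, wordcount) pairs sorted by ridenum."""
    return [
        (mid[0], (prev[1] + mid[1] + nxt[1]) / 3)
        for prev, mid, nxt in zip(series, series[1:], series[2:])
    ]

# Nathan's five most recent rides in the log: (ridenum, wordcount)
nathan = [(444, 171), (446, 139), (448, 175), (450, 659), (452, 175)]

smoothed = moving_average3(nathan)
for ridenum, mean in smoothed:
    print(ridenum, round(mean, 1))
```

Even in this tiny sample the wordy ride-450 email pulls up the filtered values of its neighbors.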
This was all clearly a waste of time, but as a demo of vnlog workflows, this was ok.