Overview
I'm currently dealing with large data files composed of whitespace-separated text. This format is very well supported by various UNIX tools, which makes it an attractive way to store data. Many processing tools are available, but I discovered that their performance varies more widely than I had assumed. None of this will be news to many people, but I thought it'd be useful to run a few experiments to quantify the differences.
I ran a number of trials on a particular data file I had lying around. This file weighs in at about 40MB. It has roughly 120000 lines with about 100 whitespace-separated fields each (most fields are just a single character). I want to test the most basic possible parsing program:
- read in the data one line at a time
- split each line into fields, chopping off the trailing newline
- print out all but the first field
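For anyone who wants to reproduce this without my particular file, a file of roughly the same shape can be generated with a short python sketch like the one below. The contents are random and the resulting size won't match mine exactly; only the rough dimensions are similar. The /tmp/data path matches the one used in the timing commands later on.

    import random
    import string

    # Write roughly 120000 lines of about 100 whitespace-separated,
    # single-character fields to approximate the shape of the data file
    # described above. The contents are random; only the dimensions matter.
    with open('/tmp/data', 'w') as f:
        for _ in range(120000):
            fields = [random.choice(string.ascii_lowercase) for _ in range(100)]
            f.write(' '.join(fields) + '\n')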
I'm comparing perl, python, gawk, mawk and cut, all installed from packages on my Debian/sid box running on an amd64 arch. The package versions:
| Package name | version |
|---|---|
| perl | 5.22.2-1 |
| python2.7 | 2.7.11-7 |
| python3.6 | 3.6.0~b1-1 |
| gawk | 1:4.1.3+dfsg-0.1 |
| mawk | 1.3.3-17 |
| coreutils (for cut) | 8.25-2 |
Programs under test
Everything is plain ASCII: I have LANG=C and LC_ALL=C.
Perl
The perl code looks like this:
    use feature 'say';
    while(<>)
    {
        chop;
        @F = split;
        shift @F;
        $_ = join(' ',@F);
        say;
    }
I also tried omitting the while(<>) { ... say } with perl -p, and also omitting the chop and split with perl -a. This produced much shorter code, but had no measurable effect on the performance.
Python
The python code looks like this:
    import sys
    for l in sys.stdin:
        fields = l[:-1].split()
        fields[:1] = []
        scut = ' '.join(fields)
        print(scut)
The () in the print were included for python3 and omitted for python2.
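As an aside, a single source file could have run unchanged under both interpreters with a __future__ import; that's not what these timings used, but it avoids maintaining two print variants:

    # With this import, print is a function under python2.7 as well, so the
    # parenthesized form works identically under both interpreters.
    from __future__ import print_function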
update
Clément Pit-Claudel noted that python3 does more manipulation of text coming in from a buffer, and suggested the following flavor of the test, in order to bypass some of this overhead:
    import sys
    for l in sys.stdin.buffer:
        fields = l[:-1].split()
        fields[:1] = []
        scut = b' '.join(fields)
        sys.stdout.buffer.write(scut + b'\n')
awk
The awk program simply looks like this:
    { $1=""; print; }
This applies to both gawk and mawk. I ran this one as a one-liner, not even bothering to put it into a source file.
cut
Finally, the cut invocation was the ultimate one-liner:

    $ cut -d ' ' -f 2-
Test invocation
All the tests were executed a few times, with the mean wall-clock time being taken. For instance:
    $ for i in `seq 10`; do < /tmp/data time python2.7 tst.py > /dev/null; done |& awk '{n=NF-1; s+=$n} END{print s/NR}'
For each application I would take a measurement for the full program, and then I'd cut off commands from the end to get a sense of where the time was going.
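To make that concrete, here is roughly how the python variants shrink. This is an illustrative sketch of where each truncation stops, not a copy of the actual test files:

    import sys

    # The full program. Each "without X" column in the tables below keeps only
    # the lines above the named operation; the "without chop, split" variant is
    # just the bare loop with a pass as its body.
    for l in sys.stdin:
        fields = l[:-1].split()   # first dropped in the "without chop, split" variant
        fields[:1] = []           # first dropped in the "without shift" variant
        scut = ' '.join(fields)   # first dropped in the "without join" variant
        print(scut)               # first dropped in the "without print" variant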
Results
The raw results look like this (all timings in seconds):
| | Everything | without print | without join | without shift | without chop, split |
|---|---|---|---|---|---|
| perl | 3.04 | 3.01 | 2.40 | 2.38 | 0.11 |
| python2.7 | 1.19 | 1.08 | 0.76 | 0.71 | 0.05 |
| python3.6 | 1.64 | 1.30 | 0.97 | 0.89 | 0.13 |
| python3.6 (buffers) | 1.43 | 1.21 | 0.77 | 0.71 | 0.08 |
| gawk | 1.00 | 0.09 | 0.08 | 0.08 | |
| mawk | 0.65 | 0.63 | 0.00 | 0.00 | |
| cut | 0.55 | | | | |
Taking differences, we get this:
| | overhead | chop, split | shift | join | print |
|---|---|---|---|---|---|
| perl | 0.11 | 2.27 | 0.02 | 0.61 | 0.03 |
| python2.7 | 0.05 | 0.66 | 0.05 | 0.32 | 0.11 |
| python3.6 | 0.13 | 0.76 | 0.08 | 0.33 | 0.34 |
| python3.6 (buffers) | 0.08 | 0.63 | 0.06 | 0.44 | 0.22 |
| gawk | 0 | 0 | 0 | 0.01 | 0.91 |
| mawk | 0 | 0 | 0 | 0.63 | 0.02 |
| cut | 0.55 | | | | |
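The differences are just adjacent-column subtractions of the raw timings above; sketched in python for the perl row:

    # Raw perl timings, from "Everything" down to "without chop, split"
    raw = [3.04, 3.01, 2.40, 2.38, 0.11]

    # The cost attributed to each step is the difference between successive
    # truncations; whatever is left at the end is the overhead.
    print_cost      = raw[0] - raw[1]   # 0.03
    join_cost       = raw[1] - raw[2]   # 0.61
    shift_cost      = raw[2] - raw[3]   # 0.02
    chop_split_cost = raw[3] - raw[4]   # 2.27
    overhead        = raw[4]            # 0.11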
As expected, the plain cut utility from GNU coreutils is fastest, with mawk being slower, but not terribly so. In a distant 3rd place is gawk. All of these are what I would expect. The following were surprising, however. Python2.7 is not too much slower than gawk. Python3.6 is much slower than Python2.7, although some of this inefficiency can be bypassed by using the binary buffer i/o objects. And perl is far slower than all the others. It's very possible that perl and python3 are doing something overly complicated that can be turned off with some setting, but I don't at this time know what that is.
Looking at the components, perl seems to have trouble dealing with lists: the splitting and joining are dramatically slower for it than for the other tools.
So that's it. I guess the only grand conclusion for me is to be wary of perl for dealing with large data files. If anybody knows how to speed things up in any of these cases, email me.