Overview

I'm currently dealing with large data files composed of whitespace-separated text. This format is very well supported by various UNIX tools, which makes it an attractive way to store data. Many processing tools are available, but I discovered that their performance varies more widely than I had assumed. None of this will be news to many people, but I thought it'd be useful to run a few experiments to quantify the differences.

I ran a number of trials looking at a particular data file I had lying around. This file weighs in at about 40MB. It has roughly 120000 lines with about 100 whitespace-separated fields each (most fields are just a single character). I want to test the most basic possible parsing program:

  1. read in the data one line at a time
  2. split each line into fields, chopping off the trailing newline
  3. print out all but the first field

I'm comparing perl, python, gawk, mawk and cut, all installed from packages on my Debian/sid box running on an amd64 arch. The package versions:

Package                Version
perl                   5.22.2-1
python2.7              2.7.11-7
python3.6              3.6.0~b1-1
gawk                   1:4.1.3+dfsg-0.1
mawk                   1.3.3-17
coreutils (for cut)    8.25-2

Programs under test

Everything is plain ASCII: I have LANG=C and LC_ALL=C.
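
In shell terms, the tests ran with something like this in effect:

$ export LANG=C LC_ALL=C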

Perl

The perl code looks like this:

use feature 'say';
while(<>)
{
    chop;
    @F = split;
    shift @F;
    $_ = join(' ',@F);
    say;
}

I also tried replacing the explicit while(<>) { ... say } loop with perl -p, and the chop and split with perl -a. This produced much shorter code, but had no measurable effect on the performance.
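
For reference, that shorter variant looks roughly like this (a sketch, not necessarily the exact flags I used; -l takes care of the newline that chop was handling):

$ perl -lape 'shift @F; $_ = join(" ",@F)'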

Python

The python code looks like this:

import sys

for l in sys.stdin:
    fields     = l[:-1].split()
    fields[:1] = []
    scut       = ' '.join(fields)
    print(scut)

The parentheses in the print were included for python3 and omitted for python2.

Update

Clément Pit-Claudel noted that python3 does more manipulation of text coming in from a buffer, and suggested the following flavor of the test, in order to bypass some of this overhead:

import sys

for l in sys.stdin.buffer:
    fields     = l[:-1].split()
    fields[:1] = []
    scut       = b' '.join(fields)
    sys.stdout.buffer.write(scut + b'\n')

awk

The awk program simply looks like this:

{ $1=""; print; }

This applies to both gawk and mawk. I ran this one as a one-liner, not even bothering to put it into a source file.
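
So the mawk invocation, for instance, looked more or less like this:

$ mawk '{ $1=""; print; }' < /tmp/data > /dev/null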

cut

Finally, the cut invocation was an ultimate one-liner:

$ cut -d ' ' -f 2-

Test invocation

All the tests were executed several times, and the mean wall-clock time was taken. For instance:

$ for i in `seq 10`; do < /tmp/data time python2.7 tst.py > /dev/null; done |& awk '{n=NF-1; s+=$n} END{print s/NR}'

For each application I would take a measurement for the full program, and then I'd remove statements from the end, one at a time, to get a sense of where the time was going.
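
As an illustration, the "without join" flavor of the python test is simply the original script with the last statements dropped:

import sys

for l in sys.stdin:
    fields     = l[:-1].split()
    fields[:1] = []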

Results

The raw results look like this (all timings in seconds):

                     Everything   without print   without join   without shift   without chop, split
perl                       3.04            3.01           2.40            2.38                  0.11
python2.7                  1.19            1.08           0.76            0.71                  0.05
python3.6                  1.64            1.30           0.97            0.89                  0.13
python3.6 (buffers)        1.43            1.21           0.77            0.71                  0.08
gawk                       1.00            0.09           0.08            0.08
mawk                       0.65            0.63           0.00            0.00
cut                        0.55

Taking differences between adjacent columns, we get the cost of each step (for instance, the perl join cost is the "without print" time minus the "without join" time: 3.01 - 2.40 = 0.61):

                     overhead   chop, split   shift   join   print
perl                     0.11          2.27    0.02   0.61    0.03
python2.7                0.05          0.66    0.05   0.32    0.11
python3.6                0.13          0.76    0.08   0.33    0.34
python3.6 (buffers)      0.08          0.63    0.06   0.44    0.22
gawk                        0             0       0   0.01    0.91
mawk                        0             0       0   0.63    0.02
cut                      0.55

As expected, the plain cut utility from GNU coreutils is the fastest, with mawk slower, but not terribly so. In a distant 3rd place is gawk. All of this is what I would expect. The following were surprising, however. Python2.7 is not much slower than gawk. Python3.6 is much slower than Python2.7, although some of that inefficiency can be bypassed by reading and writing through the buffer objects directly. And perl is far slower than all the others. It's quite possible that perl and python3 are doing something overly complicated that can be turned off with some setting, but I don't currently know what that is.

Looking at the components, it looks like perl has trouble dealing with lists: the splitting and joining are dramatically slower for it than for other tools.

So that's it. I guess the only grand conclusion for me is to be wary of perl for dealing with large data files. If anybody knows how to speed things up in any of these cases, email me.