I just added a new tool to the vnlog toolkit: vnl-uniq. Similar to the others,
this one is a wrapper for the uniq tool in GNU coreutils. It reads just enough
of the input to get the legend, writes out the (possibly-modified) legend, and
then calls exec to pass control to uniq to handle the rest of the data stream
(i.e. to do all the actual work). The primary use case is to make histograms:

    $ cat objects.vnl
    # size color
    1 blue
    2 yellow
    1 yellow
    5 blue
    3 yellow
    4 orange
    2 orange

    $ < objects.vnl vnl-filter -p color | vnl-sort -k color | vnl-uniq -c
    # count color
    2 blue
    2 orange
    3 yellow

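The read-the-legend-then-exec pattern described above can be sketched in plain shell. This is only an illustration, not the actual vnl-uniq implementation: the function name legend_then_uniq is made up, and it assumes the legend is the very first line of the input and starts with "# ".

```shell
#!/bin/sh
# Hypothetical sketch of the legend-then-exec pattern; NOT the real vnl-uniq.
# Assumes the legend is the first input line and begins with "# ".
legend_then_uniq() (
    # Read only the legend line. On a pipe the read builtin does not consume
    # anything past the newline, so the data rows remain unread on stdin.
    IFS= read -r legend || exit 0

    # With -c, uniq prepends a count to every row, so the legend gains a column.
    case " $* " in
        *" -c "*) printf '# count %s\n' "${legend#"# "}" ;;
        *)        printf '%s\n' "$legend" ;;
    esac

    # Replace this subshell with uniq, which inherits the rest of stdin.
    exec uniq "$@"
)

# Usage: a pre-sorted single-column stream with a legend
printf '# color\nblue\nblue\norange\norange\nyellow\nyellow\nyellow\n' |
    legend_then_uniq -c
```

Because of the exec, no extra process sits between the input and uniq once the legend has been written; all the per-row work is done by uniq itself.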
I also added a --vnl-count NAME option to name the count column.

As happens each time I wrap one of these tools, I end up reading the
documentation and learning about new options. Apparently uniq knows how to use
a subset of the fields when testing for uniqueness: uniq -f N skips the first N
columns for the purposes of uniqueness. Naturally, vnl-uniq supports this, and
I added an extension: a negative N can be passed in to use only the last -N
columns. So to use just the one last column, pass -f -1. This allows the above
to be invoked a bit more simply:

    $ < objects.vnl vnl-sort -k color | vnl-uniq -c -f-1
    # count size color
    2 1 blue
    2 2 orange
    3 1 yellow

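One way a negative field count could be mapped onto what plain uniq accepts is to add it to the number of columns in the legend. The helper below is a guess at that arithmetic, not code from vnl-uniq; the name fields_to_skip is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the vnl-uniq source.
# Translate a possibly-negative field count into the non-negative N that
# uniq -f expects, given the legend line.
fields_to_skip() {
    legend=$1
    n=$2
    # Count the column names in the legend, ignoring the leading "#".
    set -- ${legend#"#"}
    ncols=$#
    if [ "$n" -lt 0 ]; then
        echo $(( ncols + n ))   # compare only the last -n columns
    else
        echo "$n"
    fi
}

# With two columns, "-1" means "skip the first 1 field".
skip=$(fields_to_skip '# size color' -1)

# Plain uniq on the already-sorted, legend-less rows
printf '1 blue\n5 blue\n2 orange\n4 orange\n1 yellow\n2 yellow\n3 yellow\n' |
    uniq -f "$skip"
```

The surviving row in each group is the first one, which is why the size column in the output holds the first size seen for each color.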
Note that I didn't need to filter the input to throw out the columns I wasn't
interested in. And as a side effect, the output of vnl-uniq now has the size
column also: this is the first size in each group of identical colors. Unclear
if this is useful, but it's what uniq does. Speaking of groups, something that
is useful is uniq --group, which adds visual separation to groups of identical
fields. To report the full dataset, grouped by color:

    $ < objects.vnl vnl-sort -k color | vnl-uniq --group -f-1
    # size color
    1 blue
    5 blue

    2 orange
    4 orange

    1 yellow
    2 yellow
    3 yellow

It looks like uniq provides no way to combine this with the counts (which makes
sense, given that uniq makes one pass through the data), but this can be done
by doing a join first. Looks complicated, but it's really not that bad:

    $ vnl-join -j color \
        <( < objects.vnl vnl-sort -k color ) \
        <( < objects.vnl vnl-filter -p color | vnl-sort -k color | vnl-uniq -c -f-1 ) |
      vnl-filter -p '!color',color |
      vnl-align |
      vnl-uniq --group -f-1
    # size count color
    1 2 blue
    5 2 blue

    2 2 orange
    4 2 orange

    1 3 yellow
    2 3 yellow
    3 3 yellow

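For comparison, the same sort, count, join and regroup sequence can be sketched with the underlying coreutils tools on legend-less data. This is an illustration under assumptions, not what the vnl-* wrappers do internally: the function name grouped_with_counts is made up, the input is assumed to be space-separated "size color" rows, and uniq --group needs GNU uniq.

```shell
#!/bin/sh
# Hypothetical sketch using plain coreutils; assumes "size color" rows on
# stdin with no legend, and GNU uniq for --group.
grouped_with_counts() {
    tmp=$(mktemp -d) || return 1

    # First pass: sort by color (field 2) so uniq and join see grouped keys.
    LC_ALL=C sort -k2,2 > "$tmp/sorted"

    # Second pass: per-color counts, as "count color" rows.
    awk '{ print $2 }' "$tmp/sorted" | uniq -c > "$tmp/counts"

    # join emits the key first ("color size count"); move it to the end so
    # that uniq -f 2 can skip the two leading fields and compare the color.
    LC_ALL=C join -1 2 -2 2 "$tmp/sorted" "$tmp/counts" |
        awk '{ print $2, $3, $1 }' |
        uniq --group -f 2

    rm -rf "$tmp"
}

printf '1 blue\n2 yellow\n1 yellow\n5 blue\n3 yellow\n4 orange\n2 orange\n' |
    grouped_with_counts
```

The awk reordering step plays the role of vnl-filter -p '!color',color above: it exists only because join puts the key in front while uniq -f compares trailing fields.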
It's awkward that uniq works off trailing fields but join puts the key field at
the front, but that's how it is. If I care enough, I may add some sort of
vnl-uniq --vnl-field F to make this nicer, but it's not obviously worth the
typing.