In the last few jobs I've worked at I ended up writing a tool to store data in a
nice format, and to be able to manipulate it easily. I'd rewrite this from
scratch each time partly because I was never satisfied with the previous
version. Each iteration was an improvement on the previous one, and this
version is the good one. I wrote it at NASA/JPL, went through the release
process (this thing was called asciilog
then), added a few more features, and
I'm now releasing it. The toolkit lives here and here's the initial README:
Summary
Vnlog (pronounced "vanillog") is a trivially-simple log format:
- A whitespace-separated table of ASCII human-readable text
- Lines beginning with `#` are comments
- The first line that begins with a single `#` (not `##` or `#!`) is a legend, naming each column

Example:

```
#!/usr/bin/whatever
# a b c
1 2 3
## another comment
4 5 6
```
Such data works very nicely with normal UNIX tools (`awk`, `sort`, `join`), can be easily read by fancier tools (`numpy`, `matlab` (yuck), `excel` (yuck), etc), and can be plotted with `feedgnuplot`. This toolkit provides some tools to manipulate vnlog data and a few libraries to read/write it. The core philosophy is to keep everything as simple and light as possible, and to provide methods to enable existing (and familiar!) tools and workflows to be utilized in nicer ways.
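For instance, if the example above is saved to a file (call it `example.vnl`; the name is just for illustration), plain `awk` can already process it, since every non-data line starts with `#`:

```
$ awk '!/^#/ {print $1 + $3}' example.vnl
4
10
```

That is, sum the `a` and `c` columns of each data row, skipping the comments and the legend.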
Synopsis
In one terminal, sample the CPU temperature over time, and write the data to a file as it comes in, at 1Hz:
```
$ ( echo '# time temp1 temp2 temp3'; while true; do echo -n "`date +%s` "; < /proc/acpi/ibm/thermal awk '{print $2,$3,$4; fflush()}'; sleep 1; done ) > /tmp/temperature.vnl
```
In another terminal, I sample the consumption of CPU resources, and log that to a file:
$ (echo "# user system nice idle waiting hardware_interrupt software_interrupt stolen"; top -b -d1 | awk '/%Cpu/ {print $2,$4,$6,$8,$10,$12,$14,$16; fflush()}') > /tmp/cpu.vnl
These logs are now accumulating, and I can do stuff with them. The legend and the last few measurements:
```
$ vnl-tail /tmp/temperature.vnl
# time temp1 temp2 temp3
1517986631 44 38 34
1517986632 44 38 34
1517986633 44 38 34
1517986634 44 38 35
1517986635 44 38 35
1517986636 44 38 35
1517986637 44 38 35
1517986638 44 38 35
1517986639 44 38 35
1517986640 44 38 34
```
I grab just the first temperature sensor, and align the columns
```
$ < /tmp/temperature.vnl vnl-tail | vnl-filter -p time,temp=temp1 | vnl-align
# time     temp
1517986746 45
1517986747 45
1517986748 46
1517986749 46
1517986750 46
1517986751 46
1517986752 46
1517986753 45
1517986754 45
1517986755 45
```
I do the same, but read the log data in realtime, and feed it to a plotting tool to get a live report of the CPU temperature. This plot updates as data comes in. I then spin a CPU core (`while true; do true; done`), and see the temperature climb. Here I'm making an ASCII plot that's pasteable into the docs.
```
$ < /tmp/temperature.vnl vnl-tail -f | vnl-filter --unbuffered -p time,temp=temp1 | feedgnuplot --stream --domain --lines --timefmt '%s' --set 'format x "%M:%S"' --ymin 40 --unset grid --terminal 'dumb 80,40'
```

[gnuplot "dumb" terminal output: an ASCII plot of temperature (y axis, 40 to 70) against time (x axis, mm:ss), showing the temperature climbing as the CPU core is loaded]
Cool. I can then join the logs, pull out simultaneous CPU consumption and temperature numbers, and plot the path in the temperature-cpu space:
```
$ vnl-join -j time /tmp/temperature.vnl /tmp/cpu.vnl | vnl-filter -p temp1,user | feedgnuplot --domain --lines --unset grid --terminal 'dumb 80,40'
```

[gnuplot "dumb" terminal output: an ASCII plot of CPU user time (y axis, 0 to 45) against temperature (x axis, 40 to 70), tracing the path in the temperature-cpu space]
Description
As stated before, vnlog tools are designed to be very simple and light. There exist other tools that are similar. For instance:
- https://csvkit.readthedocs.io/
- https://github.com/johnkerl/miller
- https://github.com/eBay/tsv-utils-dlang
These all provide facilities to run various analyses, and are neither simple nor light. Vnlog by contrast doesn't analyze anything, but makes it easy to write simple bits of awk or perl to process stuff to your heart's content. The main envisioned use case is one-liners, and the tools are geared for that purpose. The above mentioned tools are much more powerful than vnlog, so they could be a better fit for some use cases.
In the spirit of doing as little as possible, the provided tools are wrappers around tools you already have and are familiar with. The provided tools are:
- `vnl-filter` is a tool to select a subset of the rows/columns in a vnlog and/or to manipulate the contents. This is effectively an `awk` wrapper where the fields can be referenced by name instead of index. 20-second tutorial:

  ```
  vnl-filter -p col1,col2,colx=col3+col4 'col5 > 10' --has col6
  ```

  will read the input, and produce a vnlog with 3 columns: `col1` and `col2` from the input and a column `colx` that's the sum of `col3` and `col4` in the input. Only those rows for which `col5 > 10` is true will be output. Finally, only those rows that have a non-null value for `col6` will be selected. A null entry is signified by a single `-` character. A concrete sample run appears right after this list.

  ```
  vnl-filter --eval '{s += x} END {print s}'
  ```

  will evaluate the given awk program on the input, but the column names work as you would hope they do: if the input has a column named `x`, this would produce the sum of all values in this column.
- `vnl-sort`, `vnl-join`, `vnl-tail` are wrappers around the corresponding GNU Coreutils tools. These work exactly as you would expect also: the columns can be referenced by name, and the legend comment is handled properly. These are wrappers, so all the commandline options those tools have "just work" (except options that don't make sense in the context of vnlog). As an example, `vnl-tail -f` will follow a log: data will be read by `vnl-tail` as it is written into the log (just like `tail -f`, but handling the legend properly). And you already know how to use these tools without even reading the manpages! Note: these were written for and have been tested with the Linux kernel and GNU Coreutils `sort`, `join` and `tail`. Other kernels and tools probably don't (yet) work. Send me patches.

- `vnl-align` aligns vnlog columns for easy interpretation by humans. The meaning is unaffected.

- `Vnlog::Parser` is a simple perl library to read a vnlog.

- `libvnlog` is a C library to simplify writing a vnlog. Clearly all you really need is `printf()`, but this is useful if we have lots of columns, many containing null values in any given row, and/or if we have parallel threads writing to a log.

- `vnl-make-matrix` converts a one-point-per-line vnlog to a matrix of data. I.e.

  ```
  $ cat dat.vnl
  # i j x
  0 0 1
  0 1 2
  0 2 3
  1 0 4
  1 1 5
  1 2 6
  2 0 7
  2 1 8
  2 2 9
  3 0 10
  3 1 11
  3 2 12

  $ < dat.vnl vnl-filter -p i,x | vnl-make-matrix --outdir /tmp
  Writing to '/tmp/x.matrix'

  $ cat /tmp/x.matrix
  1 2 3
  4 5 6
  7 8 9
  10 11 12
  ```
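To make the `vnl-filter` tutorial above concrete, here is a small sample run. The file `demo.vnl` and its contents are made up for this illustration; the options are exactly the ones described above, though the real tool may format the output slightly differently:

```
$ cat demo.vnl
# col1 col2 col3 col4 col5 col6
1 2 10 20 11 5
3 4 30 40 9 7
5 6 50 60 12 -

$ < demo.vnl vnl-filter -p col1,col2,colx=col3+col4 'col5 > 10' --has col6
# col1 col2 colx
1 2 30
```

Only the first row survives: the second fails the `col5 > 10` condition, and the third has a null `col6`, so `--has col6` drops it.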
All the tools have manpages that contain more detail, and more tools will probably be added with time.
C interface
For most uses, these logfiles are simple enough to be generated with plain prints. But then each print statement has to know which numeric column we're populating, which becomes effortful with many columns. In my usage it's common to have a large parallelized C program that's writing logs with hundreds of columns where any one record would contain only a subset of the columns. In such a case, it's helpful to have a library that can output the log files. This is available. Basic usage looks like this:
In a shell:
```
$ vnl-gen-header 'int w' 'uint8_t x' 'char* y' 'double z' 'void* binary' > vnlog_fields_generated.h
```
In a C program `test.c`:

```
#include "vnlog_fields_generated.h"

int main()
{
    vnlog_emit_legend();

    vnlog_set_field_value__w(-10);
    vnlog_set_field_value__x(40);
    vnlog_set_field_value__y("asdf");
    vnlog_emit_record();

    vnlog_set_field_value__z(0.3);
    vnlog_set_field_value__x(50);
    vnlog_set_field_value__w(-20);
    vnlog_set_field_value__binary("\x01\x02\x03", 3);
    vnlog_emit_record();

    vnlog_set_field_value__w(-30);
    vnlog_set_field_value__x(10);
    vnlog_set_field_value__y("whoa");
    vnlog_set_field_value__z(0.5);
    vnlog_emit_record();

    return 0;
}
```
Then we build and run, and we get
```
$ cc -o test test.c -lvnlog

$ ./test
# w x y z binary
-10 40 asdf - -
-20 50 - 0.2999999999999999889 AQID
-30 10 whoa 0.5 -
```
The binary field is base64-encoded. This is a rarely-used feature, but sometimes you really need to log binary data for later processing, and this makes it possible.
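As a quick illustration of that encoding, the `AQID` value in the output above decodes back to the bytes that were logged; this uses only the standard `base64` and `od` tools, nothing vnlog-specific:

```
$ echo AQID | base64 -d | od -An -tx1
 01 02 03
```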
So you:

- Generate the header to define your columns
- Call `vnlog_emit_legend()`
- Call `vnlog_set_field_value__...()` for each field you want to set in that row
- Call `vnlog_emit_record()` to write the row and to reset all fields for the next row. Any fields not set with a `vnlog_set_field_value__...()` call are written as null: `-`
This is enough for 99% of the use cases. Things get a bit more complex if we have threading or if we have multiple vnlog output streams in the same program. For both of these we use vnlog contexts.
To support reentrant writing into the same vnlog by multiple threads, each log-writer should create a context, and use it when talking to vnlog. The context functions will make sure that the fields in each context are independent and that the output records won't clobber each other:
```
void child_writer( // the parent context also writes to this vnlog. Pass NULL to
                   // use the global one
                   struct vnlog_context_t* ctx_parent )
{
    struct vnlog_context_t ctx;
    vnlog_init_child_ctx(&ctx, ctx_parent);

    while(records)
    {
        vnlog_set_field_value_ctx__xxx(&ctx, ...);
        vnlog_set_field_value_ctx__yyy(&ctx, ...);
        vnlog_set_field_value_ctx__zzz(&ctx, ...);
        vnlog_emit_record_ctx(&ctx);
    }
}
```
If we want to have multiple independent vnlog writers to different streams (with different columns and legends), we do this instead:
`file1.c`:

```
#include "vnlog_fields_generated1.h"

void f(void)
{
    // Write some data out to the default context and default output (STDOUT)
    vnlog_emit_legend();
    ...
    vnlog_set_field_value__xxx(...);
    vnlog_set_field_value__yyy(...);
    ...
    vnlog_emit_record();
}
```
`file2.c`:

```
#include "vnlog_fields_generated2.h"

void g(void)
{
    // Make a new session context, send output to a different file, write
    // out legend, and send out the data
    struct vnlog_context_t ctx;
    vnlog_init_session_ctx(&ctx);

    FILE* fp = fopen(...);
    vnlog_set_output_FILE(&ctx, fp);

    vnlog_emit_legend_ctx(&ctx);
    ...
    vnlog_set_field_value__a(...);
    vnlog_set_field_value__b(...);
    ...
    vnlog_emit_record();
}
```
Note that it's the user's responsibility to make sure the new sessions go to a different `FILE` by invoking `vnlog_set_output_FILE()`. Furthermore, note that the included `vnlog_fields_....h` file defines the fields we're writing to; and if we have multiple different vnlog field definitions in the same program (as in this example), then the different writers must live in different source files. The compiler will barf if you try to `#include` two different `vnlog_fields_....h` files in the same source.
More APIs:

- `vnlog_printf(...)` and `vnlog_printf_ctx(ctx, ...)` write to a pipe like `printf()` does. This exists for comments.

- `vnlog_clear_fields_ctx(ctx, do_free_binary)` clears out the data in a context and makes it ready to be used for the next record. It is rare for the user to have to call this manually. The most common case is handled automatically (clearing out a context after emitting a record). One area where this is useful is when making a copy of a context:

  ```
  struct vnlog_context_t ctx1;
  // .... do stuff with ctx1 ... add data to it ...

  struct vnlog_context_t ctx2 = ctx1;
  // ctx1 and ctx2 now both have the same data, and the same pointers to
  // binary data. I need to get rid of the pointer references in ctx1
  vnlog_clear_fields_ctx(&ctx1, false);
  ```

- `vnlog_free_ctx(ctx)` frees memory for a vnlog context. Do this before throwing the context away. Currently this is only needed for contexts that have binary fields, but it should be called for all contexts, just in case.
numpy interface
The built-in `numpy.loadtxt` and `numpy.savetxt` functions work well to read and write these files. For example, to write to standard output a vnlog with fields `a`, `b` and `c`:

```
numpy.savetxt(sys.stdout, array, fmt="%g", header="a b c")
```
Note that numpy automatically adds the `#` to the header. To read a vnlog from a file on disk, do something like

```
array = numpy.loadtxt('data.vnl')
```
These functions know that #
lines are comments, but don't interpret anything
as field headers. That's easy to do, so I'm not providing any helper libraries.
I might do that at some point, but in the meantime, patches are welcome.
Caveats and bugs
The tools that wrap GNU coreutils (`vnl-sort`, `vnl-join`, `vnl-tail`) are written specifically to work with the Linux kernel and the GNU coreutils. None of these have been tested with BSD tools or with non-Linux kernels, and I'm sure things don't just work. It's probably not too effortful to get that running, but somebody needs to at least bug me for that. Or better yet, send me nice patches :)
These tools are meant to be simple, so some things are hard requirements. A big one is that columns are whitespace-separated. There is no mechanism for escaping or quoting whitespace into a single field. I think supporting something like that is more trouble than it's worth.
Repository
Author
Dima Kogan (dima@secretsauce.net)
License and copyright
This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.
Copyright 2016-2017 California Institute of Technology
Copyright 2017-2018 Dima Kogan (dima@secretsauce.net)
`b64_cencode.c` comes from `cencode.c` in the `libb64` project. It is written by Chris Venter (chris.venter@gmail.com), who placed it in the public domain. The full text of the license is in that file.