02

vnlog.slurp() with non-numerical data

For a while now I'd see an annoying problem when trying to analyze data. I would be trying to import into numpy an innocuous-looking data file like this:

#  image   x y z temperature
image1.png 1 2 5 34
image2.png 3 4 1 35

As usual, I would be using vnlog.slurp() (a thin wrapper around numpy.loadtxt()) to read this in, but that doesn't work: the image filenames aren't parseable as numerical values. Up until now I would work around this by using the suprocess module to fork off a vnl-filter -p !image and then slurp that, but it's a pain and slow and has other issues. I just solved this conclusively using the numpy structured dtypes. I can now do this:

dtype = np.dtype([ ('image',       'U16'),
                   ('x y z',       int, (3,)),
                   ('temperature', float), ])

arr = vnlog.slurp("data.vnl", dtype=dtype)

This will read the image filename, the xyz points and the temperature into different sub-arrays, with different types each. Accessing the result looks like this:

print(arr['image'])
---> array(['image1.png', 'image2.png'], dtype='<U16')

print(arr['x y z'])
---> array([[1, 2, 5],
            [3, 4, 1]])

print(arr['temperature'])
---> array([34., 35.])

Notes:

  • The given structured dtype defines both how to organize the data, and which data to extract. So it can be used to read in only a subset of the available columns. Here I could have omitted the temperature column, for instance
  • Sub-arrays are allowed. In the example I could say either
    dtype = np.dtype([ ('image',       'U16'),
                       ('x y z',       int, (3,)),
                       ('temperature', float), ])
    

    or

    dtype = np.dtype([ ('image',       'U16'),
                       ('x',           int),
                       ('y',           int),
                       ('z',           int),
                       ('temperature', float), ])
    

    The latter would read x, y, z into separate, individual arrays. Sometime we want this, sometimes not.

  • Nested structured dtypes are not allowed. Fields inside other fields are not supported, since it's not clear how to map that to a flat vnlog legend
  • If a structured dtype is given, slurp() returns the array only, since the field names are already available in the dtype

We still do not support records with any null values (-). This could probably be handled with the converters kwarg of numpy.loadtxt(), but that sounds slow. I'll look at that later.

This is available today in vnlog 1.38.