Last week at the lab I received a data dump: a gzip-compressed tarball with lots of images in it. The images are all uncompressed .pgm, with the whole tarball weighing in at ~1TB. I tried to extract it, and after chugging all day, it ran out of disk space. Added more disk, tried again: out of space again. Just getting a listing of the archive contents (tar tvfz) took something like 8 hours.
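That listing was just the stock invocation, something like

tar tvfz archive.tar.gz

and it's slow for a fundamental reason: tar has no index and gzip isn't seekable, so even a listing means decompressing the entire archive front to back.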
Clearly this is unreasonable. I made an executive decision to use .jpg files instead: I'd take the small image quality hit for the massive gains in storage efficiency. But the tarball has .pgm, and just extracting the thing is challenging. So I'm now extracting the archive, and converting all the .pgm images to .jpg as soon as they hit disk. How? Glad you asked!
I'm running two parallel terminal sessions (I'm using screen, but you can do whatever you like).
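One way to set that up with screen (the session names here are just mine):

screen -S extract    # session 1: run the untar pipeline here, then detach with C-a d
screen -S convert    # session 2: run the inotify/convert pipeline here
# reattach to either one later with `screen -r extract` or `screen -r convert`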
Session 1
< archive.tar.gz unpigz -p20 | tar xv
Here I'm just extracting the archive to disk normally, using unpigz instead of plain old tar xvfz to get parallelization.
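For comparison, the usual single-threaded invocation would be

tar xvfz archive.tar.gz

The only difference is who does the decompression: plain gzip there, unpigz -p20 here.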
Session 2
inotifywait -r PATH -e close_write -m | mawk -Winteractive '/pgm$/ { print $1$3 }' | parallel -v -j10 'convert {} -quality 96 {.}.jpg && rm {}'
This is the secret sauce. I'm using inotifywait to tell me when any file is closed for writing anywhere under PATH. Then I mawk it to only tell me when .pgm files are done being written, then I convert them to .jpg, and delete the .pgm when that's done. I'm using GNU Parallel to parallelize the image conversion; otherwise the conversion doesn't keep up with the extraction.
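For readability, here's the same pipeline pulled apart, with a comment on what each stage does (PATH being wherever the tarball is extracting to):

inotifywait -r PATH -e close_write -m |                         # watch PATH recursively; print one line per file closed after writing
    mawk -Winteractive '/pgm$/ { print $1$3 }' |                # pass through only .pgm files; $1 is the watched directory, $3 the filename
    parallel -v -j10 'convert {} -quality 96 {.}.jpg && rm {}'  # up to 10 ImageMagick conversions at once; delete each .pgm only if its conversion succeeded

One ordering note: inotifywait only reports events that happen after its watches are in place, so Session 2 has to be up and running before Session 1 starts writing files, or the earliest images will never get converted.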
This is going to take at least all day, but I'm reasonably confident that it will actually finish successfully, and then I can do stuff with the data.