Last week at the lab I received a data dump: a gzip-compressed tarball with lots
of images in it. The images are all uncompressed .pgm, with the whole tarball
weighing in at ~1TB. I tried to extract it, and after chugging all day, it ran
out of disk space. Added more disk, tried again: out of space again. Just
getting a listing of the archive contents (tar tvfz) took something like 8
hours.
Clearly this is unreasonable. I made an executive decision to use .jpg files
instead: I'd take the small image quality hit for the massive gains in storage
efficiency. But the tarball contains .pgm files, and even just extracting it is
a challenge. So I'm extracting the archive and converting each .pgm image to
.jpg as soon as it hits disk. How? Glad you asked!
I'm running two parallel terminal sessions (I'm using screen, but you can do
whatever you like).
Session 1
< archive.tar.gz unpigz -p20 | tar xv
Here I'm just extracting the archive to disk normally. I'm using unpigz instead
of letting tar run plain old gzip, to parallelize the decompression.
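(A variation that I believe works with a reasonably recent GNU tar: have tar run
the decompressor itself via -I / --use-compress-program. The exact quoting of
the flags here is my assumption, not something I've tested on this archive.)
tar -I 'unpigz -p20' -xvf archive.tar.gz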
Session 2
inotifywait -r PATH -e close_write -m |
  mawk -Winteractive '/pgm$/ { print $1$3 }' |
  parallel -v -j10 'convert {} -quality 96 {.}.jpg && rm {}'
This is the secret sauce. I'm using inotifywait to tell me whenever any file
anywhere under PATH is closed for writing. Then I mawk it so that only finished
.pgm files get through (-Winteractive keeps mawk from buffering its output, so
filenames flow down the pipe immediately). Each of those gets converted to
.jpg, and the .pgm is deleted once the conversion succeeds. I'm using GNU
Parallel to run 10 conversions at a time; otherwise the conversion can't keep
up with the extraction.
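To make the $1$3 bit concrete: in monitor mode inotifywait prints one
whitespace-separated line per event, with the watched directory (trailing slash
included), the event names, and the filename, so gluing fields 1 and 3 together
gives the full path of the file that was just closed. A line looks roughly like
this (the directory and filename are made-up examples, and the trick assumes
the paths contain no spaces):
/data/extracted/scan01/ CLOSE_WRITE,CLOSE frame0001.pgm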
This is going to take at least all day, but I'm reasonably confident that it will actually finish successfully, and then I can do stuff with the data.
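If any .pgm files slip through (anything written before inotifywait had its
watches set up, for instance), a final sweep with the same conversion command
should mop up the stragglers once tar finishes:
find PATH -name '*.pgm' | parallel -v -j10 'convert {} -quality 96 {.}.jpg && rm {}'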