Oriol Nieto

Reputation: 5629

Read snappy or lzo compressed files from DataFlow in Python

Is there a way to read snappy or lzo compressed files on DataFlow using Apache Beam's Python SDK?

Since I couldn't find an easier way, this is my current approach (which seems like total overkill and is inefficient):

Upvotes: 0

Views: 745

Answers (1)

Alex Amato

Reputation: 1725

I don't think there is any built-in way to do this in Beam today. The Python SDK supports gzip, bzip2, and deflate.

Option 1: Read in the whole files and decompress them manually

  1. Create a custom source that produces a list of filenames (e.g. seeded from a pipeline option by listing a directory) and emits them as records
  2. In the following ParDo, read each file manually and decompress it (see the sketch below). You will need to use a GCS client library to read the files if your data is stored on GCS.

This solution will likely not perform as fast, and it will not be able to handle files that are too large to fit into memory. But if your files are small, it might work well enough.
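A minimal sketch of this approach, assuming the files live on GCS and are snappy-compressed; the bucket and object names are hypothetical, and the python-snappy and google-cloud-storage packages are assumed to be installed on the workers:

    import apache_beam as beam
    import snappy
    from google.cloud import storage


    class ReadAndDecompress(beam.DoFn):
        """Downloads one GCS object per element and decompresses it in memory."""

        def process(self, gcs_path):
            # gcs_path is e.g. 'gs://my-bucket/data/part-000.snappy' (hypothetical).
            bucket_name, blob_name = gcs_path[len('gs://'):].split('/', 1)
            blob = storage.Client().bucket(bucket_name).blob(blob_name)
            compressed = blob.download_as_bytes()
            # Whole-file decompression: the entire file must fit in memory.
            for line in snappy.decompress(compressed).decode('utf-8').splitlines():
                yield line


    with beam.Pipeline() as pipeline:
        lines = (
            pipeline
            # Seed the pipeline with filenames; in a real job these could come
            # from listing a GCS prefix via a pipeline option instead.
            | 'Filenames' >> beam.Create([
                'gs://my-bucket/data/part-000.snappy',
                'gs://my-bucket/data/part-001.snappy',
            ])
            | 'ReadAndDecompress' >> beam.ParDo(ReadAndDecompress())
        )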

Option 2: Add a new decompressor to Beam

You may be able to contribute a decompressor to Beam. It looks like you would need to implement the decompression logic and provide some constants to specify it when authoring a pipeline.

I think one of the constraints is that it must be possible to scan the file and decompress it in chunks. If the compression format requires reading the whole file into memory, then it will likely not work. This is because the TextIO libraries are designed to be record based: they support reading large files that don't fit into memory by breaking them up into small records for processing.
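For reference, this is how the existing built-in codecs are selected when authoring a pipeline; a contributed snappy or LZO codec would presumably be exposed through the same CompressionTypes constants (the GCS path below is hypothetical):

    import apache_beam as beam
    from apache_beam.io.filesystem import CompressionTypes

    with beam.Pipeline() as pipeline:
        lines = pipeline | 'Read' >> beam.io.ReadFromText(
            'gs://my-bucket/data/part-*.gz',         # hypothetical path
            compression_type=CompressionTypes.GZIP)  # built-in codec constant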

Upvotes: 2
