Reputation: 5629
Is there a way to read snappy or lzo compressed files on DataFlow using Apache Beam's Python SDK?
Since I couldn't find an easier way, this is my current approach (which seems totally overkill and inefficient):
Upvotes: 0
Views: 745
Reputation: 1725
I don't think there is any built-in way to do this today with Beam. The Python SDK supports gzip, bzip2, and deflate.
Option 1: Read in whole files and decompress them manually
This solution will likely not perform as well, and because it has to load each whole file into memory it won't handle large files. But if your files are small, it might work well enough. A sketch follows.
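Here is a minimal sketch of this approach, assuming the files are raw snappy-compressed text and that the python-snappy package is installed on the workers; the bucket path and DoFn name are hypothetical, and a Hadoop/framed snappy format would need a different decompress call:

```python
import apache_beam as beam
from apache_beam.io import fileio
import snappy  # python-snappy; assumed installed on the workers


class DecompressSnappyFile(beam.DoFn):
    """Reads each matched file fully into memory, decompresses it,
    and emits one element per line."""

    def process(self, readable_file):
        # readable_file is a fileio.ReadableFile; read() pulls the whole file into memory
        compressed = readable_file.read()
        text = snappy.decompress(compressed).decode("utf-8")
        for line in text.splitlines():
            yield line


with beam.Pipeline() as p:
    lines = (
        p
        | "Match" >> fileio.MatchFiles("gs://my-bucket/data/*.snappy")  # hypothetical path
        | "Read" >> fileio.ReadMatches()
        | "Decompress" >> beam.ParDo(DecompressSnappyFile())
    )
```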
Option 2: Add a new decompressor to Beam.
You may be able to contribute a decompressor to Beam. It looks like you would need to implement the decompressor logic and provide some constants so the new format can be selected when authoring a pipeline (the sketch below shows where the existing constants appear today).
I think one of the constraints is that it must be possible to scan the file and decompress it in chunks. If the compression format requires reading the whole file into memory, it will likely not work, because the TextIO libraries are record-based: they are designed to read large files that don't fit into memory and break them up into small records for processing.
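For reference, this is roughly how the existing compression constants are selected when authoring a pipeline today; a contributed snappy or LZO codec would presumably add a new value alongside GZIP here (the bucket path is hypothetical):

```python
import apache_beam as beam
from apache_beam.io.filesystem import CompressionTypes

with beam.Pipeline() as p:
    lines = p | beam.io.ReadFromText(
        "gs://my-bucket/data/*.gz",               # hypothetical path
        compression_type=CompressionTypes.GZIP)   # a new SNAPPY/LZO constant would plug in here
```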
Upvotes: 2