Oriol Nieto

Reputation: 5629

Read snappy or lzo compressed files from DataFlow in Python

Is there a way to read snappy or lzo compressed files on DataFlow using Apache Beam's Python SDK?

Since I couldn't find an easier way, this is my current approach (which seems like total overkill and is inefficient):

Upvotes: 0

Views: 745

Answers (1)

Alex Amato

Reputation: 1725

I don't think there is any built-in way to do this in Beam today. The Python SDK supports gzip, bzip2, and deflate.

Option 1: Read in the whole files and decompress them manually

  1. Create a custom source that produces a list of filenames (e.g. seeded from a pipeline option by listing a directory) and emits them as records
  2. In the following ParDo, read each file manually and decompress it (see the sketch below). You will need to use a GCS client library to read the files if your data is stored on GCS.

This solution will likely not perform as fast, and it will not be able to handle files that are too large to fit into memory. But if your files are small, it might work well enough.
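A minimal sketch of this approach, assuming the files live on GCS and are snappy-compressed; the bucket and object names are hypothetical, and the python-snappy and google-cloud-storage packages are assumed to be installed on the workers:

    import apache_beam as beam
    import snappy
    from google.cloud import storage


    class ReadAndDecompress(beam.DoFn):
        """Downloads one GCS object per element and decompresses it in memory."""

        def process(self, gcs_path):
            # gcs_path is e.g. 'gs://my-bucket/data/part-000.snappy' (hypothetical).
            bucket_name, blob_name = gcs_path[len('gs://'):].split('/', 1)
            blob = storage.Client().bucket(bucket_name).blob(blob_name)
            compressed = blob.download_as_bytes()
            # Whole-file decompression: the entire file must fit in memory.
            for line in snappy.decompress(compressed).decode('utf-8').splitlines():
                yield line


    with beam.Pipeline() as pipeline:
        lines = (
            pipeline
            # Seed the pipeline with filenames; in a real job these could come
            # from listing a GCS prefix via a pipeline option instead.
            | 'Filenames' >> beam.Create([
                'gs://my-bucket/data/part-000.snappy',
                'gs://my-bucket/data/part-001.snappy',
            ])
            | 'ReadAndDecompress' >> beam.ParDo(ReadAndDecompress())
        )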

Option 2: Add a new decompressor to Beam

You may be able to contribute a decompressor to Beam. It looks like you would need to implement the decompression logic and provide some constants to specify it when authoring a pipeline.

I think one of the constraints is that it must be possible to scan the file and decompress it in chunks. If the compression format requires reading the whole file into memory, then it will likely not work. This is because the TextIO libraries are designed to be record based: they support reading large files that don't fit into memory by breaking them up into small records for processing.
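For reference, this is how the existing built-in codecs are selected when authoring a pipeline; a contributed snappy or LZO codec would presumably be exposed through the same CompressionTypes constants (the GCS path below is hypothetical):

    import apache_beam as beam
    from apache_beam.io.filesystem import CompressionTypes

    with beam.Pipeline() as pipeline:
        lines = pipeline | 'Read' >> beam.io.ReadFromText(
            'gs://my-bucket/data/part-*.gz',         # hypothetical path
            compression_type=CompressionTypes.GZIP)  # built-in codec constant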

Upvotes: 2
