Tobi

Reputation: 936

Is Dataflow making use of Google Cloud Storage's gzip transcoding?

I am trying to process JSON files (10 GB uncompressed/2 GB compressed) and I want to optimize my pipeline.

According to the official docs, Google Cloud Storage (GCS) can transcode gzip files, meaning the application receives them decompressed as long as they are tagged correctly. Google Cloud Dataflow (GCDF) achieves better parallelism with uncompressed files, so I was wondering whether setting this metadata tag on GCS has a positive effect on performance.
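For context, this is roughly how the tagging I mean would be set; a minimal sketch using the google-cloud-storage Python client, with placeholder bucket and object names:

```python
# Sketch: tagging an object for GCS decompressive transcoding.
# Bucket and object names are placeholders.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")
blob = bucket.blob("input/data.json.gz")

# Objects with Content-Encoding: gzip are decompressed by GCS on
# download unless the client requests the raw bytes.
blob.content_encoding = "gzip"
blob.content_type = "application/json"
blob.patch()  # persist the metadata change
```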

Since my input files are relatively large, does it make sense to decompress them first so that Dataflow can split them into smaller chunks?

Upvotes: 2

Views: 405

Answers (1)

Pablo

Reputation: 11041

You should not use this meta tag. It's dangerous, because GCS would report the size of your file incorrectly (e.g. report the compressed size, while Dataflow/Beam would read the uncompressed data).

In any case, the splitting of uncompressed files relies on reading in parallel from different segments of a file, and this is not possible if the file is originally compressed.
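To illustrate, here is a minimal sketch with the Beam Python SDK (the GCS path is a placeholder): a gzipped input is read as a single, non-splittable stream per file, whereas the same data stored uncompressed can be split into ranges and read in parallel:

```python
import apache_beam as beam
from apache_beam.io.textio import ReadFromText
from apache_beam.io.filesystem import CompressionTypes

with beam.Pipeline() as p:
    # Compressed: one worker streams each .gz file end to end.
    gz_lines = p | "ReadGzip" >> ReadFromText(
        "gs://my-bucket/input/*.json.gz",
        compression_type=CompressionTypes.GZIP)

    # Uncompressed: the runner can split each file and read segments
    # in parallel, which is where the better parallelism comes from.
    # plain_lines = p | "ReadPlain" >> ReadFromText(
    #     "gs://my-bucket/input/*.json")
```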

Upvotes: 2
