Reputation: 936
I am trying to process JSON files (10 GB uncompressed/2 GB compressed) and I want to optimize my pipeline.
According to the official docs, Google Cloud Storage (GCS) can transcode gzip files, meaning the application receives them uncompressed when they are tagged with the right metadata (`Content-Encoding: gzip`). Google Cloud Dataflow (GCDF) achieves better parallelism with uncompressed files, so I was wondering: does setting this metadata on GCS have a positive effect on performance?
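For reference, this is roughly how I would set that tag; a minimal sketch using the google-cloud-storage Python client, with placeholder bucket and object names:

```python
from google.cloud import storage

# Equivalent to: gsutil setmeta -h "Content-Encoding:gzip" gs://my-bucket/...
client = storage.Client()
blob = client.bucket("my-bucket").blob("data/input-00.json.gz")  # placeholders

# With this set, GCS serves the object decompressed ("decompressive
# transcoding") to clients that don't request gzip via Accept-Encoding.
blob.content_encoding = "gzip"
blob.content_type = "application/json"
blob.patch()
```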
Since my input files are relatively large, does it make sense to unzip them upfront so that Dataflow can split them into smaller chunks?
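If decompressing upfront is the way to go, I would do it once before launching the job, along these lines (a sketch; the `compressed/` and `uncompressed/` prefixes are placeholders, and it streams through temp files so the 10 GB never has to fit in memory):

```python
import gzip
import shutil
import tempfile

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")  # placeholder bucket name

for blob in client.list_blobs("my-bucket", prefix="compressed/"):
    if not blob.name.endswith(".gz"):
        continue
    with tempfile.TemporaryFile() as gz_file, tempfile.TemporaryFile() as raw_file:
        # Download the compressed object, then decompress it in a stream.
        blob.download_to_file(gz_file)
        gz_file.seek(0)
        with gzip.open(gz_file, "rb") as src:
            shutil.copyfileobj(src, raw_file)
        # Upload the uncompressed copy under a parallel prefix.
        raw_file.seek(0)
        target_name = "uncompressed/" + blob.name[len("compressed/"):-len(".gz")]
        bucket.blob(target_name).upload_from_file(
            raw_file, content_type="application/json")
```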
Upvotes: 2
Views: 405
Reputation: 11041
You should not use this metadata tag. It's dangerous: GCS would report the size of your file incorrectly (i.e., the compressed size), while Dataflow/Beam would read the uncompressed data.
In any case, splitting an uncompressed file relies on reading different segments of the file in parallel, and this is not possible if the file is stored compressed, since a gzip stream cannot be decompressed starting from an arbitrary offset.
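So the practical approach is to decompress the files once and point the pipeline at the uncompressed copies. A minimal Beam (Python) sketch, assuming newline-delimited JSON and a placeholder bucket path:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     # Uncompressed files can be split into byte ranges, so many workers
     # read one large file concurrently; a gzipped file would instead be
     # read end-to-end by a single worker.
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/uncompressed/*.json")
     | "Parse" >> beam.Map(json.loads))
```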
Upvotes: 2