Reputation: 1735
We have a large number of compressed files stored in a GCS bucket, and I am attempting to bulk decompress them using the provided utility. The data is in a timestamped directory hierarchy: YEAR/MONTH/DAY/HOUR/files.txt.gz. Dataflow accepts wildcard input patterns, e.g. inputFilePattern=gs://source-data/raw/nginx/2019/01/01/*/*.txt.gz. However, the directory structure is flattened at output: all the files are decompressed into a single directory. Is it possible to maintain the directory hierarchy using the bulk decompressor? Is there another possible solution?
gcloud dataflow jobs run gregstest \
--gcs-location gs://dataflow-templates/latest/Bulk_Decompress_GCS_Files \
--service-account-email [email protected] \
--project shopify-data-kernel \
--parameters \
inputFilePattern=gs://source-data/raw/nginx/2019/01/01/*/*.txt.gz,\
outputDirectory=gs://uncompressed-data/uncompressed,\
outputFailureFile=gs://uncompressed-data/failed
Upvotes: 0
Views: 1397
Reputation: 3893
I have looked at the Java code of the bulk decompressor; the run method that returns a PipelineResult matches the files against the input pattern, decompresses each matched file, and appears to write each result to the output directory under its file name alone, so the input path is not preserved. It looks like the template decompresses individual files only, not directory trees, which is why the hierarchy is flattened. I recommend checking this thread on Stack Overflow for possible solutions for decompressing files in GCS.
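The template itself does not expose a parameter for preserving the input path, so the closest workaround while keeping the bulk decompressor is to launch one job per leaf (HOUR) directory and point outputDirectory at a matching path on the destination side. Below is a minimal sketch of that idea; the job names, output paths and the failed.csv file name are my own illustrative choices, and you would add --project, --service-account-email and --region as in your original command:

# Launch one decompression job per HOUR directory so each job writes
# into an output directory that mirrors the input hierarchy.
for hour_dir in $(gsutil ls -d gs://source-data/raw/nginx/2019/01/01/*/); do
  hour=$(basename "$hour_dir")   # e.g. "05"
  gcloud dataflow jobs run "decompress-2019-01-01-${hour}" \
    --gcs-location gs://dataflow-templates/latest/Bulk_Decompress_GCS_Files \
    --parameters \
inputFilePattern=${hour_dir}*.txt.gz,\
outputDirectory=gs://uncompressed-data/uncompressed/2019/01/01/${hour},\
outputFailureFile=gs://uncompressed-data/failed/2019/01/01/${hour}/failed.csv
done

This launches 24 small jobs for one day of data, which is acceptable for a one-off backfill; for a very large hierarchy, a custom Beam pipeline that derives the output path from the matched input path would scale better.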
I hope you find this information useful.
Upvotes: 1