Gregology

Reputation: 1735

GCP Bulk Decompress maintaining file structure

We have a large number of compressed files stored in a GCS bucket, and I am attempting to bulk decompress them using the provided utility. The data is in a timestamped directory hierarchy: YEAR/MONTH/DAY/HOUR/files.txt.gz. Dataflow accepts wildcard input patterns, e.g. inputFilePattern=gs://source-data/raw/nginx/2019/01/01/*/*.txt.gz. However, the directory structure is flattened in the output: all the files are decompressed into a single directory. Is it possible to maintain the directory hierarchy using the bulk decompressor? Is there another possible solution?

gcloud dataflow jobs run gregstest \
    --gcs-location gs://dataflow-templates/latest/Bulk_Decompress_GCS_Files \
    --service-account-email [email protected] \
    --project shopify-data-kernel \
    --parameters \
inputFilePattern=gs://source-data/raw/nginx/2019/01/01/*/*.txt.gz,\
outputDirectory=gs://uncompressed-data/uncompressed,\
outputFailureFile=gs://uncompressed-data/failed

Upvotes: 0

Views: 1397

Answers (1)

aga

Reputation: 3893

I have looked at the Java code of the bulk decompressor; the pipeline it builds performs the following steps (see the sketch after the list for where the directory hierarchy gets lost):

  1. Find all files matching the input pattern
  2. Decompress the files found and output them to the output directory
  3. Write any errors to the failure output file
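
The flattening described in the question is consistent with the output object name being derived from the base filename alone. The snippet below is a hypothetical illustration of that naming step, not the template's actual code:

import posixpath

OUTPUT_DIR = "gs://uncompressed-data/uncompressed"

# Hypothetical naming step, for illustration only: if the output name is built
# from the base filename alone, every input lands in the same directory,
# regardless of its YEAR/MONTH/DAY/HOUR path.
def output_name(src_path: str) -> str:
    filename = posixpath.basename(src_path)        # "files.txt.gz"
    return posixpath.join(OUTPUT_DIR, filename[:-len(".gz")])

print(output_name("gs://source-data/raw/nginx/2019/01/01/00/files.txt.gz"))
print(output_name("gs://source-data/raw/nginx/2019/01/01/01/files.txt.gz"))
# Both print gs://uncompressed-data/uncompressed/files.txt: the hierarchy is gone,
# and same-named files from different hours would overwrite each other.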

It looks like the API decompresses only individual files, not directories of files. I recommend checking this thread on Stack Overflow for possible approaches to decompressing files in GCS; a workaround that preserves the hierarchy is sketched below.
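
If keeping the hierarchy matters more than using the template, one workaround is to decompress the objects yourself and mirror the source path in the destination. Below is a minimal sketch, assuming the google-cloud-storage Python client and the bucket and prefix names from the question (adjust them to your environment):

import gzip

from google.cloud import storage

client = storage.Client()
src_bucket = client.bucket("source-data")
dst_bucket = client.bucket("uncompressed-data")

BASE_PREFIX = "raw/nginx/"             # hierarchy below this prefix is preserved
LIST_PREFIX = "raw/nginx/2019/01/01/"  # only decompress this day
DST_PREFIX = "uncompressed/"

for blob in client.list_blobs(src_bucket, prefix=LIST_PREFIX):
    if not blob.name.endswith(".txt.gz"):
        continue
    # Keep the object path relative to BASE_PREFIX so the YEAR/MONTH/DAY/HOUR
    # hierarchy is mirrored in the destination bucket.
    relative = blob.name[len(BASE_PREFIX):]          # e.g. "2019/01/01/00/files.txt.gz"
    dst_name = DST_PREFIX + relative[:-len(".gz")]   # e.g. "uncompressed/2019/01/01/00/files.txt"
    data = gzip.decompress(blob.download_as_bytes())
    dst_bucket.blob(dst_name).upload_from_string(data)
    print(f"gs://{src_bucket.name}/{blob.name} -> gs://{dst_bucket.name}/{dst_name}")

This downloads each object into memory, so it is only suitable for reasonably sized files; for very large objects, or very many of them, a custom Dataflow/Beam pipeline with the same relative-path logic would be the more scalable option.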

I hope you find this information useful.

Upvotes: 1
