Reputation: 185
I have a shell script that processes compressed (gzipped) .avro
files stored in GCS and loads them into BigQuery.
Here's the current small setup:
process_file.sh:
#!/bin/bash
set -e
PROJECT_ID="my-project"
FILE=$1
# Download the object to a local temp file
gsutil cat "$FILE" > "/tmp/$(basename "$FILE")"
# Load the local Avro file into BigQuery
bq load --source_format=AVRO --project_id="$PROJECT_ID" dataset.table "/tmp/$(basename "$FILE")"
Then I run this:
gsutil ls gs://path/*.avro | nohup parallel -j5 ./process_file.sh {} > parallel_task.log 2>&1 &
The reason this works is that gsutil (or any wrapper) automatically decompresses a file when it is downloaded locally. For cloud-to-cloud, I can do:
gsutil cat "$FILE" | gunzip | gsutil cp - "${TARGETDIR}${BASENAME}"
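Combining the two, I suppose I could also decompress into a staging location and point bq load at the GCS URI directly, skipping the local download entirely; a rough sketch (the staging bucket and the dataset/table names are placeholders):
SRC="gs://source-bucket/some-file.avro"                 # gzipped despite the .avro name
DST="gs://staging-bucket/decompressed/some-file.avro"   # hypothetical staging path
gsutil cat "$SRC" | gunzip | gsutil cp - "$DST"
bq load --source_format=AVRO my_dataset.my_table "$DST"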
This setup works for a small number of files but becomes inefficient and hard to monitor when scaling to thousands of files or more. I need to process about 75GB of files daily (around 1500 files).
The files are compressed (effectively .avro.gz, although they are saved as just .avro), which is annoying because I can't just load them into BigQuery directly and have to go through the middle step of decompressing them first.
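(I can confirm they really are whole-file gzip, rather than Avro's internal block compression, by peeking at the first two bytes of an object; 1f 8b is the gzip magic number. The path below is a placeholder.)
gsutil cat "gs://path/some-file.avro" | head -c 2 | xxd -p   # prints 1f8b for gzip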
Question: what are some ways of scaling such processes on GCS? I'm new to cloud infrastructure in general and don't have much data engineering experience.
I'm interested in a simple process that won't incur too big a cost.
I've looked into Dataflow, but it seems like overkill for such a simple task. They do have a bulk decompress pipeline, but it relies on the file extension to identify the compression type, so I would have to batch-rename my files to .avro.gz
first, which defeats the purpose: if I have to run a batch job over every file anyway, I might as well do the decompression in that job. I tried it regardless, and it crashed with an error I wasn't able to debug.
I also have a hefty backlog (a year's worth) that I'm looking to start processing at scale once I figure out a way to do it at all. So that's the target scale here.
This seems like such a standard problem (batch-decompress files and load them into BigQuery, or wherever), yet I haven't been able to find a satisfactory solution!
I would appreciate any suggestions.
Upvotes: 0
Views: 149
Reputation: 176
BigQuery supports batch loading Avro files directly as long as the data blocks inside the file are compressed with a supported codec (Snappy, DEFLATE, zstd). Since your files are gzipped as whole files, creating a function that fetches them and decompresses the contents is indeed the nearest solution, but the issue you encountered when using a function might be due to network bandwidth and the maximum execution time, since the process involves decompressing a lot of files. As mentioned by @somethingsomething, it would be helpful to post your code so that we can take a closer look at what went wrong.
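For illustration, if the files had been written with a supported codec (for example Snappy block compression) instead of being gzipped afterwards, you could load them straight from GCS with no intermediate step; a minimal sketch, with placeholder dataset and table names:
bq load --source_format=AVRO my_dataset.my_table "gs://path/*.avro"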
You can take a look at this thread about loading a jsonl.gz file from GCS into BigQuery using a Cloud Function.
However, given your scale (75GB of files daily), Dataflow might be a better solution since there is a template that decompresses a batch of files on GCS.
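If you go that route, running the Bulk Decompress Cloud Storage Files template looks roughly like the sketch below (job name, region, and paths are placeholders; double-check the parameter names against the template documentation, and note that the template infers the compression from the file extension, so the objects would need a .gz suffix):
gcloud dataflow jobs run decompress-avro \
  --gcs-location=gs://dataflow-templates/latest/Bulk_Decompress_GCS_Files \
  --region=us-central1 \
  --parameters="inputFilePattern=gs://path/*.avro.gz,outputDirectory=gs://path/decompressed,outputFailureFile=gs://path/decompressed/failures.csv"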
Upvotes: 1
Reputation: 2189
This is more of a comment, but it's too long:
Notes:
Since it's gzip, you may or may not be cheaper off by not unzipping the files yourself and instead leveraging decompressive transcoding (https://cloud.google.com/storage/docs/transcoding). You will most likely need to update the metadata of all files for that, though, which is a Class A operation per object (a sketch follows these notes).
make sure that whatever you use is running in the same region as your bucket, as otherwise you will pay data transfer costs on all of the data.
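A minimal sketch of that metadata update, assuming the objects really are plain gzip streams (the bucket path is a placeholder):
# Mark every object as gzip-encoded so GCS transcodes on download
# (one Class A operation per object)
gsutil -m setmeta -h "Content-Encoding:gzip" "gs://path/*.avro"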
Upvotes: 1