Reputation: 185
I have a shell script that processes compressed (gzipped) .avro
files stored in GCS and loads them into BigQuery.
Here's the current small setup:
process_file.sh:
#!/bin/bash
set -e
PROJECT_ID="my-project"
FILE=$1
# Download the object to a local temp file
gsutil cat "$FILE" > "/tmp/$(basename "$FILE")"
# Load the local Avro file into BigQuery
bq load --source_format=AVRO --project_id="$PROJECT_ID" dataset.table "/tmp/$(basename "$FILE")"
Then I run this:
gsutil ls gs://path/*.avro | nohup parallel -j5 ./process_file.sh {} > parallel_task.log 2>&1 &
The reason this works is that gsutil (or any wrapper) automatically decompresses a file when it is downloaded locally. For cloud-to-cloud, I can do:
gsutil cat "$FILE" | gunzip | gsutil cp - "${TARGETDIR}${BASENAME}"
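Combining the two, I suppose I could also decompress into a staging location and point bq load at the GCS URI directly, skipping the local download entirely; a rough sketch (the staging bucket and the dataset/table names are placeholders):
SRC="gs://source-bucket/some-file.avro"                 # gzipped despite the .avro name
DST="gs://staging-bucket/decompressed/some-file.avro"   # hypothetical staging path
gsutil cat "$SRC" | gunzip | gsutil cp - "$DST"
bq load --source_format=AVRO my_dataset.my_table "$DST"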
This setup works for a small number of files but becomes inefficient and hard to monitor when scaling to thousands of files or more. I need to process about 75GB of files daily (around 1500 files).
The files are compressed (effectively .avro.gz, although they are saved as just .avro), which is annoying because I can't just load them into BigQuery directly and have to go through the middle step of decompressing them first.
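(I can confirm they really are whole-file gzip, rather than Avro's internal block compression, by peeking at the first two bytes of an object; 1f 8b is the gzip magic number. The path below is a placeholder.)
gsutil cat "gs://path/some-file.avro" | head -c 2 | xxd -p   # prints 1f8b for gzip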
Question: what are some ways of scaling such processes on GCS? I'm new to cloud infrastructure in general and don't have much data engineering experience.
I'm interested in a simple process that won't incur too big a cost.
I've looked into Dataflow, but it seems like overkill for such a simple task. They do have a bulk decompress pipeline, but it relies on the file extension to identify the compression type, so I would have to batch-rename my files to .avro.gz
first, which defeats the purpose: if I have to run a batch job over every file anyway, I might as well do the decompression in that job. I tried it regardless, and it crashed with an error I wasn't able to debug.
I also have a hefty backlog (a year's worth) that I'm looking to start processing at scale once I figure out a way to do it at all. So that's the target scale here.
This seems like such a standard problem (batch-decompress files and load them into BigQuery, or wherever), yet I haven't been able to find a satisfactory solution!
I would appreciate any suggestions.
Upvotes: 0
Views: 149
Reputation: 176
BigQuery supports batch loading Avro files directly as long as the data blocks inside the file are compressed with a supported codec (Snappy, DEFLATE, zstd). Since your files are gzipped as whole files, creating a function that fetches them and decompresses the contents is indeed the nearest solution, but the issue you encountered when using a function might be due to network bandwidth and the maximum execution time, since the process involves decompressing a lot of files. As mentioned by @somethingsomething, it would be helpful to post your code so that we can take a closer look at what went wrong.
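For illustration, if the files had been written with a supported codec (for example Snappy block compression) instead of being gzipped afterwards, you could load them straight from GCS with no intermediate step; a minimal sketch, with placeholder dataset and table names:
bq load --source_format=AVRO my_dataset.my_table "gs://path/*.avro"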
You can take a look at this thread about loading a jsonl.gz file from GCS into BigQuery using a Cloud Function.
However, given your scale (75GB of files daily), Dataflow might be a better solution since there is a template that decompresses a batch of files on GCS.
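If you go that route, running the Bulk Decompress Cloud Storage Files template looks roughly like the sketch below (job name, region, and paths are placeholders; double-check the parameter names against the template documentation, and note that the template infers the compression from the file extension, so the objects would need a .gz suffix):
gcloud dataflow jobs run decompress-avro \
  --gcs-location=gs://dataflow-templates/latest/Bulk_Decompress_GCS_Files \
  --region=us-central1 \
  --parameters="inputFilePattern=gs://path/*.avro.gz,outputDirectory=gs://path/decompressed,outputFailureFile=gs://path/decompressed/failures.csv"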
Upvotes: 1
Reputation: 2189
This is more of a comment, but it's too long:
Notes:
Since it's gzip, you may or may not be cheaper off by not unzipping the files yourself and instead leveraging decompressive transcoding (https://cloud.google.com/storage/docs/transcoding). You will most likely need to update the metadata of all files for that, though, which is a Class A operation per object (a sketch follows these notes).
make sure that whatever you use is running in the same region as your bucket, as otherwise you will pay data transfer costs on all of the data.
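A minimal sketch of that metadata update, assuming the objects really are plain gzip streams (the bucket path is a placeholder):
# Mark every object as gzip-encoded so GCS transcodes on download
# (one Class A operation per object)
gsutil -m setmeta -h "Content-Encoding:gzip" "gs://path/*.avro"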
Upvotes: 1