Spine Feast

Reputation: 185

Bulk decompressing files on GCS

I have a shell script that processes compressed (gzipped) .avro files stored in GCS and loads them into BigQuery.

Here's the current small setup:

process_file.sh:

#!/bin/bash
set -e
PROJECT_ID="my-project"
FILE="$1"
LOCAL_COPY="/tmp/$(basename "$FILE")"

# Download the object from GCS to a local temp file.
gsutil cat "$FILE" > "$LOCAL_COPY"

# Load the Avro file into BigQuery (table reference format is project:dataset.table).
bq load --source_format=AVRO --project_id="$PROJECT_ID" project:dataset.table "$LOCAL_COPY"

Then I run this:

gsutil ls gs://path/*.avro | nohup parallel -j5 ./process_file.sh {} > parallel_task.log 2>&1 &

Question: what are some ways of scaling such processes on GCS? I'm new to cloud infrastructure in general and don't have much data engineering experience.

This seems like such a standard problem (batch decompress files and upload to BQ or wherever), yet I haven't been able to find a satisfactory solution!

I would appreciate any suggestions.

Upvotes: 0

Views: 149

Answers (2)

yannco

Reputation: 176

BigQuery supports batch loading Avro files directly as long as the data blocks are compressed with a supported codec (Snappy, DEFLATE, zstd); a gzip-wrapped file is not. Since you are using gzip, creating a function that fetches the files and decompresses the contents is indeed the most straightforward solution, but the issue you've encountered when using a function might be due to network bandwidth and the maximum execution time, since the process involves decompressing a lot of files. As mentioned by @somethingsomething, it would be helpful to post your code so that we can take a closer look at what went wrong.
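If you keep the shell-based approach in the meantime, here is a minimal sketch of the decompress-then-load step. The bucket path and the project:dataset.table reference are placeholders from your question, and this assumes the Avro data blocks themselves use a supported codec once the gzip wrapper is removed:

#!/bin/bash
# Sketch only: strip the gzip wrapper locally, then load the plain Avro file.
set -e
FILE="$1"                             # e.g. gs://path/part-0001.avro.gz
LOCAL="/tmp/$(basename "$FILE" .gz)"  # drop the .gz suffix for the decompressed copy

# Stream the object from GCS and decompress it on the fly.
gsutil cat "$FILE" | gunzip > "$LOCAL"

# Load the decompressed Avro file into BigQuery.
bq load --source_format=AVRO project:dataset.table "$LOCAL"

rm -f "$LOCAL"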

You can take a look at this thread about loading a jsonl.gz file from GCS into BigQuery using a Cloud Function.

However, given your scale (75 GB of files daily), Dataflow might be a better fit, since there is a Google-provided template that decompresses a batch of files on GCS.
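For reference, that is the Bulk Decompress Cloud Storage Files template. A rough sketch of launching it with gcloud follows; the job name, region, and bucket paths are placeholders, and you should verify the template location and parameter names against the current docs:

# Sketch: launch the Bulk Decompress Cloud Storage Files Dataflow template.
gcloud dataflow jobs run decompress-avro-job \
  --region=us-central1 \
  --gcs-location=gs://dataflow-templates-us-central1/latest/Bulk_Decompress_GCS_Files \
  --parameters="inputFilePattern=gs://path/*.avro.gz,outputDirectory=gs://path/decompressed/,outputFailureFile=gs://path/decompressed/failures.csv"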

Upvotes: 1

somethingsomething

Reputation: 2189

This is more of a comment, but it's too long:

  1. Dataflow: your remark about the template requiring extensions is odd; you could just modify the template, which at first sight looks like a one-line change.
  2. I would just set up a simple Cloud Function for this on a Pub/Sub topic and trigger all the files by posting one message per file to the topic (see the sketch after this list). However, getting a list of objects may be non-trivial if there are a LOT of objects in the bucket. Your remark notes that you tried something like that, but nobody can help you if you don't post your code.
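A rough sketch of that fan-out, with hypothetical names (decompress-fn, decompress-files) and assuming a function source directory with an entry point called handle_message already exists:

# Sketch: create a topic, deploy a (hypothetical) function that decompresses one
# object per message, then publish one message per .avro file in the bucket.
gcloud pubsub topics create decompress-files

gcloud functions deploy decompress-fn \
  --runtime=python311 \
  --trigger-topic=decompress-files \
  --entry-point=handle_message \
  --source=./decompress-fn \
  --region=us-central1 \
  --memory=1024MB \
  --timeout=540s

gsutil ls "gs://path/*.avro" | while read -r object; do
  gcloud pubsub topics publish decompress-files --message="$object"
done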

Notes:

  • Since it's gzip, you may or may not come out cheaper by not unzipping the files yourself and leveraging decompressive transcoding instead (https://cloud.google.com/storage/docs/transcoding); you will most likely need to update the metadata of all files for that though, which is a Class A operation per object (see the setmeta sketch after these notes).

  • Make sure that whatever you use runs in the same region as your bucket, as otherwise you will pay data transfer costs on all of the data.
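For the transcoding route, a minimal sketch of the metadata update (the bucket path is a placeholder, and each object touched is billed as a Class A operation, as noted above):

# Sketch: mark the gzipped objects so GCS can serve them with decompressive
# transcoding. Each setmeta call counts as a Class A operation per object.
gsutil -m setmeta \
  -h "Content-Encoding:gzip" \
  -h "Content-Type:application/octet-stream" \
  "gs://path/*.avro.gz"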

Upvotes: 1
