Reputation: 23
I'm using the following ETL pipeline to get data into BigQuery. The data sources are .csv & .xls files from a URL, posted daily at 3 pm. Cloud Scheduler publishes a message to a Cloud Pub/Sub topic at 3:05 pm, and Pub/Sub pushes the message to the subscribing Cloud Functions. When triggered, these Cloud Functions (Python scripts) download the file from the URL, perform transformations (cleaning, formatting, aggregations & filtering) and upload the result to BigQuery.
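Roughly, each function has this shape (a simplified sketch only: the URL, table name and transformations are placeholders, and pandas plus the BigQuery Python client are assumed here just to illustrate the pattern):

```python
import pandas as pd
from google.cloud import bigquery

SOURCE_URL = "https://example.com/daily-report.csv"    # placeholder URL
TARGET_TABLE = "my_project.my_dataset.daily_report"    # placeholder table

def etl_handler(event, context):
    """Background function triggered by the 3:05 pm Pub/Sub message."""
    # Download: pandas reads the .csv straight from the URL
    # (.xls files would use pd.read_excel instead).
    df = pd.read_csv(SOURCE_URL)

    # Transform: cleaning / formatting / aggregation / filtering happen here.
    df = df.dropna().rename(columns=str.lower)

    # Load: append the cleaned frame to BigQuery and wait for the load job.
    client = bigquery.Client()
    job = client.load_table_from_dataframe(
        df,
        TARGET_TABLE,
        job_config=bigquery.LoadJobConfig(write_disposition="WRITE_APPEND"),
    )
    job.result()
```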
Is there a cleaner way in GCP to download files from a URL on a schedule, transform them & upload them to BigQuery, instead of using Cloud Scheduler + Pub/Sub + Cloud Functions?
I researched Dataflow but can't figure out if it can do all three jobs (downloading from a URL on a schedule, transforming, and uploading to BQ).
Upvotes: 1
Views: 1173
Reputation: 619
I do this kind of thing all the time and I can see why you'd wonder if there's a cleaner way. We use Composer (Airflow) in GCP. In your scenario we would create one DAG with four sequential tasks, as sketched below.
The Composer DAG would look something like this:
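(Minimal sketch only: the four task names, URL, table and schedule below are illustrative placeholders.)

```python
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator
from google.cloud import bigquery

# NOTE: /tmp is used only to keep the sketch short; in Composer a GCS
# bucket is a safer handoff between tasks that may run on different workers.

def download_file():
    pd.read_csv("https://example.com/daily-report.csv").to_csv("/tmp/raw.csv", index=False)

def transform_file():
    df = pd.read_csv("/tmp/raw.csv")
    df.dropna().rename(columns=str.lower).to_csv("/tmp/clean.csv", index=False)

def load_to_bigquery():
    client = bigquery.Client()
    client.load_table_from_dataframe(
        pd.read_csv("/tmp/clean.csv"),
        "my_project.my_dataset.daily_report",
    ).result()

def validate_load():
    client = bigquery.Client()
    rows = client.query(
        "SELECT COUNT(*) AS row_count FROM `my_project.my_dataset.daily_report`"
    ).result()
    print(list(rows))

with DAG(
    dag_id="daily_csv_to_bq",
    start_date=datetime(2020, 1, 1),
    schedule_interval="5 15 * * *",   # 3:05 pm daily, replacing Cloud Scheduler
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="download_file", python_callable=download_file)
    t2 = PythonOperator(task_id="transform_file", python_callable=transform_file)
    t3 = PythonOperator(task_id="load_to_bigquery", python_callable=load_to_bigquery)
    t4 = PythonOperator(task_id="validate_load", python_callable=validate_load)

    t1 >> t2 >> t3 >> t4
```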
All the code required to load a table end to end lives in one DAG, in one folder.
You do need to pay for, and maintain, a Composer instance on GCP. It would be interesting to see how other companies do this kind of thing.
Upvotes: 0
Reputation: 75775
In your architecture, Dataflow can only replace the Pub/Sub + Cloud Functions part. You still need a scheduler to run a Dataflow job (based on a template, maybe your own custom template).
But before using Dataflow, ask yourself why you need it. I'm in charge of a data lake that ingests data from different sources, and because each element to ingest is small enough to be kept in memory (in Cloud Run, which is very similar to Cloud Functions), there is no problem keeping that pattern if it works!
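For completeness, if you did go the template route, the scheduler-triggered launch mentioned above could look roughly like this (a sketch only: the project, region, template path and pipeline parameters are placeholders, and the classic-template launch method of the Dataflow v1b3 API is assumed):

```python
from googleapiclient.discovery import build

def launch_dataflow_template(event, context):
    """Pub/Sub-triggered Cloud Function that starts a templated Dataflow job."""
    dataflow = build("dataflow", "v1b3", cache_discovery=False)
    request = dataflow.projects().locations().templates().launch(
        projectId="my-project",                          # placeholder project
        location="us-central1",                          # placeholder region
        gcsPath="gs://my-bucket/templates/csv_to_bq",    # your custom template
        body={
            "jobName": "csv-to-bq-daily",
            "parameters": {
                "inputUrl": "https://example.com/daily-report.csv",
                "outputTable": "my-project:my_dataset.daily_report",
            },
        },
    )
    print(request.execute())
```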
Upvotes: 3