Reputation: 23
I'm using the following ETL pipeline to get data into BigQuery. The data sources are .csv & .xls files from a URL, posted daily at 3 pm. Cloud Scheduler publishes a message to a Cloud Pub/Sub topic at 3:05 pm, and Pub/Sub pushes the message to the subscribing Cloud Functions. When triggered, these Cloud Functions (Python scripts) download the file from the URL, perform transformations (cleaning, formatting, aggregations & filtering) and upload the result to BigQuery.
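Roughly, each function has this shape (a simplified sketch only: the URL, table name and transformations are placeholders, and pandas plus the BigQuery Python client are assumed here just to illustrate the pattern):

```python
import pandas as pd
from google.cloud import bigquery

SOURCE_URL = "https://example.com/daily-report.csv"    # placeholder URL
TARGET_TABLE = "my_project.my_dataset.daily_report"    # placeholder table

def etl_handler(event, context):
    """Background function triggered by the 3:05 pm Pub/Sub message."""
    # Download: pandas reads the .csv straight from the URL
    # (.xls files would use pd.read_excel instead).
    df = pd.read_csv(SOURCE_URL)

    # Transform: cleaning / formatting / aggregation / filtering happen here.
    df = df.dropna().rename(columns=str.lower)

    # Load: append the cleaned frame to BigQuery and wait for the load job.
    client = bigquery.Client()
    job = client.load_table_from_dataframe(
        df,
        TARGET_TABLE,
        job_config=bigquery.LoadJobConfig(write_disposition="WRITE_APPEND"),
    )
    job.result()
```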
Is there a cleaner way in GCP to download files from a URL on a schedule, transform them & upload them to BigQuery, instead of using Cloud Scheduler + Pub/Sub + Cloud Functions?
I researched Dataflow but can't figure out if it can do all three jobs (downloading from a URL on a schedule, transforming, and uploading to BQ).
Upvotes: 1
Views: 1173
Reputation: 619
I do this kind of thing all the time and I can see why you'd wonder if there's a cleaner way. We use Composer (Airflow) in GCP. In your scenario we would create one DAG with four sequential tasks, as sketched below.
The Composer DAG would look something like this:
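(Minimal sketch only: the four task names, URL, table and schedule below are illustrative placeholders.)

```python
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator
from google.cloud import bigquery

# NOTE: /tmp is used only to keep the sketch short; in Composer a GCS
# bucket is a safer handoff between tasks that may run on different workers.

def download_file():
    pd.read_csv("https://example.com/daily-report.csv").to_csv("/tmp/raw.csv", index=False)

def transform_file():
    df = pd.read_csv("/tmp/raw.csv")
    df.dropna().rename(columns=str.lower).to_csv("/tmp/clean.csv", index=False)

def load_to_bigquery():
    client = bigquery.Client()
    client.load_table_from_dataframe(
        pd.read_csv("/tmp/clean.csv"),
        "my_project.my_dataset.daily_report",
    ).result()

def validate_load():
    client = bigquery.Client()
    rows = client.query(
        "SELECT COUNT(*) AS row_count FROM `my_project.my_dataset.daily_report`"
    ).result()
    print(list(rows))

with DAG(
    dag_id="daily_csv_to_bq",
    start_date=datetime(2020, 1, 1),
    schedule_interval="5 15 * * *",   # 3:05 pm daily, replacing Cloud Scheduler
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="download_file", python_callable=download_file)
    t2 = PythonOperator(task_id="transform_file", python_callable=transform_file)
    t3 = PythonOperator(task_id="load_to_bigquery", python_callable=load_to_bigquery)
    t4 = PythonOperator(task_id="validate_load", python_callable=validate_load)

    t1 >> t2 >> t3 >> t4
```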
All the code required to load a table end to end lives in one DAG, in one folder.
You do need to pay for, and maintain, a Composer instance on GCP. It would be interesting to see how other companies do this kind of thing.
Upvotes: 0
Reputation: 75775
In your architecture, Dataflow can only replace the Pub/Sub + Cloud Functions part. You still need a scheduler to run a Dataflow job (based on a template, maybe your own custom template).
But before using Dataflow, ask yourself why you need it. I'm in charge of a data lake that ingests data from different sources, and because each element to ingest is small enough to be kept in memory (in Cloud Run, which is very similar to Cloud Functions), there is no problem keeping that pattern if it works!
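For completeness, if you did go the template route, the scheduler-triggered launch mentioned above could look roughly like this (a sketch only: the project, region, template path and pipeline parameters are placeholders, and the classic-template launch method of the Dataflow v1b3 API is assumed):

```python
from googleapiclient.discovery import build

def launch_dataflow_template(event, context):
    """Pub/Sub-triggered Cloud Function that starts a templated Dataflow job."""
    dataflow = build("dataflow", "v1b3", cache_discovery=False)
    request = dataflow.projects().locations().templates().launch(
        projectId="my-project",                          # placeholder project
        location="us-central1",                          # placeholder region
        gcsPath="gs://my-bucket/templates/csv_to_bq",    # your custom template
        body={
            "jobName": "csv-to-bq-daily",
            "parameters": {
                "inputUrl": "https://example.com/daily-report.csv",
                "outputTable": "my-project:my_dataset.daily_report",
            },
        },
    )
    print(request.execute())
```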
Upvotes: 3