Reputation: 25
I am trying to write an ingestion application using GCP services. There could be around 1 TB of data each day which can come in a streaming way (i.e, 100 GIG each hour or even by once at a specific time)
I am trying to design an ingestion application, I first thoght it is a good idea to write a simple Python script within a cron job to read files sequentiallly (or even within two three threads) and then publish them as a message to pub/sub. Further I need to have a Dataflow job running always read data from pub/sub and save them to BigQuery.
But I really want to know If I need pub/sub at all here, I know dataflow could be very flexible and i wanted to know can I ingest 1 TB of data directly from GCS to BigQuery as batch job, or it is better to be done by a streaming job (by pub/sub) as I told above? what are the pros cons of each approach in terms of cost?
Upvotes: 1
Views: 196
Reputation: 1294
It seems like you don't need Pub/Sub at all.
There is already a Dataflow template for direct transfer of text files from Cloud Storage to BigQuery (in BETA just like the Pub/Sub to BigQuery template) and in general, batch jobs are cheaper than stream jobs (see Pricing Details).
Upvotes: 1