Reputation: 26729
When creating a new Dataflow Pub/Sub to BigQuery template it is possible to specify the Pub/Sub topic. It appears there is no way to provide an existing Pub/Sub subscription; instead, the Dataflow template creates a new subscription each time it runs.
As far as I understand the Pub/Sub model, the only way to make sure we keep reading data from the same place in a topic is to reuse the same subscription, and there seems to be no such option here.
What will happen when a user wants to re-deploy such a Dataflow template? Are we going to lose all the data between deployments?
Upvotes: 2
Views: 377
Reputation: 11
As an update, there is now a new template for this very use case:
gcloud dataflow jobs run $jobname \
--project=$project \
--disable-public-ips \
--gcs-location gs://dataflow-templates-$location/latest/PubSub_Subscription_to_BigQuery \
--worker-machine-type n1-standard-1 \
--region $location \
--staging-location gs://$bucket/pss-to-bq \
--parameters inputSubscription=projects/$project/subscriptions/$subscription,outputTableSpec=$dataset.$table
Upvotes: 0
Reputation: 1672
You're right, the Google-provided Pub/Sub to BigQuery template does not support passing a subscription as a parameter (here's an older answer by a Googler confirming this). However, it should be easy to edit the template so that it does. You would only need to replace getInputTopic with a getSubscription equivalent, which in turn should be passed to PubsubIO.readMessagesWithAttributes().fromSubscription(options.getSubscription()) (see here) instead of fromTopic. After creating your new pipeline, you'd need to create and stage your template.
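A minimal sketch of that change, assuming a classic template built with the Beam Java SDK; the class name and the inputSubscription option name below are illustrative, not the actual identifiers from the Google-provided template:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.values.PCollection;

public class PubSubSubscriptionToBigQuery {

  // Hypothetical options interface: the topic option is replaced with a
  // subscription equivalent so the pipeline reads from an existing subscription.
  public interface Options extends PipelineOptions {
    @Description("Pub/Sub subscription to read from, e.g. projects/<project>/subscriptions/<name>")
    ValueProvider<String> getInputSubscription();

    void setInputSubscription(ValueProvider<String> value);
  }

  public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    Pipeline pipeline = Pipeline.create(options);

    // fromSubscription replaces fromTopic, so a re-deployed job keeps reading from
    // the same subscription instead of getting a fresh one created per run.
    PCollection<PubsubMessage> messages =
        pipeline.apply(
            "ReadPubSubSubscription",
            PubsubIO.readMessagesWithAttributes()
                .fromSubscription(options.getInputSubscription()));

    // ...transform the messages to TableRows and write to BigQuery as in the original template...

    pipeline.run();
  }
}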
Upvotes: 2