JonathanW

Reputation: 141

Dataflow Template Cloud Pub/Sub Topic vs Subscription to BigQuery

I'm setting up a simple Proof of Concept to learn some of the concepts in Google Cloud, specifically PubSub and Dataflow.

I have a PubSub topic greeting

I've created a simple cloud function that publishes a message to that topic:

const escapeHtml = require('escape-html');
const { Buffer } = require('safe-buffer');
const { PubSub } = require('@google-cloud/pubsub');

exports.publishGreetingHTTP = async (req, res) => {
    let name = 'no name provided';
    if (req.query && req.query.name) {
        name = escapeHtml(req.query.name);
    } else if (req.body && req.body.name) {
        name = escapeHtml(req.body.name);
    }
    const pubsub = new PubSub();
    const topicName = 'greeting';
    const data = JSON.stringify({ hello: name });
    const dataBuffer = Buffer.from(data);
    const messageId = await pubsub.topic(topicName).publish(dataBuffer);
    res.send(`Message ${messageId} published. name=${name}`);
};

I set up a different cloud function that is triggered by the topic:

const { Buffer } = require('safe-buffer');

exports.subscribeGreetingPubSub = (data) => {
    const pubSubMessage = data;
    const passedData = pubSubMessage.data ? JSON.parse(Buffer.from(pubSubMessage.data, 'base64').toString()) : { error: 'no data' };

    console.log(passedData);
};

This works great and I see it registered as a subscription on the topic.

Now I want to use Dataflow to send the data to BigQuery.

There appear to be 2 templates to accomplish this:

- Pub/Sub Topic to BigQuery
- Pub/Sub Subscription to BigQuery

I don't understand the difference between Topic and Subscription in this context.

https://medium.com/google-cloud/new-updates-to-pub-sub-to-bigquery-templates-7844444e6068 sheds a little light on it:

Note that a caveat of using subscriptions over topics is that subscriptions are only read once while topics can be read multiple times. Therefore a subscription template cannot support multiple concurrent pipelines reading the same subscription.

But I must say I'm still lost to understand the real implications of this.

Upvotes: 8

Views: 2552

Answers (2)

Matthias

Reputation: 5764

Just a side note. As mentioned by @Lauren, the PubSub_To_BigQuery template creates a subscription behind the scenes, which I would call a temporary subscription. That subscription gets deleted once your Dataflow job fails for any reason or is stopped. In that case, all Pub/Sub messages that are processed by other subscriptions (and consumers) in the meantime are lost to your pipeline, since those messages are acknowledged on the other subscription and potentially removed from the topic before your Dataflow job has been fixed. Therefore, I recommend using the PubSub_Subscription_To_BigQuery template, which allows you to fine-tune the error handling. Or use the newer feature, Pub/Sub BigQuery subscriptions.

https://cloud.google.com/pubsub/docs/bigquery
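
With a BigQuery subscription, Pub/Sub writes directly to the table with no Dataflow job at all. A minimal sketch with gcloud, assuming a project `PROJECT_ID` and an existing BigQuery table `dataset.greetings` with a compatible schema (the subscription name `greeting-bq-direct` is just illustrative):

```shell
# Create a BigQuery subscription: Pub/Sub delivers messages from the
# "greeting" topic straight into the table, no pipeline needed.
gcloud pubsub subscriptions create greeting-bq-direct \
    --topic=greeting \
    --bigquery-table=PROJECT_ID:dataset.greetings
```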

Alternatively, you may configure a retention policy on the Pub/Sub topic. In that case all messages will be retained in the topic, and once your Dataflow job with its (temporary) subscription is online again, it can replay (re-deliver) the messages.

https://cloud.google.com/blog/products/data-analytics/pubsub-gains-topic-retention-feature https://cloud.google.com/pubsub/docs/replay-overview
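
Sketched with gcloud (the 7-day duration is an arbitrary example, and `greeting-bq-sub` is a placeholder subscription name):

```shell
# Retain published messages on the topic for 7 days,
# independently of any subscription's lifetime.
gcloud pubsub topics update greeting \
    --message-retention-duration=7d

# Once the pipeline's subscription exists again, seek it back to a
# timestamp within the retention window to replay messages.
gcloud pubsub subscriptions seek greeting-bq-sub \
    --time=2024-01-01T00:00:00Z
```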

Upvotes: 0

Lauren

Reputation: 999

If you use the Topic to BigQuery template, Dataflow will create a subscription behind the scenes for you that reads from the specified topic. If you use the Subscription to BigQuery template, you will need to provide your own subscription.

You can use Subscription to BigQuery templates to emulate the behavior of a Topic to BigQuery template by creating multiple subscription-connected BigQuery pipelines reading from the same topic.

For new deployments, using the Subscription to BigQuery template is preferred. If you stop and restart a pipeline using the Topic to BigQuery template, a new subscription will be created, which may cause you to miss some messages that were published while the pipeline was down. The Subscription to BigQuery template doesn't have this disadvantage, since it uses the same subscription even after the pipeline is restarted.
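
The manual steps for the Subscription to BigQuery route might look like this with gcloud (the subscription name, job name, region, and table are placeholders, and the `gs://dataflow-templates` path assumes the classic template):

```shell
# Create a durable subscription that survives pipeline restarts.
gcloud pubsub subscriptions create greeting-bq-sub --topic=greeting

# Run the Subscription to BigQuery template against that subscription.
gcloud dataflow jobs run greeting-to-bq \
    --region=us-central1 \
    --gcs-location=gs://dataflow-templates/latest/PubSub_Subscription_to_BigQuery \
    --parameters=inputSubscription=projects/PROJECT_ID/subscriptions/greeting-bq-sub,outputTableSpec=PROJECT_ID:dataset.greetings
```

If this job is stopped and restarted, it resumes from the same subscription, so messages published while it was down are still delivered.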

Upvotes: 15
