Reputation: 445
I'm writing a Python script that generates N distinct messages and publishes them to a topic on Google Cloud Pub/Sub using BatchSettings(max_messages=1_000, max_bytes=1_000_000, max_latency=1). Each message has a unique identifier, either a uuid.uuid4 or an integer from an increasing sequence.
This topic has a BigQuery Pub/Sub subscription that writes messages to a table with a given Pub/Sub schema.
When I start pushing messages at a rate of 800,000 per 20 seconds (40,000/s), I observe duplicate rows in the BQ table. Even at a rate of 10,000/s, duplication is observed.
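Roughly, the publishing side looks like this (project and topic names are placeholders, and the payload is simplified):

```python
from google.cloud import pubsub_v1
from google.cloud.pubsub_v1.types import BatchSettings
import json
import uuid

# Batch settings as described above.
batch_settings = BatchSettings(
    max_messages=1_000,   # send the batch once 1,000 messages are queued...
    max_bytes=1_000_000,  # ...or once it reaches ~1 MB...
    max_latency=1,        # ...or after 1 second, whichever comes first
)

publisher = pubsub_v1.PublisherClient(batch_settings=batch_settings)
topic_path = publisher.topic_path("my-project", "my-topic")  # placeholder names

N = 800_000
futures = []
for _ in range(N):
    payload = {"id": str(uuid.uuid4())}  # unique identifier per message
    futures.append(publisher.publish(topic_path, json.dumps(payload).encode("utf-8")))

# Wait for all publishes to complete.
for f in futures:
    f.result()
```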
The BQ Pub/Sub subscription has the following settings:
Use topic schema : Enabled
Write metadata : Disabled
Drop unknown fields : Disabled
Subscription expiration : Subscription will never expire.
Acknowledgement deadline : 600 seconds
Subscription message retention duration : 7 days
Retain acknowledged messages : No
Exactly once delivery : Disabled (cannot be enabled)
Message ordering: Disabled
Dead lettering: Disabled
Retry policy : Retry after exponential backoff delay
Minimum backoff duration : 60 seconds
Maximum backoff duration : 600 seconds
Is this to be expected? What can I do in order to avoid ending up with duplicate rows?
Thank you for your time.
Upvotes: 1
Views: 1135
Reputation: 17171
Some level of duplication is expected as Pub/Sub has at-least-once delivery semantics, including for BigQuery subscriptions. Typically, we expect the duplicate rate to be around 0.1%.
The first question with regard to duplicates is: are they generated on the publish side or on the subscribe side (when writing from Pub/Sub to BigQuery)? To answer this, enable the write_metadata property on your subscription. Then, compare the uniqueness of the message_ids to the uniqueness of the identifier you put in the messages. If you have multiple messages with different message_ids but the same identifier, then the duplicates are coming from the publish side. If you have messages with the same message_id, then the duplicates are coming from the subscribe side and fit within the expectation of the at-least-once delivery guarantees provided by Pub/Sub.
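For example, once write_metadata is enabled, a check along these lines distinguishes the two cases (table and column names such as `my_dataset.my_table` and `id` are assumptions; adjust them to your schema):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Group rows by the identifier you put in the message and compare it with the
# Pub/Sub message_id column that write_metadata adds to the table.
query = """
SELECT
  id,
  COUNT(*) AS row_count,
  COUNT(DISTINCT message_id) AS distinct_message_ids
FROM `my_project.my_dataset.my_table`
GROUP BY id
HAVING COUNT(*) > 1
"""

for row in client.query(query).result():
    if row.distinct_message_ids > 1:
        print(f"id={row.id}: published more than once (publish-side duplicate)")
    else:
        print(f"id={row.id}: same message written more than once (subscribe-side duplicate)")
```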
If the duplicates are generated on the publish side, then it's possible your publish load is overloading the client in some way, resulting in many deadline-exceeded errors, which in turn cause publish retries. You could consider taking advantage of publisher flow control. By setting PublishFlowControl.message_limit, you can restrict the number of messages awaiting a publish response at any one time, which limits the resources needed for publishing and makes it less likely that you will run up against resource constraints that could lead to increased error and duplicate rates.
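A minimal sketch of enabling publisher flow control in the Python client (the limits are illustrative; tune them for your workload):

```python
from google.cloud import pubsub_v1
from google.cloud.pubsub_v1.types import (
    PublisherOptions,
    PublishFlowControl,
    LimitExceededBehavior,
)

# Block publish() calls once too many messages are awaiting a response,
# instead of letting outstanding publishes pile up and time out.
publisher = pubsub_v1.PublisherClient(
    publisher_options=PublisherOptions(
        flow_control=PublishFlowControl(
            message_limit=1_000,        # max messages awaiting a publish response
            byte_limit=10 * 1_000_000,  # max outstanding bytes
            limit_exceeded_behavior=LimitExceededBehavior.BLOCK,
        )
    )
)
```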
Upvotes: 2