Avi Glatstein

Reputation: 21

Google PubSub message duplication

I am using the Python client (which comes as part of google-cloud 0.30.0) to process messages. Some of my messages (about 10%) are being duplicated: I get the same message again and again, up to 50 instances within a few hours. My subscription is set up with a 600-second ack deadline, but a message may be resent a minute after its predecessor.

While running, I occasionally get 503 errors (which I log with my policy_class). Has anybody experienced this behavior? Any ideas?

My code looks like this:

from google.cloud import pubsub_v1

def callback_func(msg):
    try:
        log.info('got %s', msg.data)
        ...
    finally:
        # Always ack, even if processing raised.
        msg.ack()

c = pubsub_v1.SubscriberClient(policy_class)
subscription = c.subscribe(c.subscription_path(my_proj, my_topic))
res = subscription.open(callback=callback_func)
res.result()  # block while messages are delivered

Upvotes: 1

Views: 1864

Answers (3)

Kamal Aboul-Hosn

Reputation: 17161

In general, duplicates can happen given that Google Cloud Pub/Sub offers at-least-once delivery. Typically, this rate should be very low. A rate of 10% would be very high. In this particular instance, it was likely an issue in the client libraries that resulted in excessive duplicates, which was fixed in April 2018.

For the general case of excessive duplicates there are a few things to check to determine if the problem is on the user side or not. There are two places where duplication can happen: on the publish side (where there are two distinct messages that are each delivered once) or on the subscribe side (where there is a single message delivered multiple times). The way to distinguish the cases is to look at the messageID provided with the message. If the same ID is repeated, then the duplication is on the subscribe side. If the IDs are unique, then duplication is happening on the publish side. In the latter case, one should look at the publisher to see if it is getting errors that are resulting in publish retries.
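As a rough illustration of that check, the sketch below groups logged deliveries by payload and looks at the message IDs seen for each one. It assumes you have recorded (message ID, payload) pairs somewhere, and that an identical payload means the same logical message; the function name and the sample data are hypothetical.

```python
from collections import defaultdict


def classify_duplicates(deliveries):
    """Split repeated payloads into subscribe-side and publish-side duplicates.

    `deliveries` is an iterable of (message_id, payload) pairs in the
    order they were received. Assumption: equal payloads represent the
    same logical message.
    """
    ids_by_payload = defaultdict(list)
    for message_id, payload in deliveries:
        ids_by_payload[payload].append(message_id)

    subscribe_side = set()  # same message ID delivered more than once
    publish_side = set()    # same payload published under several IDs
    for payload, ids in ids_by_payload.items():
        unique_ids = set(ids)
        if len(ids) > len(unique_ids):
            subscribe_side.add(payload)
        if len(unique_ids) > 1:
            publish_side.add(payload)
    return subscribe_side, publish_side


# Hypothetical log of received messages.
deliveries = [
    ("101", b"order-1"),
    ("101", b"order-1"),  # same ID again: redelivery by the subscriber side
    ("102", b"order-2"),
    ("103", b"order-2"),  # new ID, same payload: published twice
    ("104", b"order-3"),
]
subscribe_side, publish_side = classify_duplicates(deliveries)
print(subscribe_side)  # {b'order-1'}
print(publish_side)    # {b'order-2'}
```

A payload can land in both sets if it was published twice and one of those copies was also redelivered.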

If the issue is on the subscriber side, then one should check that messages are being acknowledged before the ack deadline. Messages that are not acknowledged within this time will be redelivered. If this is the issue, then the solution is either to acknowledge messages faster (perhaps by scaling up with more subscribers for the subscription) or to increase the acknowledgement deadline. For the Python client library, one sets the acknowledgement deadline by setting the max_lease_duration in the FlowControl object passed into the subscribe method.
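A minimal sketch of passing max_lease_duration, assuming a recent google-cloud-pubsub release where subscribe() takes the callback directly (the older API shown in the question is structured differently). The function name, its parameters, and the 1800-second default are placeholders chosen for illustration.

```python
def subscribe_with_long_lease(project_id, subscription_id, callback,
                              max_lease_seconds=1800):
    """Start a streaming pull whose lease can be extended up to max_lease_seconds."""
    # Imported here so the sketch can be loaded without the
    # google-cloud-pubsub package installed.
    from google.cloud import pubsub_v1

    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path(project_id, subscription_id)

    # max_lease_duration caps how long the client keeps extending the
    # ack deadline of a message that is still being processed.
    flow_control = pubsub_v1.types.FlowControl(
        max_lease_duration=max_lease_seconds
    )

    # Returns a future; call .result() on it to block while messages arrive.
    return subscriber.subscribe(
        subscription_path, callback=callback, flow_control=flow_control
    )
```

Calling it needs valid credentials and an existing subscription, for example `subscribe_with_long_lease("my-project", "my-subscription", callback_func).result()`, where both names are placeholders.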

Upvotes: 2

Avi Glatstein

Reputation: 21

This seems to be an issue with the google-cloud-pubsub Python client. I upgraded to version 0.29.4 and ack() works as expected.

Upvotes: 1

Max Dietz

Reputation: 178

The client library you are using uses a new Pub/Sub API for subscribing called StreamingPull. One effect of this is that the subscription deadline you have set is no longer used; the client library uses a deadline it calculates itself instead. The client library also automatically extends the deadlines of messages for you.

When you get these duplicate messages - have you already ack'd the message when it is redelivered, or is this while you are still processing it? If you have already ack'd, are there some messages you have avoided acking? Some messages may be duplicated if they were ack'd but messages in the same batch needed to be sent again.

Also keep in mind that some duplicates are expected currently if you take over a half hour to process a message.
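Since some duplicates are expected, one option is to make the callback tolerant of them. The sketch below is one possible approach, not something built into the client library: it remembers message IDs in memory and acks a repeat without processing it again. DedupingCallback and FakeMessage are hypothetical names; the only parts taken from the real Message object are message_id, data and ack().

```python
import threading


class DedupingCallback:
    """Wrap a handler so a redelivered message is acked without being reprocessed."""

    def __init__(self, handler):
        self._handler = handler
        self._seen_ids = set()  # in-memory only and unbounded: a sketch, not production code
        self._lock = threading.Lock()  # callbacks may run on several threads

    def __call__(self, message):
        with self._lock:
            already_seen = message.message_id in self._seen_ids
        if already_seen:
            message.ack()  # handled before: just acknowledge again
            return
        self._handler(message)
        with self._lock:
            self._seen_ids.add(message.message_id)
        message.ack()


class FakeMessage:
    """Stand-in for a received message, used only for this demonstration."""

    def __init__(self, message_id, data):
        self.message_id = message_id
        self.data = data
        self.acked = False

    def ack(self):
        self.acked = True


processed = []
callback = DedupingCallback(lambda m: processed.append(m.data))
messages = [FakeMessage("1", b"a"), FakeMessage("1", b"a"), FakeMessage("2", b"b")]
for m in messages:
    callback(m)
print(processed)  # [b'a', b'b']
```

This only catches repeats seen by the same process, and two copies arriving at the same instant could still both be processed, so the handler itself should ideally be idempotent as well.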

Upvotes: 1
