Wilhelm Uschtrin

Reputation: 337

Why does our GCP Cloud Function receive so many duplicates from its PubSub subscription?

Nutshell

There is a Google Cloud PubSub topic to which we publish about 2 million tiny/small messages. We set up a Cloud Function to process the messages coming through this topic. When inspecting the logs, we see that many of these messages are processed multiple times. In total, the number of messages coming out is 150-200% of the number that went in (see screenshot at the end).
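For reference, the publishing side looks conceptually like this. This is only a minimal sketch using the google-cloud-pubsub Python client; the project/topic names are placeholders and not our actual setup:

    # Minimal publisher sketch (illustrative; names are placeholders).
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "my-topic")

    for i in range(2_000_000):
        # Each message is tiny; the client batches publishes under the hood.
        future = publisher.publish(topic_path, f"message-{i}".encode("utf-8"))

    # Block until the last publish has been accepted by Pub/Sub.
    future.result()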

The questions are: Why is this happening? And how do we configure things properly to get fewer duplicates?

Additional information

Update 15.03.22

So I wanted to be absolutely certain this is "true" duplication, by which I mean Pub/Sub delivering messages/events multiple times, rather than duplication clumsily introduced by us somewhere. So I modified our function code to form a Datastore key from our external message ID and context.messageId and write it to Datastore with a counter; if the key already exists, the counter is incremented. Right afterwards I log the entity (a sketch of this logic follows the table below). Here are the stats for 458,940 executions:

|---------------|-------------|
| Counter value | Log entries |
|---------------|-------------|
|       1       |     208,733 |
|       2       |     101,040 |
|       3       |      62,965 |
|       4       |      37,156 |
|       5       |      20,583 |
|      >5       |      28,463 |
|      >15      |          20 |
|      >20      |           0 |
|---------------|-------------|
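The counting logic itself is conceptually the following (a sketch assuming a Python runtime and the google-cloud-datastore client; the "MessageLog" kind and the "external_id" payload field are illustrative names only):

    # Sketch of the dedup counter described above (names are illustrative).
    import base64
    import json

    from google.cloud import datastore

    client = datastore.Client()

    def handle_message(event, context):
        payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))

        # One key per (our external ID, Pub/Sub message ID) pair. In the
        # Python runtime the Pub/Sub message ID is exposed as
        # context.event_id; a counter > 1 therefore means the very same
        # Pub/Sub message triggered more than one execution.
        key = client.key("MessageLog", f"{payload['external_id']}:{context.event_id}")

        # Transactionally create-or-increment the counter, then log it.
        with client.transaction():
            entity = client.get(key)
            if entity is None:
                entity = datastore.Entity(key=key)
                entity["counter"] = 1
            else:
                entity["counter"] += 1
            client.put(entity)

        print(entity)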

My only (crummy) lead

The only theory I have right now is that the duplication is due to the underlying infrastructure responding with 429s because of the instance limit. See the following screenshot.

[Screenshot: log entries showing 429 warnings from the underlying infrastructure due to the instance limit]

We don't really notice these 429s anymore, because we filter the corresponding log messages out in the logging console (via log_name); the infrastructure throwing lots of warnings and errors seems to be expected behaviour. That is also why I am skeptical of this being the reason for the duplication. On the other hand, it does look like the infrastructure could send NACKs in these cases.

Then again, it feels like the expected behaviour would be that the delivery of the original message fails with a 429 (and therefore does not show up in our logs) and the message is then re-delivered, possibly more than once. That is not what we are observing: our logs do show the duplicate executions themselves.

So I am not sure this is a promising lead at all, but it's all I got right now.

Other Resources

I feel like what we are observing sounds similar to what is described in this question and in the docs here, but those are about "StreamingPull" subscriptions, whereas the GCF-managed subscriptions seem to be (special?) "Push" subscriptions. I was excited at first, because it sounded so similar, but it seems it's not applicable.
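The subscription type can at least be verified programmatically; here is a sketch using the google-cloud-pubsub Python client (project and subscription names are placeholders):

    # Inspect the GCF-managed subscription (names are placeholders).
    from google.cloud import pubsub_v1

    subscriber = pubsub_v1.SubscriberClient()
    sub_name = "projects/my-project/subscriptions/my-gcf-subscription"
    sub = subscriber.get_subscription(subscription=sub_name)

    # A non-empty push endpoint confirms a push subscription, i.e. the
    # StreamingPull redelivery mechanics described there should not apply.
    print("push endpoint:", sub.push_config.push_endpoint or "<none: pull>")
    print("ack deadline:", sub.ack_deadline_seconds, "seconds")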

Screenshots

Screenshot showing message duplication between topics:

[Screenshot: incoming vs. outgoing message counts]

Subscription config (managed by GCF):

[Screenshot: subscription configuration]

Upvotes: 5

Views: 1399

Answers (1)

Jose Gutierrez Paliza

Reputation: 1428

The 429 errors are happening because there are too many requests, or too much traffic, for the available instances. Also, your maximum instances setting is 90, while the suggested default is 3,000.

Additionally, you can remove the maximum instances limit, or set it to 0, if you don't want any limit at all. Alternatively, you can raise the maximum from 90 to 180, and eventually a bit higher, to resolve the 429 errors.
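For example, raising the limit could be done like this (a sketch using the google-cloud-functions v1 admin client; the function name is a placeholder, and the same change can be made in the console or via gcloud):

    # Raise the max-instances limit of a function (name is a placeholder).
    from google.cloud import functions_v1

    client = functions_v1.CloudFunctionsServiceClient()
    name = "projects/my-project/locations/us-central1/functions/my-function"

    function = client.get_function(request={"name": name})
    function.max_instances = 180  # 0 removes the limit entirely

    # update_function is a long-running operation; wait for it to finish.
    operation = client.update_function(
        request={"function": function, "update_mask": {"paths": ["max_instances"]}}
    )
    operation.result()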

Other things that can cause this duplication behavior in your subscription are listed below (a subscriber-side sketch follows the list):

  • Messages in the subscriber’s queue are not being processed and ACK'd within the deadline;

  • The client library continues to (automatically) extend the deadline hoping to be able to process these messages eventually;

  • At some point, after several deadline extensions the client library gives up, so those messages are dropped and redelivered;

  • Messages in Pub/Sub are delivered in batches, so not only the expired messages are redelivered but also the other messages belonging to the same batches;

  • The subscriber will have its queue full of duplicates, which slows down backlog consumption;

  • The problem becomes worse and worse, since the current duplicates will expire as well and generate new duplicates.
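These points describe the behavior of the StreamingPull client libraries, so they apply when the subscription is consumed with a pull client. A minimal sketch of where the relevant knobs live (Python client; the names and the process() handler are placeholders):

    # Minimal StreamingPull subscriber (names and process() are placeholders).
    from google.cloud import pubsub_v1

    subscriber = pubsub_v1.SubscriberClient()
    sub_path = subscriber.subscription_path("my-project", "my-subscription")

    # Cap how many messages are leased at once, so the client never holds
    # more than it can process and ack before the deadline expires.
    flow_control = pubsub_v1.types.FlowControl(max_messages=100)

    def callback(message):
        process(message.data)  # hypothetical handler
        message.ack()          # ack promptly; unacked messages get redelivered

    streaming_pull = subscriber.subscribe(
        sub_path, callback=callback, flow_control=flow_control
    )
    streaming_pull.result()  # block forever; .cancel() to stop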

Upvotes: 1
