stkvtflw

Reputation: 13587

Why am I getting 50% of GCP Pub/Sub messages duplicated?

I'm running an analytics pipeline.

Here is my topic and the subscription:

gcloud pubsub topics create pipeline-input

gcloud beta pubsub subscriptions create pipeline-input-sub \
    --topic pipeline-input \
    --ack-deadline 600 \
    --expiration-period never \
    --dead-letter-topic dead-letter

Here is how I pull messages:

import { PubSub, Message } from '@google-cloud/pubsub'

const pubSubClient = new PubSub()

const queue: Message[] = []

const populateQueue = async () => {
  const subscription = pubSubClient.subscription('pipeline-input-sub', {
    flowControl: {
      maxMessages: 5
    }
  })
  const messageHandler = async (message: Message) => {
    queue.push(message)
  }
  subscription.on('message', messageHandler)
}

const processQueueMessage = () => {
  const message = queue.shift()
  // guard against an empty queue: shift() returns undefined
  if (!message) {
    setTimeout(processQueueMessage, 100)
    return
  }
  try {
    ...
    message.ack()
  } catch {
    ...
    message.nack()
  }
  // schedule the next iteration instead of recursing directly,
  // so the call stack cannot grow without bound
  setImmediate(processQueueMessage)
}

processQueueMessage()

Processing time is ~7 seconds.

Here is one of the many similar duplicate cases: the same message was delivered 5 (!!!) times to different GCE instances.

All 5 times the message was successfully processed and .ack()ed. The output includes 50% more messages than the input! I'm well aware of the "at least once" behavior, but I thought it may duplicate like 0.01% of messages, not 50% of them.

The topic input is 100% free of duplicates. I verified both the topic input method AND the number of unacked messages through Cloud Monitoring. The numbers match: there are no duplicates in the Pub/Sub topic.

UPDATE:

  1. It looks like all those duplicates were created due to ack deadline expiration. I'm 100% sure that I'm acknowledging 99.9% of messages before the 600-second deadline.

Upvotes: 3

Views: 3733

Answers (2)

Sneha Mule

Reputation: 715

GCP Pub/Sub guarantees at-least-once message delivery. Link: https://cloud.google.com/pubsub/docs/exactly-once-delivery

So if the client (which can be a Dataflow pipeline) does not acknowledge a message within the acknowledgement deadline, Pub/Sub will resend the message.

You can set the acknowledgement deadline when creating the subscription. The default acknowledgement deadline is 10 seconds; the maximum is 10 minutes.
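
For reference, the deadline can also be changed on an existing subscription from the Node client used in the question. This is a minimal sketch; it assumes your client version exposes setMetadata accepting an ackDeadlineSeconds field:

import { PubSub } from '@google-cloud/pubsub'

const pubSubClient = new PubSub()

// Raise the acknowledgement deadline on the existing subscription
// to the 10-minute maximum mentioned above.
async function raiseAckDeadline() {
  await pubSubClient
    .subscription('pipeline-input-sub')
    .setMetadata({ ackDeadlineSeconds: 600 })
}

raiseAckDeadline().catch(console.error)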

Upvotes: 0

Kamal Aboul-Hosn

Reputation: 17261

Some duplicates are expected, though a 50% duplicate rate is definitely high. The first question is, are these publish-side duplicates or subscribe-side duplicates? The former are created when a publish of the same message is retried, resulting in multiple publishes of the same message. These messages will have different message IDs. The latter is caused by redeliveries of the same message to the subscriber. These messages have the same message ID (though different ack IDs).
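
One way to confirm which kind you are seeing is to count deliveries per message ID on the subscriber side. A diagnostic sketch based on the question's setup (seenCounts is a hypothetical name, not part of the library):

import { PubSub, Message } from '@google-cloud/pubsub'

const pubSubClient = new PubSub()
const seenCounts = new Map<string, number>()

const subscription = pubSubClient.subscription('pipeline-input-sub')

subscription.on('message', (message: Message) => {
  const count = (seenCounts.get(message.id) ?? 0) + 1
  seenCounts.set(message.id, count)
  if (count > 1) {
    // same message ID but a different ack ID => subscribe-side redelivery
    console.log(`redelivery #${count}: id=${message.id} ackId=${message.ackId}`)
  }
  message.ack()
})

Since the question reports deliveries spread across five GCE instances, a per-process map like this only catches local redeliveries; the log lines would need to be aggregated (for example in Cloud Logging) to count cross-instance duplicates.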

It sounds like you have verified that these are subscribe-side duplicates. Therefore, the likely cause, as you mention, is an expired ack deadline. The question is, why are the messages exceeding the ack deadline? One thing to note is that when using the client library, the ack deadline set in the subscription is not the one used. Instead, the client library tries to optimize ack deadlines based on client library settings and the 99th percentile ack latency. It then renews leases on messages for up to the max_lease_duration specified in the FlowControl object passed into the subscribe method. This defaults to one hour.
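
max_lease_duration is the Python client's name for this setting. A rough equivalent for the Node client used in the question is sketched below; the flowControl.maxExtension option name (a value in seconds) is taken from the 1.x/2.x typings and should be treated as an assumption to verify against your installed version:

import { PubSub } from '@google-cloud/pubsub'

const pubSubClient = new PubSub()

const subscription = pubSubClient.subscription('pipeline-input-sub', {
  flowControl: {
    maxMessages: 5,
    // stop renewing leases after 10 minutes instead of the default
    // (option name assumed; the Python client calls it max_lease_duration)
    maxExtension: 600
  }
})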

Therefore, in order for messages to remain leased, it is necessary for the client library to be able to send modifyAckDeadline requests to the server. One possible cause of duplicates would be the inability of the client to send these requests, possibly due to overload on the machine. Are the machines running this pipeline doing any other work? If so, it is possible they are overloaded in terms of CPU, memory, or network and are unable to send the modifyAckDeadline requests and unable to process messages in a timely fashion.

It is also possible that message batching could be affecting your ability to ack messages. As an optimization, the Pub/Sub system stores acknowledgements for batches of messages instead of individual messages. As a result, all messages in a batch must be acknowledged in order for all of them to be acknowledged. Therefore, if you have five messages in a batch and acknowledge four of them, but then do not ack the final message, all five will be redelivered. There are some caches in place to try to minimize this, but it is still a possibility. There is a Medium post that discusses this in more detail (see the "Message Redelivery & Duplication Rate" section). It might be worth checking that all messages are acked and not nacked in your code by printing out the message ID as soon as the message is received and right before the calls to ack and nack. If your messages were published in batches, it is possible that a single nack is causing redelivery of more messages.
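
A minimal way to add the suggested logging (the wrapper names are hypothetical, not part of the library):

import { Message } from '@google-cloud/pubsub'

// Log the message ID right before each terminal call, so a redelivered
// ID can be traced back to whether it was acked or nacked earlier.
const ackLogged = (message: Message) => {
  console.log(`acking id=${message.id}`)
  message.ack()
}

const nackLogged = (message: Message) => {
  console.log(`nacking id=${message.id}`)
  message.nack()
}

Calling these in place of message.ack() / message.nack() in processQueueMessage, together with a log line when each message is first received, makes a batch-mate nack easy to spot: acked IDs that come back anyway, close in time to a nacked ID from the same publish batch.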

This coupling between batching and duplicates is something we are actively working on improving. I would expect this issue to stop at some point. In the meantime, if you have control over the publisher, you could set the max_messages property in the batch settings to 1 to prevent the batching of messages.
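
If the publisher also uses the Node client, capping the batch size looks roughly like this (a sketch; publishOne is a hypothetical helper, and older client versions use topic.publish(Buffer) instead of publishMessage):

import { PubSub } from '@google-cloud/pubsub'

const pubSubClient = new PubSub()

// With maxMessages: 1 every message travels in its own batch, so a nack
// can no longer cause redelivery of its batch-mates.
const topic = pubSubClient.topic('pipeline-input', {
  batching: { maxMessages: 1 }
})

async function publishOne(payload: object) {
  await topic.publishMessage({ data: Buffer.from(JSON.stringify(payload)) })
}

The trade-off is publisher throughput, since batching exists to amortize per-request overhead.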

If none of that helps, it would be best to open up a support case and provide the project name, subscription name, and message IDs of some duplicated messages. Engineers can investigate in more detail why individual messages are getting redelivered.

Upvotes: 3
