Reputation: 7802
I'm analyzing a recent load event on my SQS consumer service and am stuck on some SQS CloudWatch metrics that don't make sense to me. Essentially, it looks like the queue was filling up with messages that aren't accounted for in the metrics. Let me start by summarizing the data for a selected 5-minute period:
What baffles me is that ApproximateNumberOfMessagesVisible gained about 17k, nearly three times the number of messages that went unprocessed (NumberOfMessagesSent - NumberOfMessagesDeleted = ~6k).
I've included metrics for the number of invisible messages as well (just in case a bunch of invisible messages suddenly became visible), but that doesn't seem to be the case.
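A rough way to state the bookkeeping: assuming messages leave the queue only by being deleted (no retention expiry, no dead-letter moves), every message is either visible or in flight, so the change in visible plus the change in not-visible should roughly equal sent minus deleted. Below is a minimal sketch of that check. The function name is hypothetical, and the zero change in in-flight messages is a placeholder assumption rather than a measured value.

```python
def unexplained_visible_gain(sent_minus_deleted, delta_visible, delta_not_visible=0):
    """Return the part of the queue growth that sends and deletes do not explain.

    Assumes messages only leave the queue by being deleted, so
    (visible + not visible) should change by (sent - deleted).
    """
    return delta_visible + delta_not_visible - sent_minus_deleted


# Rounded figures from this question: ~6k expected backlog growth, ~17k observed.
# delta_not_visible is left at 0 as a placeholder assumption.
print(unexplained_visible_gain(6_000, 17_000))  # 11000
```

Since the SQS counters are approximate, a small nonzero result would not be surprising; a result on the order of the backlog itself is what needs explaining.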
How could this be possible?
Upvotes: 2
Views: 952
Reputation: 178956
How does a message become visible?
By being sent to the queue.
By being returned to visible status: the message was received but not deleted before its visibility timeout expired.
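The two paths above can be illustrated with a toy in-memory model. This is only a sketch of the visibility-timeout behavior, not the SQS API; the class and its method names are hypothetical.

```python
class ToyQueue:
    """In-memory toy model of a visibility timeout; not the SQS API."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.visible = []       # message ids available to receive
        self.in_flight = {}     # message id -> time it becomes visible again
        self.received_count = 0
        self.deleted_count = 0

    def send(self, msg_id):
        # Path 1: a sent message is immediately visible.
        self.visible.append(msg_id)

    def receive(self, now):
        self._expire(now)
        if not self.visible:
            return None
        msg_id = self.visible.pop(0)
        self.in_flight[msg_id] = now + self.visibility_timeout
        self.received_count += 1
        return msg_id

    def delete(self, msg_id):
        # Only an in-flight message can be deleted, and only once.
        if self.in_flight.pop(msg_id, None) is not None:
            self.deleted_count += 1

    def _expire(self, now):
        # Path 2: received-but-undeleted messages return to visible.
        for msg_id, deadline in list(self.in_flight.items()):
            if deadline <= now:
                del self.in_flight[msg_id]
                self.visible.append(msg_id)

    def visible_count(self, now):
        self._expire(now)
        return len(self.visible)


q = ToyQueue(visibility_timeout=30)
q.send("m1")
q.receive(now=0)                          # consumer takes m1 but never deletes it
print(q.visible_count(now=10))            # 0
print(q.visible_count(now=30))            # 1
q.delete(q.receive(now=31))               # second delivery is processed properly
print(q.received_count, q.deleted_count)  # 2 1
```

One message sent, two receives, one delete: the visible count rose a second time without any new send.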
There isn't enough history provided here to conclusively state that SQS's counters are right or wrong, but consider this suggestion from an old comment of mine on Why do SQS Messages Sometimes Remain In-Flight on a Queue:
In CloudWatch, select both the graph for NumberOfMessagesReceived and NumberOfMessagesDeleted. You should find that one graph perfectly overlays and completely masks the other; if to some extent they don't, it strongly suggests a problem in the library that you are using or in your consumers, which would cause the symptoms you observe.
You can delete a single message from a queue only once, but you can receive a single message multiple times before that occurs, if you have a process that is dropping messages on the floor, accidentally or deliberately. They again become visible and SQS will redeliver them after the visibility timeout expires. If this is happening, the two metrics mentioned above will not line up perfectly over time.
Otherwise, they should -- as should the stats you are seeing.
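The same comparison can be done programmatically instead of by eye. The sketch below assumes boto3 with working AWS credentials for the fetch step; the helper names and the sample numbers are hypothetical and are not taken from the question's data.

```python
from datetime import datetime, timedelta, timezone


def fetch_sums(queue_name, metric_name, start, end, period=300):
    """Fetch per-period Sum datapoints for one AWS/SQS metric, keyed by timestamp."""
    import boto3  # imported lazily so the comparison helper works without AWS access

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/SQS",
        MetricName=metric_name,
        Dimensions=[{"Name": "QueueName", "Value": queue_name}],
        StartTime=start,
        EndTime=end,
        Period=period,
        Statistics=["Sum"],
    )
    return {point["Timestamp"]: point["Sum"] for point in response["Datapoints"]}


def receive_delete_gaps(received, deleted):
    """Return {timestamp: received - deleted} for every period where they differ."""
    gaps = {}
    for ts in sorted(set(received) | set(deleted)):
        gap = received.get(ts, 0) - deleted.get(ts, 0)
        if gap != 0:
            gaps[ts] = gap
    return gaps


# Made-up sample values, for illustration only.
t0 = datetime(2020, 1, 1, tzinfo=timezone.utc)
t1 = t0 + timedelta(minutes=5)
sample_received = {t0: 1000.0, t1: 1500.0}
sample_deleted = {t0: 1000.0, t1: 900.0}
print(receive_delete_gaps(sample_received, sample_deleted))
```

With real data, the inputs would come from calling `fetch_sums` once for NumberOfMessagesReceived and once for NumberOfMessagesDeleted. Small gaps that cancel out between adjacent periods are expected, since a message can be received in one period and deleted in the next; a gap that stays positive is the signal that messages are being received more than once.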
So, you're right, it doesn't make sense, if your workers are all behaving perfectly and processing and deleting each message on the first attempt.
Note that if you use the AWS console to inspect messages, the two counters I mentioned will not line up cleanly, because the console receives messages and then resets their visibility timeout, just like a normal consumer might, so this will artificially inflate the receive counters compared to the delete counters.
Upvotes: 3