prosto.vint

Reputation: 1505

Periodically lost messages in AWS SNS

I know it sounds strange, but something is wrong with my AWS SNS =)

I have a Lambda function that sends messages to AWS SNS. I also have several SQS queues subscribed to that SNS topic, plus dead-letter queues for both SNS and SQS, and delivery/failure logging turned on (100% sampling) for SNS.

In most cases, my architecture works as expected - Lambda sends messages to SNS:

  1. I see a successful response from SNS in the Lambda logs (boto3 SNS client)
  2. I see a successful delivery entry in the SNS logs
  3. I can retrieve my message from SQS

But sometimes something goes wrong between Lambda and SNS, because:

  1. I see successful response in Lambda, something like:
    {'MessageId': '292af724-XXXc49658c0', 'SequenceNumber': '10000000000000000551',
    'ResponseMetadata': {'RequestId': 'ba126582-XXX8f2', 
    'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'ba126582-XXX18f2', 
    'content-type': 'text/xml', 'content-length': '352', 
    'date': 'Thu, 29 Apr 2021 13:00:28 GMT'}, 'RetryAttempts': 0}}
  2. That is all I have :( No errors in the SNS logs (permissions are fine - I have previously seen both failure and successful-delivery log entries there). No messages in the SNS/SQS DLQs. Nothing :(

So, my question is: how is this possible? And how can I fix it?

REMARK - I am using FIFO SNS / SQS
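For context, a minimal sketch of the parameters a boto3 FIFO publish needs (the topic ARN, group ID, and payload below are placeholders, not real values). The MessageDeduplicationId is the field that decides whether SNS silently deduplicates a message:

```python
import hashlib
import json

def build_fifo_publish_kwargs(topic_arn, payload, group_id):
    """Build kwargs for sns_client.publish() against a FIFO topic.

    FIFO topics require MessageGroupId; MessageDeduplicationId is
    required unless ContentBasedDeduplication is enabled on the topic.
    """
    body = json.dumps(payload, sort_keys=True)
    return {
        "TopicArn": topic_arn,
        "Message": body,
        "MessageGroupId": group_id,
        "MessageDeduplicationId": hashlib.sha256(body.encode("utf-8")).hexdigest(),
    }

# Hypothetical ARN and payload, for illustration only.
kwargs = build_fifo_publish_kwargs(
    "arn:aws:sns:eu-west-1:123456789012:example.fifo",
    {"order_id": 42},
    "orders",
)
print(kwargs["MessageGroupId"])
```

Two publishes with identical bodies produce the same deduplication ID here, which matters for the behavior discussed below.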

Upvotes: 1

Views: 3776

Answers (3)

MelancholicLooper

Reputation: 11

I know this question is a little old, but I wanted to answer just in case you or others are facing this issue and are mystified.

Anarki's and DazedAndConfused's answers both make valid points. Another angle to look at is your code. If your process flow is:

Lambda 1 > SNS > SQS > Lambda 2

Your Lambda 2 function must process the SQS Records in a loop. Even if Lambda 1 published the SNS events one at a time, several messages can be batched into a single invocation of Lambda 2.

With that said, it's very possible that you have a return statement inside your Lambda 2 loop. In that case, after processing Records[0], Lambda 2 terminates, effectively skipping all subsequent Records.

Using Python as an Example

For instance, this:

def lambda_handler(event, context):
    print('Event data is: ' + str(event))

    for record in event['Records']:
        print(record['messageId'])

        # BUG: this return sits inside the loop, so the function exits
        # after the first record and silently drops the rest of the batch.
        return {
            "statusCode": 200,
            "body": "Success!"
        }

...is a lot different than this:

def lambda_handler(event, context):
    print('Event data is: ' + str(event))

    for record in event['Records']:
        print(record['messageId'])

    # Correct: the return runs only after every record has been processed.
    return {
        "statusCode": 200,
        "body": "Success!"
    }

Having had the same confusion and questions as you, I unfortunately learned this the hard way.

Upvotes: 1

Anarki

Reputation: 111

Check whether the SQS trigger on your Lambda has a batch size of 1. If your consuming Lambda is designed to handle exactly one request at a time, several items can be popped off the queue as a group, giving the illusion that messages were lost. If your Lambda is short and fast, batching may even be desirable - you just need to be aware that it happens.

I have a slightly different setup to yours, but I thought it would be worth sharing regardless:

  1. Bulk SNS triggering Lambda.
  2. SQS for tracking pending jobs.
  3. SQS trigger.
  4. Consuming Lambda.

So technically you can fix things by changing step 3 or 4, depending on what works for you. The easiest way to change the batch size is to create a new trigger, though it can also be done via the CLI.

Lambda SQS trigger with reduced batch size.
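For reference, a hedged sketch of the CLI route via `update-event-source-mapping` (the function name and UUID below are placeholders; the real UUID comes from the first command):

```shell
# List the SQS -> Lambda event source mappings to find the UUID
# (my-consumer-lambda is a placeholder function name).
aws lambda list-event-source-mappings --function-name my-consumer-lambda

# Reduce the batch size to 1 so each invocation sees a single message.
aws lambda update-event-source-mapping \
    --uuid "replace-with-real-uuid" \
    --batch-size 1
```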

Upvotes: 0

DazedAndConfused

Reputation: 115

Is it possible you're just seeing the result of deduplication? https://docs.aws.amazon.com/sns/latest/dg/fifo-message-dedup.html

If you use the same deduplication ID, or have content-based deduplication switched on, then you won't be able to deliver the same message twice within a 5-minute window - the duplicate publish still succeeds, but the message is silently dropped.
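To make that failure mode concrete, here is a toy model of the deduplication window. This only simulates the accept/drop decision - it is not the SNS implementation - but it shows why a "lost" message can coincide with a successful publish response:

```python
import hashlib

# Toy model of the FIFO deduplication window (the real window is 5 minutes).
DEDUP_WINDOW_SECONDS = 300
_seen = {}  # deduplication ID -> time of first accepted publish

def accept(message_body, now):
    """Return True if the publish would be delivered, False if deduplicated.

    Either way, the Publish API call itself returns HTTP 200, which is
    why the Lambda logs show success even for a dropped duplicate.
    """
    dedup_id = hashlib.sha256(message_body.encode("utf-8")).hexdigest()
    first_seen = _seen.get(dedup_id)
    if first_seen is not None and now - first_seen < DEDUP_WINDOW_SECONDS:
        return False  # duplicate within the window: silently dropped
    _seen[dedup_id] = now
    return True

print(accept('{"order": 42}', now=0.0))    # True  - first delivery
print(accept('{"order": 42}', now=60.0))   # False - deduplicated
print(accept('{"order": 42}', now=400.0))  # True  - window expired
```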

SNS/SQS are so durable that randomly losing messages would be almost impossible unless you're processing billions per hour.

Upvotes: 1
