Master Po

Reputation: 1697

Parallel polling of the AWS SQS standard queue - Message processing is too slow

I have a module that polls an AWS SQS queue at specified intervals, receiving one message at a time with a ReceiveMessageRequest. Following is the method:

public static ReceiveMessageResult receiveMessageFromQueue() {

    String targetedQueueUrl = sqsClient.getQueueUrl("myAWSqueueName").getQueueUrl();
    ReceiveMessageRequest receiveMessageRequest = new ReceiveMessageRequest(targetedQueueUrl)
            .withWaitTimeSeconds(10).withMaxNumberOfMessages(1);
    return sqsClient.receiveMessage(receiveMessageRequest);
}

Once a message is received and processed, it gets deleted from the queue with a DeleteMessageRequest:

public static DeleteMessageResult deleteMessageFromQueue(String receiptHandle) {

    log.info("Deleting Message with receipt handle - [{}]", receiptHandle);
    String targetedQueueUrl = sqsClient.getQueueUrl("myAWSqueueName").getQueueUrl();
    return sqsClient.deleteMessage(new DeleteMessageRequest(targetedQueueUrl, receiptHandle));

}

I've created an executable jar file which is deployed on around 40 instances, all actively polling the queue. I can see that each of them receives messages. But in the AWS SQS console I only ever see 0, 1, 2 or 3 in the 'Messages in Flight' column. Why is that, when 40+ different consumers are receiving messages from the queue? Also, the number of messages available in the queue reduces very slowly.

Following are the configuration parameters of the queue:

Queue Type: Standard
Default Visibility Timeout: 30 seconds
Message Retention Period: 4 days
Maximum Message Size: 256 KB
Delivery Delay: 0 seconds
Receive Message Wait Time: 0 seconds
Content-Based Deduplication: N/A

Messages Available (Visible): 4,776
Messages in Flight (Not Visible): 2
Messages Delayed: 0

Why are the messages not getting processed quickly even though there are multiple consumers? Do I need to modify any of the queue parameters, or something in the receive message/delete message requests? Please advise.

UPDATE:

All the EC2 instances and the SQS queue are in the same region. The consumer (the jar file that polls the queue) runs as part of the start-up script of each EC2 instance, and it has a scheduled task that polls the queue every 12 seconds. Before I push the messages to the queue I spin up 2-3 instances (we may already have some instances running at that time; this adds to the number of receivers for the queue, capped at 50). On receiving a message, the consumer does some work (including DB operations, data analysis and calculations, report file generation, and uploading the report to S3), which takes approx. 10-12 seconds. After that is done, it deletes the message from the queue. The image below is a screenshot of the SQS metrics for the last week (from the SQS monitoring console).

[Screenshot: SQS metrics for the targeted queue for the last week]
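
For illustration, here is a minimal sketch of such a scheduled polling task, assuming a ScheduledExecutorService and that the receiveMessageFromQueue / deleteMessageFromQueue methods shown earlier are in scope; processMessage is a hypothetical placeholder for the DB/report/S3 work:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import com.amazonaws.services.sqs.model.Message;

public class QueuePollingScheduler {

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Poll the queue every 12 seconds, as described in the update above.
        scheduler.scheduleAtFixedRate(QueuePollingScheduler::pollOnce, 0, 12, TimeUnit.SECONDS);
    }

    private static void pollOnce() {
        // receiveMessageFromQueue() and deleteMessageFromQueue() are the methods shown earlier,
        // assumed to be accessible from this class.
        for (Message message : receiveMessageFromQueue().getMessages()) {
            processMessage(message);                              // approx. 10-12 seconds of work
            deleteMessageFromQueue(message.getReceiptHandle());
        }
    }

    private static void processMessage(Message message) {
        // Placeholder for the DB operations, data analysis, report generation and S3 upload.
    }
}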

Upvotes: 3

Views: 2657

Answers (1)

Krease

Reputation: 16225

I'll do the best I can with the information given. More details about your processing loop logic, region setup, and metrics (see below) would help improve this answer.

I've created an executable jar file which is deployed on around 40 instances, all actively polling the queue. I can see that each of them receives messages. But in the AWS SQS console I only ever see 0, 1, 2 or 3 in the 'Messages in Flight' column. Why is that, when 40+ different consumers are receiving messages from the queue? Also, the number of messages available in the queue reduces very slowly.

Why are the messages not getting processed quickly even though there are multiple consumers? Do I need to modify any of the queue parameters, or something in the receive message/delete message requests?

The fact that you're not seeing in-flight numbers that correspond more closely with the number of hosts you have processing messages definitely points to a problem - either your message processing is blazing fast (which doesn't seem to be the case) or your hosts aren't doing the work you think they are.

In general, fetching and deleting a single message from SQS should take on the order of a few milliseconds. Without more detail on your setup, this should get you started on troubleshooting. (Some of these steps may seem obvious, but every single one of them has been the source of real-life problems I've seen developers run into.)

  1. If you're launching a new process for each receive-process-delete, this overhead will slow you down substantially. I'll assume you're not doing this, and that each host is running a loop within a single process.
  2. Verify your processing loop isn't fatalling and restarting on you (effectively turning it into the above case).
    • I assume you've also verified your processes aren't also doing a bunch of work outside of message processing.
  3. You should generate some client-side metrics to indicate how long the SQS requests are taking on each host.
    • Cloudwatch will partly do this for you, but actual client-side metrics are always useful.
    • Recommended basic metrics: (1) receive latency, (2) process latency, (3) delete latency, (4) entire message-loop latency, (5) success/fail counters (see the metrics sketch after this list).
  4. Your EC2 instances (the hosts doing the processing) should be in the same region as the SQS queue. If you're doing cross-region calls, this will impact your latency.
    • Make sure these hosts have adequate CPU/memory resources to do the processing
    • As an optimization, I recommend using more threads per host and fewer hosts - reusing client connections and maximizing usage of your compute resources is always better (see the threaded-poller sketch after this list).
  5. Verify there wasn't some outage or ongoing issue when you were running your test
  6. Perform getQueueUrl just once for the lifetime of your app, during some initialization step. You don't need to call this repeatedly, as it'll be the same URL
    • This was actually the first thing I noticed in your code, but it's way down here because the above issues will have more impact if they are the cause.
  7. If your message processing is incredibly short (less time than it takes to retrieve and delete a message), then you will end up with your hosts spending most of their time fetching messages. Metrics on this are important too.
    • In this case, you should probably do batch fetching instead of one-at-a-time.
    • Based on the number of messages in your queue and the comment that it's going slowly, it sounds like this isn't the case.
  8. Verify all of your hosts are actually hitting the same queue (and not some beta/gamma version, or an older version you used for testing at one point).
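
To make the client-side metrics in step 3 concrete, here is a minimal sketch of timing a single receive-process-delete loop with the v1 Java SDK, like the code in the question. The queue name is taken from the question, processMessage is a hypothetical placeholder for your real work, and the queue URL is resolved once at startup as suggested in step 6:

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.DeleteMessageRequest;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
import com.amazonaws.services.sqs.model.ReceiveMessageResult;

public class TimedSqsPoller {

    private static final AmazonSQS sqsClient = AmazonSQSClientBuilder.defaultClient();
    // Resolve the queue URL once at startup instead of on every request (step 6).
    private static final String queueUrl = sqsClient.getQueueUrl("myAWSqueueName").getQueueUrl();

    public static void main(String[] args) {
        while (true) {
            long loopStart = System.currentTimeMillis();

            long t0 = System.currentTimeMillis();
            ReceiveMessageResult result = sqsClient.receiveMessage(
                    new ReceiveMessageRequest(queueUrl)
                            .withWaitTimeSeconds(10)
                            .withMaxNumberOfMessages(1));
            long receiveMs = System.currentTimeMillis() - t0;

            for (Message message : result.getMessages()) {
                long p0 = System.currentTimeMillis();
                boolean success = processMessage(message);   // placeholder for the real work
                long processMs = System.currentTimeMillis() - p0;

                long d0 = System.currentTimeMillis();
                if (success) {
                    sqsClient.deleteMessage(new DeleteMessageRequest(queueUrl, message.getReceiptHandle()));
                }
                long deleteMs = System.currentTimeMillis() - d0;

                System.out.printf("receive=%dms process=%dms delete=%dms success=%b%n",
                        receiveMs, processMs, deleteMs, success);
            }
            System.out.printf("fullLoop=%dms%n", System.currentTimeMillis() - loopStart);
        }
    }

    private static boolean processMessage(Message message) {
        // Stand-in for the DB operations, calculations, report generation and S3 upload.
        return true;
    }
}

Even these crude log-based numbers will tell you quickly whether the time is going into the SQS calls or into the processing itself; in production you'd push them to CloudWatch or whatever metrics system you use.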
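
And for the threads-per-host suggestion in step 4, here is a sketch of running several polling loops against one shared client. The thread count and queue name are assumptions to tune for your instance size, and processMessage is again a placeholder:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.DeleteMessageRequest;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

public class ThreadedSqsPoller {

    private static final AmazonSQS sqsClient = AmazonSQSClientBuilder.defaultClient();
    private static final String queueUrl = sqsClient.getQueueUrl("myAWSqueueName").getQueueUrl();

    public static void main(String[] args) {
        int pollerThreads = 8;   // size this to the CPU/memory your processing actually needs
        ExecutorService pool = Executors.newFixedThreadPool(pollerThreads);
        for (int i = 0; i < pollerThreads; i++) {
            pool.submit(ThreadedSqsPoller::pollLoop);
        }
    }

    private static void pollLoop() {
        // One message per request to match the question's processing profile; raise
        // maxNumberOfMessages (up to 10) if processing turns out to be very fast (step 7).
        ReceiveMessageRequest request = new ReceiveMessageRequest(queueUrl)
                .withWaitTimeSeconds(20)          // long polling: wait up to 20s for a message
                .withMaxNumberOfMessages(1);
        while (true) {
            for (Message message : sqsClient.receiveMessage(request).getMessages()) {
                try {
                    processMessage(message);      // placeholder for the real work
                    sqsClient.deleteMessage(new DeleteMessageRequest(queueUrl, message.getReceiptHandle()));
                } catch (Exception e) {
                    // Don't delete on failure: the message becomes visible again after the
                    // visibility timeout and can be retried by any consumer.
                    e.printStackTrace();
                }
            }
        }
    }

    private static void processMessage(Message message) {
        // Stand-in for the actual processing.
    }
}

The v1 AmazonSQS client is thread-safe, so sharing one instance keeps connections reused; the main thing to watch is that the visibility timeout (30 seconds here) comfortably covers your worst-case processing time.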

Further note:

  • The other answer suggests visibility timeout as a potential cause - this is flat-out wrong. Visibility timeout does not block the queue - it only impacts how long messages remain "in-flight" before another receiveMessageRequest can receive that message.
  • You'd consider reducing this if you wanted to try reprocessing your messages sooner in the event of errors / slow processors.

Upvotes: 4
