Reproducing the ‘SQS Trigger and Lambda Concurrency Limit’ Issue

Zac Charles
4 min read · Jun 9, 2019

This is a follow-up to my Lambda Concurrency Limits and SQS Triggers Don’t Mix Well (Sometimes) post from earlier in the year. In that post, I described an issue that happens if you configure a Lambda function with a low concurrency limit and an SQS trigger. Though I’ll try to summarise, I suggest you read that post first to get the full context.

Update, 16 January 2023: this problem has since been solved.

The issue I wrote about exists because the number of messages being read from the SQS queue is not directly connected to the concurrency limit of the Lambda function. Instead, when an SQS trigger is initially enabled, Lambda begins long-polling the queue with five parallel connections. It then adds and removes connections based on the rate at which messages are being sent to the queue. This disconnect can lead to throttling and messages being sent to your dead-letter queue.

It’s an interesting problem that isn’t easy to solve. Simply limiting the polling based on the concurrency limit would heavily impact throughput. On top of that, Lambda functions can have multiple event sources, as well as unpredictable direct invocations via the Lambda API, all competing for the same concurrency.

I’m writing this post following a comment left on my earlier post. In the comment, a reader said that I appeared to contradict Jeremy Daly’s post from December last year. Though I was fairly certain that wasn’t the case, I wanted to be sure. AWS could have quietly fixed the issue, and I’d rather avoid adding to the out-of-date information already littering the Internet.

Unfortunately, I think Jeremy just got lucky as I’ve managed to reproduce the issue. This post is intended to provide you with the code and steps to reproduce it yourself.

I’ve published a GitHub repository that contains code which can reliably reproduce the issue by going against the AWS recommendations and doing exactly what you’re not supposed to do. It consists of a Serverless service and a small script.

The Serverless service contains a single Lambda function that just sleeps for 5 seconds. Attached to it is an SQS event source with a batch size of 1. Lastly, there is the queue for the event source, along with a dead-letter queue that receives any message that isn’t deleted before its visibility timeout expires after the first receive.
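For reference, a minimal serverless.yml for this kind of setup might look roughly like the sketch below. The service, queue, and handler names are illustrative rather than copied from the repository, and the reserved concurrency of 1 reflects the low limit the test relies on.

# Sketch of a Serverless service that reproduces the issue (names are illustrative).
service: sqs-overpull-demo

provider:
  name: aws
  runtime: nodejs10.x

functions:
  sleep:
    handler: handler.sleep        # the handler just waits ~5 seconds before returning
    reservedConcurrency: 1        # the low concurrency limit that triggers the problem
    events:
      - sqs:
          batchSize: 1
          arn:
            Fn::GetAtt: [SourceQueue, Arn]

resources:
  Resources:
    SourceQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: sqs-overpull-source
        RedrivePolicy:
          deadLetterTargetArn:
            Fn::GetAtt: [DeadLetterQueue, Arn]
          maxReceiveCount: 1      # one failed receive is enough to move a message to the DLQ
    DeadLetterQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: sqs-overpull-dlq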

Deployment

To deploy, install the dependencies with npm (npm install), then run the serverless deploy command.

Once the deployment is complete, run serverless info --verbose to get the URL of the newly created queue.

Testing

Once you have the queue URL, copy it to line 14 of test.js. This script simply sends ten messages to the queue using SendMessageBatch.

You can now run the script using node test.js.
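If you’d rather not clone the repository, a roughly equivalent script can be put together with the AWS SDK for Node.js. This is a sketch rather than the exact contents of test.js, and the region and queue URL are placeholders you’ll need to replace.

// send.js — a stand-in for test.js (not the exact repository code)
const AWS = require('aws-sdk');

const sqs = new AWS.SQS({ region: 'eu-west-1' });  // use your deployment region
const queueUrl = 'https://sqs.eu-west-1.amazonaws.com/123456789012/your-queue'; // from serverless info --verbose

// SendMessageBatch accepts at most ten entries per call, so one call covers the whole test.
const entries = Array.from({ length: 10 }, (_, i) => ({
  Id: `msg-${i}`,
  MessageBody: JSON.stringify({ index: i }),
}));

sqs.sendMessageBatch({ QueueUrl: queueUrl, Entries: entries })
  .promise()
  .then(result => console.log(`Sent ${result.Successful.length} messages`))
  .catch(console.error);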

Observation

The test script just loaded the queue with messages and Lambda has probably pulled all of them off the queue. You can see this in the SQS console as messages move from Messages Available to Messages in Flight.

Messages in Flight are invisible to receivers until the visibility timeout expires.

As a subsequent experiment, try adding a loop to test.js to send 100 messages to the queue and you’ll see Lambda pick up all of them.
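Because SendMessageBatch accepts at most ten entries per call, sending 100 messages means looping over ten batches. Something like the following sketch would do it (again, the region and queue URL are placeholders):

// Variation of the test: ten batches of ten messages (100 in total).
const AWS = require('aws-sdk');

const sqs = new AWS.SQS({ region: 'eu-west-1' });
const queueUrl = 'https://sqs.eu-west-1.amazonaws.com/123456789012/your-queue';

async function sendMany(total = 100) {
  for (let offset = 0; offset < total; offset += 10) {
    const entries = Array.from({ length: 10 }, (_, i) => ({
      Id: `msg-${offset + i}`,
      MessageBody: JSON.stringify({ index: offset + i }),
    }));
    await sqs.sendMessageBatch({ QueueUrl: queueUrl, Entries: entries }).promise();
  }
}

sendMany().catch(console.error);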

After 30 seconds (the default visibility timeout), some of the messages will fall through to the dead-letter queue. The number isn’t always consistent due to latency and Lambda’s internal retries. The more messages you send, the more obvious the issue is, but 10 is usually enough.
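If you’d rather not watch the console, you can also read the dead-letter queue’s ApproximateNumberOfMessages attribute with the SDK. The DLQ URL below is a placeholder; substitute your dead-letter queue’s URL.

// Count the messages that ended up in the dead-letter queue (URL is a placeholder).
const AWS = require('aws-sdk');

const sqs = new AWS.SQS({ region: 'eu-west-1' });
const dlqUrl = 'https://sqs.eu-west-1.amazonaws.com/123456789012/your-dlq';

sqs.getQueueAttributes({ QueueUrl: dlqUrl, AttributeNames: ['ApproximateNumberOfMessages'] })
  .promise()
  .then(res => console.log(`Messages in the DLQ: ${res.Attributes.ApproximateNumberOfMessages}`))
  .catch(console.error);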

The graphs below show some of the Lambda metrics for a test in which 3 of 10 messages went to the dead-letter queue. The Invocations graph matches the 7 successfully handled messages. There were no Errors, and Concurrent Executions never went above 1, as configured. However, there were 22 Throttles, which clearly shows Lambda’s internal retries trying to compensate for the issue, one I’ve now dubbed SQS overpull.

Conclusion

SQS overpull is still happening and you should be aware of it when setting low concurrency limits. As far as I know, the advice I gave in the What to do about it section of my previous post is still the best you can do to avoid problems.

For more like this, please follow me on Medium and Twitter.
