I’ve been meaning to write about this topic for quite some time after I did an in-house presentation on it at work last year. Then earlier today, Yan Cui published a story titled DynamoDB TTL as an ad-hoc scheduling mechanism which has given me a push. It’s worth reading that first as I’m going to sort of pick up where he left off.
By the way, the title isn’t a poke at Yan, it’s a play on the saying “There is more than one way to skin a cat”.
Scheduling tasks using AWS services is a problem that many teams at Just Eat have needed to solve for a variety of scenarios. The chosen solution usually differs based on whether a task must be scheduled for seconds, hours, or days in the future. Let’s look at some examples with different requirements.
- Seconds — A restaurant’s in-store device is unreachable. If it’s not back online soon, change their status to offline.
- Hours — A customer orders tonight’s dinner on their way to work. We should perform more actions closer to that time.
- Days — A restaurant is closing for a couple of weeks over the holidays. Take them offline now and bring them back online later.
There are plenty more that require additional domain knowledge, but you probably get the point.
Let’s look at the options we have and the limitations of each. I’m going to skip CloudWatch and only briefly touch on DynamoDB TTLs.
DynamoDB TTL
This is often one of the first solutions people come up with. You should be familiar with it from reading Yan’s article, but I’ll summarise. The idea is that you add a task to a DynamoDB table and set the TTL timestamp to the time at which the task should be performed. DynamoDB then deletes the item at that time, which triggers a Lambda function via DynamoDB Streams.
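Scheduling a task this way is just a put with a Number attribute holding an epoch timestamp. A minimal sketch, assuming a table called scheduled-tasks with TTL enabled on a run_at attribute (both names are illustrative):

```python
import json
import time

def build_scheduled_task(task_id, payload, run_at_epoch):
    """Build DynamoDB PutItem parameters where the TTL attribute doubles
    as the schedule time. Table and attribute names are made up here."""
    return {
        "TableName": "scheduled-tasks",
        "Item": {
            "task_id": {"S": task_id},
            "payload": {"S": json.dumps(payload)},
            # TTL attributes must be a Number containing an epoch timestamp.
            "run_at": {"N": str(int(run_at_epoch))},
        },
    }

# Schedule a task for an hour from now.
params = build_scheduled_task("order-123", {"action": "notify"}, time.time() + 3600)
# A real call would then be: boto3.client("dynamodb").put_item(**params)
```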
It sounds great until someone points out that DynamoDB deletes items within 48 hours of expiration.
That margin of error usually causes people to look for another option. However, if you only care that a task happens after a certain time, but aren’t too picky about how long after, this can actually be a very nice solution.
Yan’s article included some timings that I believe to be flawed (sorry). The test he performed involved inserting some records into a table and tracking their expiration. However, as the documentation says, the exact duration within which an item truly gets deleted after expiration is specific to the nature of the workload and the size of the table. This means that if the table were under load due to more tasks being scheduled, or there were a lot of tasks already scheduled, the results would most likely be very different.
As of writing, you should only base your design on items expiring within 48 hours, not on these test results or any others.
A colleague raised the question of cancelling scheduled tasks by deleting records. This is certainly possible. These manual deletes will still go to the DynamoDB Stream, but you can differentiate them from TTL expirations by looking at the record’s userIdentity field: TTL deletions have a principalId of dynamodb.amazonaws.com (see DynamoDB Streams and Time To Live).
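In the stream-consuming Lambda function, that check is a couple of field comparisons. A sketch of a handler that only acts on genuine TTL expirations (the event shape is the standard DynamoDB Streams event a Lambda receives):

```python
def is_ttl_expiry(record):
    """True when a DynamoDB Streams record was produced by the TTL process
    rather than a user delete. TTL deletions carry a userIdentity of type
    "Service" with principalId "dynamodb.amazonaws.com"."""
    if record.get("eventName") != "REMOVE":
        return False
    identity = record.get("userIdentity") or {}
    return (identity.get("type") == "Service"
            and identity.get("principalId") == "dynamodb.amazonaws.com")

def handler(event, context):
    # Manual deletes mean the task was cancelled, so skip them.
    expired = [r for r in event["Records"] if is_ttl_expiry(r)]
    for record in expired:
        pass  # perform the scheduled task here
    return len(expired)
```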
SQS Delay Queues
SQS queues have an attribute called DelaySeconds that lets you specify a delay between 0 and 900 seconds (15 minutes), creating a Delay Queue. When set, any message sent to the queue will only become visible to consumers after the configured delay period.
This can be a great fit for some problems. For example, suppose you want to perform a task five minutes after a message is published to an SNS topic. You could subscribe your SQS queue to the SNS topic and set a delay of 300 seconds. Assuming you’re processing messages on the queue fast enough, you’ll retrieve messages five minutes after they’re sent.
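Creating a delay queue is a single attribute at queue-creation time. A small sketch, with a validation helper for the 0–900 range (the queue name is made up):

```python
def delay_queue_attributes(delay_seconds):
    """Queue attributes for an SQS delay queue. DelaySeconds must be
    between 0 and 900 (15 minutes), and is passed as a string."""
    if not 0 <= delay_seconds <= 900:
        raise ValueError("DelaySeconds must be between 0 and 900")
    return {"DelaySeconds": str(delay_seconds)}

params = {
    "QueueName": "five-minute-delay",  # illustrative name
    "Attributes": delay_queue_attributes(300),
}
# boto3.client("sqs").create_queue(**params), then subscribe the queue
# to the SNS topic as usual.
```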
SQS Message Timers
Delay queues are only good if the same delay should be applied to every message. If you need to vary the delay, you can set the DelaySeconds value on individual messages instead, which AWS calls Message Timers.
Message timers allow you to adjust the delay based on business logic, configuration, or to account for upstream latency.
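Per-message timers are set in the SendMessage call itself. A sketch, assuming a hypothetical queue URL and message body:

```python
def delayed_message(queue_url, body, delay_seconds):
    """Build SendMessage parameters with a per-message timer.

    The same 0-900 second range applies, and FIFO queues don't support
    per-message delays at all."""
    if not 0 <= delay_seconds <= 900:
        raise ValueError("a message timer must be between 0 and 900 seconds")
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        "DelaySeconds": delay_seconds,
    }

# The delay can vary per message based on business logic:
params = delayed_message(
    "https://sqs.eu-west-1.amazonaws.com/123456789012/tasks",  # made up
    "check-restaurant-status",
    120,
)
# boto3.client("sqs").send_message(**params)
```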
The 15-minute upper limit still applies, and FIFO queues don’t support delays on individual messages. Another thing to consider is that since the delay is set in the SendMessage request, you can’t use this when subscribing directly to an SNS topic (though you could put a Lambda function in between).
SQS Visibility Timeout
When a consumer receives a message, the message remains in the queue but is invisible for the duration of its visibility timeout, after which other consumers will be able to see the message. Ideally, the first consumer would handle and delete the message before the visibility timeout expires.
You can set a visibility timeout on a queue (the default is 30 seconds) or when receiving messages (if unset, the queue’s default is used). You can also use ChangeMessageVisibility to alter a message’s visibility after receiving it.
This behaviour can be used to our advantage. One way is to send messages to a queue with the details of a task and the time you want to perform it. Consumers of the message then look at the scheduled time and alter the message’s visibility timeout so it becomes visible again at that time.
The maximum visibility timeout is 12 hours, so a task scheduled further out than that will need to be received and re-hidden multiple times before it’s due, which wastes a lot of processing.
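The re-hiding logic can be sketched as a small helper that caps the timeout at the 12-hour ceiling; the consumer call shown in the comment is illustrative:

```python
MAX_VISIBILITY_TIMEOUT = 43_200  # the SQS ceiling: 12 hours, in seconds

def next_visibility_timeout(now_epoch, run_at_epoch):
    """How long to hide a message whose task is due at run_at_epoch.

    Returns 0 when the task is already due, meaning the consumer should
    perform it rather than re-hide it. Tasks more than 12 hours out get
    received and re-hidden several times before they're due."""
    remaining = int(run_at_epoch - now_epoch)
    if remaining <= 0:
        return 0
    return min(remaining, MAX_VISIBILITY_TIMEOUT)

# In the consumer, after receiving a message (names are illustrative):
# sqs.change_message_visibility(
#     QueueUrl=queue_url,
#     ReceiptHandle=message["ReceiptHandle"],
#     VisibilityTimeout=next_visibility_timeout(time.time(), run_at),
# )
```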
The most significant limitation is that a standard queue can have a maximum of approximately 120,000 in-flight messages (the exact ceiling depends on queue traffic and message backlog). I don’t think this can be increased, so it limits the number of tasks you can schedule. A workaround could be creating multiple queues.
Lastly, the in-flight messages metric becomes less useful, as you won’t be able to differentiate between messages waiting for their scheduled time and tasks actually being performed. You could solve this by moving messages to another queue when their scheduled time arrives.
Step Functions
Step Functions state machines can include a Wait state that pauses the execution for a specified number of seconds or until an absolute timestamp. The accuracy is within a second and executions can be paused for up to a year, providing heaps of flexibility.
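A minimal state machine for this pattern is a Wait state reading the timestamp from the execution input, followed by a Task state. A sketch of the Amazon States Language definition as a Python dict (the Lambda ARN and field names are made up):

```python
import json

definition = {
    "StartAt": "WaitUntilDue",
    "States": {
        "WaitUntilDue": {
            "Type": "Wait",
            # Take the absolute time from the execution input,
            # e.g. {"run_at": "2024-06-01T09:00:00Z"}.
            "TimestampPath": "$.run_at",
            "Next": "PerformTask",
        },
        "PerformTask": {
            "Type": "Task",
            # Hypothetical Lambda that performs (or publishes) the task.
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:perform-task",
            "End": True,
        },
    },
}
# boto3.client("stepfunctions").create_state_machine(
#     name="task-scheduler", definition=json.dumps(definition), roleArn=...)
```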
Each Step Functions state machine can have up to 1,000,000 concurrent executions, though I think this can be increased by contacting AWS.
The biggest disadvantage of this seemingly perfect approach is its relatively high cost. With Step Functions, you’re charged for transitions between states. A minimalist implementation will need a Wait state to pause the execution, and a Task state to either invoke a Lambda or publish to an SNS topic. That’s three transitions which add up to $0.000075, or $75 for a million executions. That’s not too bad but is high compared to $0.40 for a million SQS messages.
Unlike all of the SQS-based solutions, Step Functions executions can be easily stopped at any time, effectively cancelling the scheduled task.
I’ve actually built a generic wrapper around this solution at work. It’s essentially an API Gateway endpoint that takes a message body, an SNS topic name, and a timestamp. API Gateway is integrated directly with Step Functions to kick off an execution that sleeps until the timestamp, then publishes the message to the topic. Importantly, the API Gateway endpoint is protected by IAM based on the topic being published to.
I hope I’ve shown that not only are there many ways to schedule messages with AWS services but that they all have their pros and cons. There is no right or wrong answer. Just make sure you understand your particular problem, read the documentation, and design your solution within the limitations. Keep in mind that those limitations can and probably will change over time.