How to Enable/Disable a Lambda Trigger on a Schedule

This is the kind of topic that makes me think “why would we want to do that?” and “is there something else we can do instead?”. However, the topic does come up occasionally, and you’re interested enough to be reading this post, so let’s discuss and look at a couple of solutions.

I’m using ‘trigger’ as a short-hand for the event source mappings that integrate Lambda with SQS, DynamoDB Streams, Kinesis, Kafka, etc.

Why do I try to avoid this?

Let’s start with some reasons why I usually try to avoid enabling and disabling Lambda event source mappings on a schedule.

  • There isn’t a native way to do it, so any solution will add complexity and risk by requiring extra infrastructure and potentially code.
  • Disabling and enabling triggers will likely affect quality attributes such as latency and availability in unintuitive ways.
  • A pause in processing could lead to an insurmountable backlog if not properly accounted for.
  • Scheduled changes to infrastructure and processing behaviour will make the system more difficult to reason about.
  • Any bursts of activity upon re-enabling the trigger could cause more downstream problems than the ones being prevented.
  • Wanting to do something like this is often an indication that something else could be improved instead.

Why would we do this?

Of course, these are just tradeoffs and you may be happy to accept them. Let’s look at some example reasons why one may consider doing so.

Prevent a downstream component from being overloaded

For example, the bulk of your user activity happens around lunch time and any outage would be especially costly then.

Something can only be done during specific times or on specific days

Maybe trades can only be submitted during particular opening hours on weekdays when the market is open.

An activity should only happen during specific times

You only want to send invoices during the night. This is similar to the above, but more of a business rule than a strict requirement.

Batching up work

Perhaps it’s far more efficient and cheap to process items as a large batch than individually, so you let them queue up over time (longer than the 5 minutes aleady supported for sources like SQS).

Third party requirements

It’s them, not you. An external service provider has asked you not to call their API during their weekly scheduled maintenance window.

Your unique reason

Did you find this blog post because you have a reason? What is it? I’d love to know. Please comment!

Is there something else we can do instead?

Some of the above reasons may well be valid, including yours, but let’s take a look at the one I find most questionable: prevent a downstream component from being overloaded. Below are some scenarios and some potential solutions.

Shared resources

Your service shares some resource with another service. This may be the result of how a monolith was broken up into services. For example, perhaps your company has a legacy shared database.

If the services access different data, can you split the database into two? If the services read (but not write) the same data, maybe each service could have its own copy of the data. If one of the services is writing shared data, the other service could be kept up to date using event-carried state transfer (as long as it can handle eventual consistency). If they’re both writing and need strong read consistency, maybe they shouldn’t be separate services and you could reconsider your service boundaries.

Shared limits

This is especially applicable in cloud environments like AWS. A good example is Lambda concurrency where each account gets 1,000 concurrent invocations by default. This limit is shared between all functions in the account and invocations are throttled when its hit, so functions belonging to different services can intefere with each other.

Throttled invocations may be fine for some background workloads, but you probably don’t want Lambda to throttle functions responding to HTTP requests via API Gateway, for example.

In this Lambda example, you could use reserved concurrency to limit consumed capacity or to ensure there is enough concurrency available. Depending on your situation, maybe you could move services into separate AWS accounts so they’re no longer coupled to the same limits. You could also ask AWS to increase the limit in some cases.

Synchronous messaging

Services that communicate with another service using synchronous request/response messaging such as HTTP are temporally coupled because the other service must be available for the first to do its work. Moreover, the other service must have capacity for your workload as well as any other. An increase in activity from one service can easily intefere with another.

The easiest solution to this problem is to break the temporal coupling by introducing an SQS queue between the two services to act as a buffer. This assumes the communication doesn’t need to be synchronous and can handle additional latency. If that’s the case, then a buffer will let the service queue up work and respond its own time.

If synchronous messaging is actually required, you could consider sharding or partitioning the service. This isolates clients by dedicating a portion of the available resources to each. Sharding, partitioning, and messaging are large topics themselves, so I won’t go into detail.

Asynchronous messaging contention

Maybe there is already a buffer between your service and the other one. Your service does its work and sends messages that are queued up for processing. However, your messages are drowning out another service’s more important messages. This is causing significant latency on processing the other services messages during peak hours. Maybe this is resulting in stale data being served, or orders being processed too slowly.

In this example, implementing priority queueing could allow more important messages to be processed first. AWS have an interesting blog post on this titled Implementing priority queueing with Amazon DynamoDB. Another option is to use different queues for different priority levels. This can be done using SNS message filtering or EventBridge event patterns to route messages differently, or you could simply ask clients to send messages to the right queue. Another option is to triage messages in one queue by moving them to other queues assuming the triage operation is cheaper than actually doing the work. In any case, the idea is that the resources dedicated to processing high priority messages are not consumed by lower priority messages.

Short term vs long term

Just because there is something else you could do, doesn’t always mean you should do that right now. If your situation is causing problems and costing you money, you could implement a quick short term fix before working on a longer term solution.

Solution

So, you have a genuinely valid reason, you’re interested in a short-term fix, or you only started reading from here: how can you enable and disable an event source mapping on a schedule?

Let’s look at an “only do work at night” scenario where we want our event source mapping to be enabled every night between 01:00 and 06:00.

Enabling and disabling

Lambda’s UpdateEventSourceMapping API action is used to enable/disable. This action takes the UUID of an event source mapping and everything else is optional. We’re only interested in updating the Enabled property.

At 01:00 we want to call this action and set Enabled to true.
At 06:00 we want to call it again and set Enabled to false.

Scheduling

CloudWatch Events Rules and self-triggering EventBridge Rules can be used for recurring actions like we want. These two are actually the same underlying service, but EventBridge provides more features and is the preferred way to manage events, so we’ll use that.

EventBridge Rules support cron expressions with a minute-level granularity. Websites like crontab.guru are helpful for building these arcane expressions. You can also use the EventBridge console to preview the next 10 times the rule will trigger. Note that EventBridge scheulde expressions are in UTC.

We want to schedule enabling for 0 1 * * ? * (01:00 UTC everyday).
We want to schedule disabling for 0 6 * * ? * (06:00 UTC everyday).

The missing piece

EventBridge can’t call the UpdateEventSourceMapping action directly, so we need some sort of glue in between. Infrastructure automation is one of Lambda’s selling points and Lambda can be targetted by EventBridge, so it’s an obvious choice. We’ll look at that first, then an interesting alternative.

Lambda

This option involves writing a small function that calls the Lambda API to enable or disable the event source mapping.

In my proof of concept, I made a basic Node.js function that takes an event source mapping UUID and a new value for Enabled, then simply calls updateEventSourceMapping. This code is basic enough to be included inline in a CloudFormation template which simplifies deployment.

The function’s execution role needs the lambda:UpdateEventSourceMapping action for any event source mapping it will be used for. You could have one function per event source mapping if you want to grant least privilege. You must also grant EventBridge permission to invoke your new function.

EventBridge rule targets can be configured with a constant JSON text value that will be passed to the function. For the example below, I found the UUID by running list-event-source-mappings (try using the AWS CloudShell).

Pros and cons

This approach is simple, cheap, and uses Lambda for one of its intended purposes. If you’re used to operating Lambda functions, then you’ll have no trouble here.

On the other hand, any extra code you deploy is a liability since each line could contain a bug.

I think the pros outweigh the cons and that this is a perfectly reasonable implementation. Having said that, let’s look at the other solution.

API Gateway

As of writing, API Gateway can act as a proxy for 104 other AWS services. These service integrations are very flexible, allowing you to call actions on the other services without writing code, including UpdateEventSourceMapping.

To work out how to configure the service integration, you look at the Request Syntax section of the action documentation. Specifically, we’re interested the:

  • HTTP method: PUT,
  • Path: /2015-03-31/event-source-mappings/UUID,
  • Headers: Content-type: application/json, and
  • Body: { "Enabled": boolean }

There are many different ways to configure API Gateway based on how you want to call it and how generic/reusable it should be. You could use query string parameters, headers, or a body mapping template, for example.

In my proof of concept, I included the event source mapping UUID in the path override, set the HTTP method to PUT, and kept content handling as its default so EventBridge can vary the body based on whether I’m enabling or disabling.

Over in EventBridge, we need to make some changes:

  • Set the rule target to be API Gateway.
  • Configure the right API, Deployment Stage, and Integration Target.
  • Add a header with key Content-Type and value application/json.
  • Change the constant JSON text. This will now be used for HTTP body and be proxied to UpdateEventSourceMapping, so it must suit the Lambda API:

Security

API Gateway supports three different endpoint types; edge-optimized, regional, and private. There is no point using an edge-optimised endpoint since both EventBridge and Lambda are regional. Since EventBridge doesn’t support private API endpoints, we should use a regional endpoint.

Regional endpoints are publicly accessible by default and obviously we don’t want anyone on the internet being able to enable or disable anything. Luckily, this is easy to solve by enabling IAM authorization in API Gateway.

With Authorization set to AWS_IAM, calls to the API now need to be signed with AWS credentials. Unless you additionally configure a Resource Policy, the credentials must be from the same account and allow execute-api:Invoke on the API.

Once configured with an IAM role similar the one below, EventBridge will handle signing requests and be able to invoke your API again.

With this configuration, your API is secured by IAM basically the same way the underlying Lambda API is.

Pros and cons

Using an API Gateway service integration instead of a Lambda means you aren’t deploying any extra code. It’s still serverless, so you aren’t paying anything between invocations. Once you understand service integrations, this solution is just as simple, if not simpler, than the Lambda option.

Per request, API Gateway REST APIs are more expensive than Lambda functions, but we’re talking fractions of a cent per month here. Speaking of pricing, note that you’re not charged for requests that fail authorization.

Example code

In both the Lambda and API Gateway solutions, you may want to do a little more before calling them production worthy. For example, error handling, logging, alarms, tuning the EventBridge retry policy, etc.

I’ve created a GitHub repository with an example CloudFormation template for each of the two solutions. The templates both create an example function with an SQS event source mapping that gets enabled at 01:00 UTC and disabled at 06:00 UTC. lambda-option-cf.yml implements the Lambda solution, and apigateway-option-cf.yml implements the API Gateway one.

For more like this, you can follow me on Medium and/or Twitter.

Principal Engineer @ Just Eat Takeaway.com | AWS Community Builder