Caching AWS CDK Docker Builds
From minutes, to seconds, to no time at all. This is a story about how I made my CDK build infinitely faster by caching Docker builds.
🤔 Background
I recently started using Amazon Bedrock knowledge bases along with Amazon OpenSearch Serverless collections (which are not actually serverless since they don’t scale to zero, but I digress).
Unfortunately, AWS CDK support for these two services is currently non-existent beyond L1 constructs, which are the ones that start with Cfn and map directly to CloudFormation resources.
Using L1 constructs often means writing additional code: they have limited type safety (e.g. strings or numbers instead of enums), have no defaults, and rely on CloudFormation to apply best practices (which isn’t always a good idea).
I could just suck it up and use the L1 constructs, but CloudFormation support for these services also has some missing pieces. One important piece that is currently missing is the ability to manage OpenSearch Serverless indexes.
As long as the API supports managing these resources, it’s possible to use CustomResources to sort of polyfill CloudFormation. If the operation only requires a single API call to do a Create, Update or Delete, you can make things even easier by using an AwsCustomResource. Both of these options mean extra code, and all code is a liability that needs to be maintained.
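For illustration, the single-call version looks something like this. This is only a sketch: the service, action, and policy document below are placeholders for whatever single API call you need, not what any particular library actually does.

import { Stack } from "aws-cdk-lib";
import {
  AwsCustomResource,
  AwsCustomResourcePolicy,
  PhysicalResourceId,
} from "aws-cdk-lib/custom-resources";

declare const stack: Stack;

// One SDK call per lifecycle event, no handler code to maintain.
// The service/action/parameters are illustrative placeholders.
new AwsCustomResource(stack, "EncryptionPolicy", {
  onCreate: {
    service: "OpenSearchServerless",
    action: "createSecurityPolicy",
    parameters: {
      name: "my-encryption-policy",
      type: "encryption",
      policy: JSON.stringify({
        Rules: [{ ResourceType: "collection", Resource: ["collection/my-collection"] }],
        AWSOwnedKey: true,
      }),
    },
    physicalResourceId: PhysicalResourceId.of("my-encryption-policy"),
  },
  onDelete: {
    service: "OpenSearchServerless",
    action: "deleteSecurityPolicy",
    parameters: { name: "my-encryption-policy", type: "encryption" },
  },
  policy: AwsCustomResourcePolicy.fromSdkCalls({
    resources: AwsCustomResourcePolicy.ANY_RESOURCE,
  }),
});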
In order to avoid all this custom work, I started looking around for some pre-made CDK constructs I could use. AWS provides Construct Hub where I should theoretically be able to find something useful if it exists. Searching for bedrock only returned one library about agents, and opensearch serverless returned a lot of results (probably because of serverless), none of which seemed relevant.
I then tried Google and found many results that pointed me to @cdklabs/generative-ai-cdk-constructs back on Construct Hub. This library is made by AWS Labs, so it has some weight behind it. It seems to have many of the L2 constructs I’m looking for and even some L3 constructs I’ll investigate in the future. The code is on GitHub, which is always a good sign too.
I’m not sure if AWS Labs constructs ever make it directly into AWS CDK, but I suspect the official versions will be very similar in any case, so future migration should be easy enough.
I installed the library and wrote some code. Done, right? Well, yes, but also no.
🐌 The Problem
I tried to do a cdk deploy and it failed because Docker wasn’t running. From the error I could see that it was the new library trying to do something. I don’t use Docker much, but I have it installed, so I started it up and tried again. This time it worked, but wow was it slow.
Docker builds usually download a base image, then run some commands, and out pops a new image. From the logs I could see generative-ai-cdk-constructs’s build was downloading a base image, doing some Python package manager stuff, creating a temporary container from the image, then copying files out of the container into the cdk.out assets directory. The first time I ran this, it took a few minutes.
The second time I ran it, Docker had already cached the base image and the steps to install and run the package manager (Poetry). Installing the dependencies was also cached. The slowest part was actually copying the 31.4MB of files out of the container. Timing the command says it takes 12 seconds.
12 seconds may seem fine, but there are two big problems. Firstly, this process runs every time CDK synthesizes the app. This means if I want to do a cdk deploy --hotswap to just swap out some Lambda code or whatever, it takes an additional 12 seconds. That’s not very hot.
The second problem is that when I run my deployment in GitHub Actions, the runner starts with a clean slate. It doesn’t have the base image or other steps cached like my computer does, so it adds minutes to the build. This costs money since runner time is not free, and slows things down. I deploy and destroy every pull request and don’t want to add minutes to that.
🕵️ Inspecting The Code
I did a lot of digging around to see what both generative-ai-cdk-constructs and AWS CDK are doing under the hood. I won’t talk about everything, but I’ll link to the code in case you’re interested. In any case, it’s important to have a general idea to follow along.
generative-ai-cdk-constructs creates a CustomResource to polyfill managing OpenSearch Serverless index resources, which means it needs to create a Lambda function and that function needs code.
The CustomResource they use is named opensearch-serverless-custom-resources. That one is located here. It’s some Python files and a Dockerfile, which is referenced here using lambda.Code.fromDockerBuild(codePath).
fromDockerBuild is how you tell CDK to use Docker to build your code, but not to actually return a Docker image. Instead, it should copy the files out of the Docker container and zip them up as a simple Lambda function.
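For context, a typical call looks something like this (just a sketch; the runtime, handler, and folder names are placeholders):

import { join } from "path";
import { Duration, Stack } from "aws-cdk-lib";
import * as lambda from "aws-cdk-lib/aws-lambda";

declare const stack: Stack;

// Docker builds the code from a Dockerfile in the folder, then CDK copies the
// build output out of the container (by default /asset) and zips it up as an
// ordinary Lambda asset.
new lambda.Function(stack, "CustomResourceHandler", {
  runtime: lambda.Runtime.PYTHON_3_11,
  handler: "index.handler", // hypothetical handler
  code: lambda.Code.fromDockerBuild(join(__dirname, "lambda-src")), // folder containing a Dockerfile
  timeout: Duration.minutes(5),
});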
Inside lambda.Code.fromDockerBuild is this:
const assetPath = cdk.DockerImage
  .fromBuild(path, options)
  .cp(imagePath, options.outputPath);
cdk.DockerImage.fromBuild is defined here. It basically just constructs a CLI command and calls docker build. DockerImage.cp is in the same file and again just calls the Docker CLI to copy files out of the container.
🙅‍♂️ Attempted Solutions
In GitHub Actions, there’s a concept of caches. This is 10GB of storage you can save to and restore from between builds. It’s very common to use actions/cache to save a copy of your NPM packages between builds to avoid having to download them every time.
This feels like a similar problem, and I thought it’d be nice to use something GitHub-native, so I started investigating that. It wasn’t super clear where the Docker cache was actually located, nor whether the cache can actually be manually copied and restored, nor whether it’s going to be excessively large (there were some people on the internet suggesting caching images is counter-productive due to their size).
Searching Google for Docker caching mostly came up with links to the docker build docs, which described the --cache-to and --cache-from arguments. One use for these arguments is type=local, which makes them read and write a portable cache to a specific folder. This sounded like what I wanted.
I tested it out locally and it did work. At least, it wrote to the folder and the logs said it read from it. I monkey patched these arguments into the docker build command AWS CDK was running and pushed it to GitHub. It failed. Apparently, the version of Docker that runs in GitHub Actions doesn’t support the local cache.
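For the curious, my patch just spliced the arguments into the docker build command CDK runs. If you’re on a recent aws-cdk-lib that exposes cacheFrom and cacheTo on its Docker build options (an assumption; check your version), a cleaner sketch of the same idea looks roughly like this:

import * as lambda from "aws-cdk-lib/aws-lambda";

// Sketch only: cacheFrom/cacheTo (if your aws-cdk-lib version has them) map to
// docker build --cache-from/--cache-to. The .docker-cache folder name is made up.
const originalFromDockerBuild = lambda.Code.fromDockerBuild;
lambda.Code.fromDockerBuild = function (path: string, options: any = {}) {
  return originalFromDockerBuild(path, {
    ...options,
    cacheFrom: [{ type: "local", params: { src: ".docker-cache" } }],
    cacheTo: { type: "local", params: { dest: ".docker-cache" } },
  });
};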
I could have pursued this and installed a different version of Docker. It probably would have worked, but I wasn’t super happy with the solution. It was a bit messy and it still added 12 seconds to the build. That’s not much in a CI environment, but it still sucks locally and I’m a perfectionist.
In my Googling, I learned about docker buildx. Apparently it’s some sort of extension. Instead of docker build …, you run docker buildx build … and have access to more functionality. This has an experimental module that uses the GitHub Actions cache directly. That sounded cool, but it seemed to involve a lot more changes, and again, it would still add at least 12 seconds.
A protip when hacking workarounds into things is to keep them as simple as possible. That way they’re more likely to work the way you intended and less likely to break when something else changes.
👍 This Works
Okay, so back to the drawing board. Or the thinking desk since I just sat there thinking. The best solution is to completely avoid calling Docker.
The end result of lambda.Code.fromDockerBuild(codePath) is a folder containing files (the Python source files and dependencies). Those files are then just zipped up by CDK for deployment. Calling lambda.Code.fromAsset(codePath) results in the same thing, but assumes the code is already built (and optionally already zipped).
I decided that I could just take a copy of fromDockerBuild's output from cdk.out and use that with fromAsset. I copied and zipped the files with maximum compression, taking them from 31.4MB down to 16.1MB. That’s small enough to store in GitHub in my opinion, so I just put it in my CDK app folder.
Next I needed to monkey patch this in. Here’s the code I used. I’ll talk about it below.
import { join, sep } from "path";
import { FileSystem } from "aws-cdk-lib/core";

// The library folder whose Docker build we want to skip.
const target =
  "node_modules/@cdklabs/generative-ai-cdk-constructs/lambda/opensearch-serverless-custom-resources"
    .split("/")
    .join(sep);

// Pre-built copy of that build's output, committed alongside the app.
const zip = "./opensearch-serverless-custom-resources.zip";

const lambda = require("aws-cdk-lib/aws-lambda");
const originalFromDockerBuild = lambda.Code.fromDockerBuild;

lambda.Code.fromDockerBuild = function (path: string, options = {}) {
  if (path.normalize().endsWith(target)) {
    // Hash the source folder so we notice if the library's code changes.
    const fingerprint = FileSystem.fingerprint(path, {
      extraHash: JSON.stringify(options),
    });
    if (
      fingerprint ===
      "d148788e4ddef70e72ffbf26966cc3995c102984214fe18ad4664bf315728d4f"
    ) {
      // Skip Docker entirely and use the pre-built zip.
      return lambda.Code.fromAsset(join(__dirname, zip));
    } else {
      throw new Error(
        `docker-cache-patch: One or more files in ${target} have changed. We need to remake ${zip}.`
      );
    }
  }
  // Anything else still goes through the original Docker build.
  return originalFromDockerBuild(path, options);
};
The actual monkey patch is done by requiring aws-cdk-lib/aws-lambda, storing a reference to the original lambda.Code.fromDockerBuild, then replacing it with my own function.
My function checks that CDK is about to build opensearch-serverless-custom-resources. If it’s not, it just forwards the call on to the original function. The check normalizes the incoming path and builds target with the local path separator (sep), so I can define target using one style of path separator (/) but run the code on any OS.
I then hash the files in the source folder. FileSystem.fingerprint is a CDK utility to do this and is what fromDockerBuild is doing internally. CDK does this so that it can more easily check if the source code has changed and avoid unnecessary deployments. According to code comments, there are some timestamps involved in Docker builds that change Docker’s hash on every build, even when the code hasn’t changed.
Anyway, I fingerprint it because if cdk-labs makes a change to the custom resource code, I don’t want to continue deploying old code. In fact, I’ve made it throw an error. I originally had it fall back to building the code, thinking I’d just fix it when I notice the slow build, but this way it will fail quickly and be resolved sooner.
If the fingerprint is still the same, I just call lambda.Code.fromAsset and point it at my pre-prepared zip file. This code is used simply by adding import "../docker-cache-patch"; anywhere before the generative-ai-cdk-constructs package is used (I just put it at the top of my stack’s file).
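In other words, the top of the stack file looks something like this (the names below are just an example of the idea, not my actual stack):

// Load the patch before anything from the constructs library runs.
import "../docker-cache-patch";
import { Stack, StackProps } from "aws-cdk-lib";
import { Construct } from "constructs";
import * as genai from "@cdklabs/generative-ai-cdk-constructs";

export class MyStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);
    // ... knowledge base, vector collection, etc. built with `genai`
  }
}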
😀 Conclusion
This solution works perfectly! My synth builds are back to taking milliseconds.
I don’t think cdk-labs will change this code much, but it’s possible. If it becomes a pain, there are improvements that can be made:
For example, I think a more robust solution could involve letting CDK run the Docker build the first time and having my code automatically take a copy of the resulting files. In GitHub Actions, those files can be stored in the cache using the fingerprint, and next run they would already be there. This way, if the code changes, it would self-heal after one slower build.
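A rough sketch of that self-healing idea, assuming AssetCode keeps exposing the built folder as its path property, and using a made-up .docker-build-cache directory:

import { join } from "path";
import * as fs from "fs"; // fs.cpSync needs Node 16.7+
import { FileSystem } from "aws-cdk-lib/core";
import * as lambda from "aws-cdk-lib/aws-lambda";

const cacheRoot = join(__dirname, ".docker-build-cache");

const originalFromDockerBuild = lambda.Code.fromDockerBuild;
lambda.Code.fromDockerBuild = function (path: string, options: any = {}) {
  // Key the cache by the same fingerprint CDK uses for the source folder.
  const fingerprint = FileSystem.fingerprint(path, {
    extraHash: JSON.stringify(options),
  });
  const cached = join(cacheRoot, fingerprint);
  if (fs.existsSync(cached)) {
    // Cache hit: skip Docker and reuse the previously built files.
    return lambda.Code.fromAsset(cached);
  }
  // Cache miss: run the real Docker build once, then copy its output
  // into the cache so the next synth (or CI run) skips Docker.
  const code = originalFromDockerBuild(path, options);
  fs.cpSync(code.path, cached, { recursive: true });
  return code;
};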
What I’ve got will do just fine for now, though.