Solving the AWS Lambda timeout limitation with Amazon ECS Fargate
Recently, I read an article about using Fargate for long-running processes. The article explains the fundamentals well, though in my view it doesn't touch upon a few important details I expected to see. So here I'll build upon that article and reflect on some possible tweaks.
The ideas below are demonstrated with Node.js, though any other runtime would work just as well. In the end, the underlying APIs are the same: AWS Lambda, ECS, etc.
Reflections
For me, the main point of reading the article about long-running processes was to see details about working around the 5-minute timeout limitation in AWS Lambda, similar to this older one. Instead, it focused on the ECS setup and said little about the reasoning behind the integration. Also, the runTask example was a useful start, though it's not immediately obvious how the environment variables can act as dynamic parameters.
Also, reading around the topic of integrating AWS Lambda with ECS Fargate, I was surprised that getFunction wasn't used in any of the implementations (at least I didn't find one). This function returns a pre-signed URL pointing to a zip archive with the code of a given Lambda handler. This handler is the code which could be executed in the container. In my opinion, that information and resource would enable a far thinner implementation in the container.
Keep in mind that with serverless-webpack you can package your handlers, libraries and others into a single file containing all dependencies.
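To make the getFunction idea more concrete, here is a minimal sketch of reading the code location and the entry point of a function. It assumes the AWS SDK for JavaScript v2 (the aws-sdk package); the helper name locateHandler and the function name in the usage comment are hypothetical.

```javascript
'use strict';

// Look up where the deployment package of a function lives and which export
// is its entry point. `lambda` is expected to be an AWS.Lambda client (SDK v2).
async function locateHandler(lambda, functionName) {
  const response = await lambda
    .getFunction({ FunctionName: functionName })
    .promise();

  return {
    codeUrl: response.Code.Location,         // pre-signed URL of the zip archive
    handler: response.Configuration.Handler, // for example "src/report.handler"
    runtime: response.Configuration.Runtime,
  };
}

// Usage (hypothetical function name):
//   const AWS = require('aws-sdk');
//   locateHandler(new AWS.Lambda(), 'my-service-dev-report').then(console.log);
```

The pre-signed URL is only valid for a short time, so it makes sense to request it right before downloading the archive rather than passing it around.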
Ideas about implementation
After the research and reflections above, I came to the following ideas about a possible implementation strategy:
- Have the primary AWS Lambda handler. This is the business-critical handler which should not fail because of timeout limitations.
- Have a secondary AWS Lambda handler which is plugged to the primary via a dead letter queue. Not only is SQS a valid trigger nowadays, but it's also better to start the Fargate task only if and when it's necessary. More specifically, when an asynchronous invocation times out, Lambda retries the handler, for 3 attempts in total. If all 3 attempts fail, then and only then the secondary handler, acting as a safety net, triggers the container for a long-running task.
- Run the ECS Fargate Docker container when the primary handler fails. The container's wrapper around the primary handler should be as thin as fetching the original handler and passing down the original event and context without any mutations.
Implementation notes
Update the serverless service
Because the services are based on the serverless framework, I used this dead letter queue plugin in order to attach the secondary handler to the primary one.
This means that, inside your serverless.yaml file:
1. Enable the DLQ plugin
plugins:
  - serverless-plugin-lambda-dead-letter
2. Create the dead letter queue in Resources
If you haven’t created this section yet, follow the documentation.
3. Attach the secondary handler to the SQS queue:
4. Attach the secondary handler to the primary one:
5. Update the IAM role statements of the service
Details are available in the article which inspired me. Keep in mind that once you override the defaults you might need to specify additional permissions which are not related to the task at hand.
Here's an example focused on the ECS-related code:
As you can see, the most interesting part is where you pass environment variables from the secondary handler to the container.
The container will take these variables and start the corresponding handler with the same event and context.
Create Docker container and deploy it
AWS ECS setup
Similar to the prerequisites section here, you will need to set up the ECS task in the AWS console. When you use the console with the defaults, 2 new roles will be created: AWSServiceRoleForECS and ecsTaskExecutionRole. Use these roles and attach the necessary policies to them depending on which resources you'd like to manage from the container.
For instance, you can attach AWSLambdaFullAccess, AmazonS3FullAccess and AmazonECSTaskExecutionRolePolicy (AWS managed policies) to ecsTaskExecutionRole for a start. Later, you can limit AWSLambdaFullAccess to lambda:GetFunction, for example.
1. Dockerfile definition
This could be very simple, exposing 2 variables explicitly:
This assumes that your serverless service is based on the Node.js 8.10 runtime.
2. Create a runner.js file which will execute the primary handler inside the container
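A minimal sketch of what such a runner could look like is given here, under several assumptions: the image has the aws-sdk package (v2), curl and unzip installed, the region and credentials come from the task's environment, the deployment package is self-contained (for example bundled with serverless-webpack), and the environment variables FUNCTION_NAME, EVENT and CONTEXT are hypothetical names set by the secondary handler.

```javascript
'use strict';

const fs = require('fs');
const os = require('os');
const path = require('path');
const { execFileSync } = require('child_process');

// "src/report.handler" -> { modulePath: "src/report", exportName: "handler" }
function splitHandler(handler) {
  const dot = handler.lastIndexOf('.');
  return { modulePath: handler.slice(0, dot), exportName: handler.slice(dot + 1) };
}

// Call a handler written in callback or promise style; always return a promise.
function invoke(handlerFn, event, context) {
  return new Promise((resolve, reject) => {
    const callback = (err, result) => (err ? reject(err) : resolve(result));
    const returned = handlerFn(event, context, callback);
    if (returned && typeof returned.then === 'function') {
      returned.then(resolve, reject);
    }
  });
}

async function main(env) {
  const AWS = require('aws-sdk'); // v2 SDK, assumed to be installed in the image
  const lambda = new AWS.Lambda();
  const fn = await lambda.getFunction({ FunctionName: env.FUNCTION_NAME }).promise();

  // Download and unpack the deployment package into a fresh temp directory.
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'handler-'));
  const zip = path.join(dir, 'code.zip');
  execFileSync('curl', ['-fsSL', '-o', zip, fn.Code.Location]);
  execFileSync('unzip', ['-q', '-o', zip, '-d', dir]);

  // Load the original handler and call it with the unmodified event and context.
  const target = splitHandler(fn.Configuration.Handler);
  const handlerFn = require(path.join(dir, target.modulePath))[target.exportName];
  const event = JSON.parse(env.EVENT || '{}');
  const context = JSON.parse(env.CONTEXT || '{}');
  return invoke(handlerFn, event, context);
}

// Only start when the secondary handler provided a function name.
if (process.env.FUNCTION_NAME) {
  main(process.env)
    .then((result) => console.log(JSON.stringify(result)))
    .catch((err) => {
      console.error(err);
      process.exit(1);
    });
}
```

If the file is executed directly rather than through `node runner.js`, it also needs a `#!/usr/bin/env node` line at the very top. Note that the context rebuilt from JSON carries only plain data, so a handler relying on methods such as getRemainingTimeInMillis would need a stub.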
Don't forget to make this file executable:
$ chmod +x runner.js
3. Create a repository for the image on ECR
The process is similar to creating a repository on GitHub. See this documentation to get an idea about it.
4. Build the image
$ docker build -t runner .
5. Tag it
$ docker tag runner:latest {accountId}.dkr.ecr.eu-central-1.amazonaws.com/runner:latest
You'd normally get the information about the tag when you create the repository on ECR.
For more details on tagging the image, see this tutorial.
6. Push the latest image to the repository
$ docker push {accountId}.dkr.ecr.eu-central-1.amazonaws.com/runner:latest
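If you experience issues with credentials, you can run the following command to obtain fresh temporary credentials and use them directly without copy-paste:
$ aws ecr get-login --no-include-email | source /dev/stdin
With AWS CLI v2, where get-login has been removed, the equivalent is:
$ aws ecr get-login-password | docker login --username AWS --password-stdin {accountId}.dkr.ecr.eu-central-1.amazonaws.com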
At this point, if you have configured your ECS task to use this container from this repository, running the container will run runner.js, which in turn will run the primary handler.
Win
Now you have a setup where:
- Primary lambda function can possibly fail because of a timeout issue.
- A secondary lambda function gets triggered with the initial event and context from the primary.
- The secondary lambda function starts a docker container on Fargate with this event and context and adds some more information about the location of the handler to run in the container.
- The container's runner.js gets executed without any timeout limitations and runs the primary lambda function for as long as it's necessary for the process to complete successfully.
Originally published at kalinchernev.github.io.