Solving the AWS Lambda timeout limitation with Amazon ECS Fargate

Sep 3, 2018

Recently, I read an article about using Fargate for long-running processes. The article explains the fundamentals well, though in my view it doesn’t touch on a few important details I expected to see. So in this article I’ll build upon it and reflect on some possible tweaks.

The ideas below are demonstrated with Node.js, though any other runtime would work just as well. In the end, the underlying APIs are the same: AWS Lambda, ECS, etc.

Reflections

For me, the main point of reading the article about long-running processes was to see details about working around the 5-minute timeout limit in AWS Lambda, similar to this older one. Instead, it focused on the ECS setup and said little about the reasoning behind the integration. Also, the runTask example was a useful start, though it’s not immediately obvious how environment variables can act as dynamic parameters.

Also, reading around the topic of integrating AWS Lambda with ECS Fargate, I was surprised that getFunction wasn’t used in any of the implementations (at least I didn’t find one). This function returns, among other details, a pre-signed URL for downloading the zip archive of a given Lambda function’s deployment package. The handler in that archive is the code which could be executed in the container. In my opinion, that information and resource would enable a far thinner implementation in the container.
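To make the getFunction idea more concrete, here is a minimal sketch of resolving the download URL of a function’s code. It assumes the AWS SDK for JavaScript v2 style of client, where `getFunction(params).promise()` resolves to a response containing `Code.Location`. The helper name `getCodeLocation`, the stub client and the function name are made up for illustration.

```javascript
// Sketch: resolve the download URL of a Lambda function's deployment package.
// `lambdaClient` is anything exposing getFunction(params).promise(), e.g.
// `new (require('aws-sdk')).Lambda()` from the AWS SDK for JavaScript v2.
function getCodeLocation(lambdaClient, functionName) {
  return lambdaClient
    .getFunction({ FunctionName: functionName })
    .promise()
    .then((response) => {
      // Code.Location is the pre-signed URL of the zip archive.
      if (!response.Code || !response.Code.Location) {
        throw new Error(`No code location returned for ${functionName}`);
      }
      return response.Code.Location;
    });
}

// Stub client for illustration only; a real call needs AWS credentials.
const stubLambda = {
  getFunction: (params) => ({
    promise: () =>
      Promise.resolve({
        Configuration: { FunctionName: params.FunctionName, Handler: 'handler.primary' },
        Code: { RepositoryType: 'S3', Location: 'https://example.com/code.zip' },
      }),
  }),
};

// 'my-service-dev-primary' is a hypothetical function name.
getCodeLocation(stubLambda, 'my-service-dev-primary').then((url) => {
  console.log(url);
});
```

The pre-signed URL is short-lived, so the container would need to download and unpack the archive right after resolving it.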

Keep in mind that with serverless-webpack you can bundle your handlers, libraries and other modules into a single file containing all dependencies.

Ideas about implementation

After the research and reflections above, I arrived at the following implementation strategy:

1. Have the primary AWS Lambda handler. This is the business-critical handler which should not fail because of timeout limitations.
2. Have a secondary AWS Lambda handler which is plugged into the primary via a dead letter queue. Not only is SQS a valid Lambda trigger nowadays, it is also better to start the Fargate task only if and when it’s necessary. More specifically, when an asynchronous invocation times out, Lambda attempts the handler up to 3 times in total (the initial run plus 2 retries). If all 3 attempts fail, then and only then is the event sent to the queue, and the secondary handler, acting as a safety net, triggers the container for a long-running task.
3. Run the ECS Fargate Docker container when the primary handler fails. The container’s wrapper around the primary handler should be as thin as fetching the original handler and passing down the original event and context without any mutations.
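As a rough illustration of the second point, the secondary handler could recover the original event from the queue messages like this. The sketch assumes the handler is triggered by SQS, that each record body holds the JSON payload of the failed invocation, and that Lambda attached an `ErrorMessage` message attribute when it moved the event to the dead letter queue. The helper name `extractOriginalEvents` is hypothetical.

```javascript
// Sketch: recover the primary handler's original events from DLQ messages.
// Assumption: each SQS record body is the JSON payload of the failed invocation.
function extractOriginalEvents(sqsEvent) {
  return (sqsEvent.Records || []).map((record) => {
    const attributes = record.messageAttributes || {};
    return {
      originalEvent: JSON.parse(record.body),
      // Assumption: Lambda adds an ErrorMessage attribute to DLQ messages.
      errorMessage: attributes.ErrorMessage
        ? attributes.ErrorMessage.stringValue
        : undefined,
    };
  });
}

// Hand-written sample of an SQS-triggered Lambda event, for illustration only.
const sampleSqsEvent = {
  Records: [
    {
      body: JSON.stringify({ orderId: 42 }),
      messageAttributes: {
        ErrorMessage: { stringValue: 'Task timed out', dataType: 'String' },
      },
    },
  ],
};

console.log(extractOriginalEvents(sampleSqsEvent));
```

Note that only the event travels through the queue this way. Anything needed from the original context would have to be reconstructed or passed along separately.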

Implementation notes

Update the `serverless` service

Because my services are based on the Serverless Framework, I used this dead letter queue plugin to attach the secondary handler to the primary one.

This means making the following changes inside your serverless.yaml file:

1. Enable the DLQ plugin

plugins:
  - serverless-plugin-lambda-dead-letter

2. Create the dead letter queue in Resources

If you haven’t created this section yet, follow the documentation.

3. Attach the secondary handler to the SQS queue:

4. Attach the secondary handler to the primary one:

5. Update IAM role statement of the service

Details are available in the article which inspired me. Keep in mind that once you override the default IAM role statements, you might also need to specify permissions that are unrelated to the task at hand.

Here’s an example focused on the ECS’s code implementation

As you may notice, the most interesting part is where you pass environment variables from the secondary handler to the container.

The container will take these variables and start the corresponding handler with the same event and context.
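A minimal sketch of how the secondary handler could build the runTask parameters so that the event and context reach the container as environment variables. The `overrides.containerOverrides[].environment` structure is part of the ECS runTask API. The cluster, task definition, container, subnet and variable names (`FUNCTION_NAME`, `EVENT`, `CONTEXT`) are placeholders I made up for illustration.

```javascript
// Sketch: build ECS runTask parameters that forward the event to the container.
// All names passed in below are hypothetical placeholders.
function buildRunTaskParams(options) {
  return {
    cluster: options.cluster,
    taskDefinition: options.taskDefinition,
    launchType: 'FARGATE',
    count: 1,
    networkConfiguration: {
      awsvpcConfiguration: {
        subnets: options.subnets,
        assignPublicIp: 'ENABLED',
      },
    },
    overrides: {
      containerOverrides: [
        {
          name: options.containerName,
          // Environment values must be strings, so objects are serialized.
          environment: [
            { name: 'FUNCTION_NAME', value: options.functionName },
            { name: 'EVENT', value: JSON.stringify(options.event) },
            { name: 'CONTEXT', value: JSON.stringify(options.context || {}) },
          ],
        },
      ],
    },
  };
}

const runTaskParams = buildRunTaskParams({
  cluster: 'default',
  taskDefinition: 'runner',
  containerName: 'runner',
  subnets: ['subnet-00000000'],
  functionName: 'my-service-dev-primary',
  event: { orderId: 42 },
});

console.log(JSON.stringify(runTaskParams.overrides.containerOverrides[0].environment));
```

With the AWS SDK for JavaScript v2, the result could then be passed to `new AWS.ECS().runTask(runTaskParams).promise()`. ECS caps the total size of task overrides, so a very large event may need to be handed over by reference instead, for example as an S3 key.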

Create Docker container and deploy it

AWS ECS setup

Similar to the prerequisites section here, you will need to set up the ECS task in the AWS console. When you use the console with the defaults, 2 new roles will be created: AWSServiceRoleForECS and ecsTaskExecutionRole. Use these roles and attach the necessary policies to them depending on which resources you'd like to manage from the container.

For instance, you can attach AWSLambdaFullAccess, AmazonS3FullAccess and AmazonECSTaskExecutionRolePolicy (AWS managed policies) to ecsTaskExecutionRole for a start. Later, you can narrow AWSLambdaFullAccess down to lambda:GetFunction, for example.

1. Dockerfile definition

This could be very simple, exposing 2 variables more clearly:

This assumes that your serverless service is based on the Node.js 8.10 runtime.

2. Create a runner.js file which will execute the primary handler inside the container
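A minimal sketch of what the core of such a runner could look like, assuming the handler code has already been downloaded and unpacked inside the container (that step is left out here). The environment variable names `EVENT`, `CONTEXT`, `HANDLER_PATH` and `HANDLER_NAME` are hypothetical and need to match whatever the secondary handler passes in.

```javascript
// Sketch of the core of runner.js: invoke a Lambda-style handler with the
// event and context passed in through environment variables.
// EVENT, CONTEXT, HANDLER_PATH and HANDLER_NAME are hypothetical names.
function invokeHandler(handler, env) {
  const event = JSON.parse(env.EVENT || '{}');
  const context = JSON.parse(env.CONTEXT || '{}');

  return new Promise((resolve, reject) => {
    // Support callback-style handlers ...
    const callback = (error, result) => (error ? reject(error) : resolve(result));
    const returned = handler(event, context, callback);
    // ... and async handlers that return a promise.
    if (returned && typeof returned.then === 'function') {
      returned.then(resolve, reject);
    }
  });
}

// Only run when the container provides the location of the unpacked handler.
if (process.env.HANDLER_PATH) {
  const handlerModule = require(process.env.HANDLER_PATH);
  const handler = handlerModule[process.env.HANDLER_NAME || 'handler'];

  invokeHandler(handler, process.env)
    .then((result) => {
      console.log('Handler finished:', JSON.stringify(result));
    })
    .catch((error) => {
      console.error('Handler failed:', error);
      process.exit(1);
    });
}
```

Exiting with a non-zero code on failure lets the stopped ECS task report that the run did not succeed.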

Don’t forget to make this file executable:

$ chmod +x runner.js

3. Create a repository for the image on ECR

The process is similar to creating a repository on GitHub. Check this documentation to get an idea of how it works.

4. Build the image

$ docker build -t runner .

5. Tag it

$ docker tag runner:latest {accountId}.dkr.ecr.eu-central-1.amazonaws.com/runner:latest

You’d normally get the tag information when you create the repository on ECR.

For more details on tagging the image, see this tutorial.

6. Push the latest image to the repository

$ docker push {accountId}.dkr.ecr.eu-central-1.amazonaws.com/runner:latest

If you experience issues with credentials, you can run the following command to obtain fresh temporary credentials and use them directly, without copy-pasting:

$ aws ecr get-login --no-include-email | source /dev/stdin

At this point, if you have configured your ECS task to use the image from this repository, running the container will run runner.js, which in turn will run the primary handler.

Win

Now you have a setup where:

1. The primary Lambda function can fail because of a timeout.
2. A secondary Lambda function gets triggered with the initial event and context from the primary.
3. The secondary Lambda function starts a Docker container on Fargate with this event and context, and adds some more information about the location of the handler to run in the container.
4. The container’s runner.js gets executed without any timeout limitations and runs the primary Lambda function for as long as is necessary for the process to complete successfully.

Originally published at kalinchernev.github.io.