One of the major advantages of using the cloud is its pay-per-use model. To make the most of this model, the challenge is to find the compute capacity that matches your workload at the best price. In this article, I explain how I ran a web application on 100% Fargate Spot containers.
EC2 vs Fargate vs Fargate Spot
AWS offers several compute capacity options for running containers orchestrated by ECS:
- EC2: the good old on-demand compute service for servers on which you can install the ECS container agent
- Fargate: a serverless compute engine that lets you run ECS containers without having to manage EC2 servers
Both EC2 and Fargate offer Spot pricing. I chose Fargate over EC2 for running containers to achieve cost savings and operational simplicity: Fargate eliminates the need to manage EC2 instances, and Fargate Spot offers significant pricing advantages.
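As an illustration of what "100% Fargate Spot" can look like in practice, here is a minimal sketch that builds the arguments for boto3's `ecs.create_service` call with a capacity provider strategy containing only `FARGATE_SPOT`. The function name and its arguments are hypothetical, and the sketch assumes the cluster already has the `FARGATE_SPOT` capacity provider associated; it is not necessarily how the services in this article were deployed.

```python
def spot_only_service_params(cluster, service_name, task_definition, subnets, desired_count=2):
    """Build keyword arguments for ecs.create_service using only Fargate Spot.

    All argument values are placeholders supplied by the caller.
    """
    return {
        "cluster": cluster,
        "serviceName": service_name,
        "taskDefinition": task_definition,
        "desiredCount": desired_count,
        # launchType is deliberately omitted: it cannot be combined
        # with a capacity provider strategy.
        "capacityProviderStrategy": [
            {"capacityProvider": "FARGATE_SPOT", "weight": 1, "base": 0},
        ],
        "networkConfiguration": {
            "awsvpcConfiguration": {"subnets": subnets, "assignPublicIp": "DISABLED"},
        },
    }
```

The resulting dictionary could then be passed as `boto3.client("ecs").create_service(**params)`. The same strategy can be expressed in CloudFormation, Terraform or CDK.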
Having a fault-tolerant application
The most important thing to manage when using Spot capacity is that AWS can interrupt your ECS task at any time. In concrete terms, AWS gives a two-minute warning: a SIGTERM signal is sent to the containers in your ECS task, and they then have until their stop timeout expires to exit before being hard-killed. That timeout is 30 seconds by default and can be raised to 120 seconds on Fargate with the stopTimeout container parameter. Your application must therefore be able to shut down cleanly within that window. The overwhelming majority of web applications respond within a few seconds, so this limit is rarely a blocker. If you configure your ECS services with a desired count of at least 2 tasks, it is not a problem if only one task is running for a few seconds while the orchestrator starts a replacement.
I explain this in a little more detail in another article, “Handling graceful shutdown of your Docker Cron containers”.
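To make the SIGTERM handling concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under assumptions, not the code of the application described here: `handle_one_request` is a hypothetical callable that processes one unit of work and returns whether it did anything. In a real web application, the framework or server usually provides its own graceful shutdown hook.

```python
import signal
import threading

# Set when SIGTERM arrives; the serving loop checks it between requests.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Signal handler: only record the request, do no heavy work here."""
    shutdown_requested.set()


def install_sigterm_handler():
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def serve(handle_one_request, poll_interval=0.1):
    """Process work until shutdown is requested, then return cleanly.

    The request in progress always finishes before the flag is checked,
    so no request is cut off halfway.
    """
    handled = 0
    while not shutdown_requested.is_set():
        if handle_one_request():
            handled += 1
        else:
            # Nothing to do: sleep, but wake up early on shutdown.
            shutdown_requested.wait(poll_interval)
    return handled
```

A process would call `install_sigterm_handler()` once at startup and then `serve(...)`. When the loop returns, it can close connections and exit before the stop timeout expires.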
Manage unavailability of Fargate Spot capacity
As the AWS documentation states, Fargate Spot capacity might be unavailable during periods of extremely high demand. In concrete terms, if your ECS service is set up to run its tasks on 100% Spot, there is a risk of running out of capacity. A workaround has been published, in the hope that the AWS team will one day provide this fallback natively. This workaround lets you set up two ECS services:
“a primary with only spot capacity and a fallback with only on demand capacity. When the primary emits a task placement error a lambda sets the desired count on the fallback service. When the primary emits a steady state event a lambda sets the fallback desired count to zero.”
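The logic described in the quote can be sketched as a small Lambda-style handler. This is a hedged illustration, not the workaround's actual code: the function names, the `fallback_count` parameter and the injected `ecs_client` are assumptions. The event fields (`detail-type`, `resources`, `detail.eventName`) and the event names `SERVICE_TASK_PLACEMENT_FAILURE` and `SERVICE_STEADY_STATE` follow the ECS service action events delivered through EventBridge as I understand them, and should be checked against the AWS documentation.

```python
def fallback_desired_count(event, primary_service_arn, fallback_count):
    """Return the desired count for the fallback service, or None to leave it unchanged."""
    if event.get("detail-type") != "ECS Service Action":
        return None
    # Only react to events emitted by the primary (Spot-only) service.
    if primary_service_arn not in event.get("resources", []):
        return None
    event_name = event.get("detail", {}).get("eventName")
    if event_name == "SERVICE_TASK_PLACEMENT_FAILURE":
        return fallback_count  # Spot could not place a task: start on-demand tasks
    if event_name == "SERVICE_STEADY_STATE":
        return 0  # primary is healthy again: scale the fallback down
    return None


def handle(event, ecs_client, cluster, primary_service_arn, fallback_service, fallback_count=2):
    """Apply the decision through an ECS client (for example boto3.client("ecs"))."""
    count = fallback_desired_count(event, primary_service_arn, fallback_count)
    if count is not None:
        ecs_client.update_service(cluster=cluster, service=fallback_service, desiredCount=count)
    return count
```

Keeping the decision in a pure function makes it easy to test without AWS access. The client is passed in so that a stub can replace it in tests.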
My experience with Fargate Spot
In my experience, I was able to run 4 microservices in production on 100% Fargate Spot capacity. The microservice responsible for the frontend served around 2,000 distinct users per day. At no point in one year did we run out of Fargate Spot capacity. When a task received an interruption signal, another started within a few seconds, with virtually no effect on response times. The savings we made thanks to Fargate Spot are considerable, and it was a really good choice.