Lock down AWS Fargate networking when using ECR as an image repository (VPC Endpoints)


We set up an 'internal only' Fargate task the other day with all outbound egress traffic locked down. This required more effort than anticipated, and I want a reference I can look back on in case I run into this issue again.

Updated: September 2019 to include notes on how other VPC Endpoints can impact Fargate tasks

Symptoms

When you attempt to run a Fargate task, the task fails with messages like these:

  • Error Message: CannotPullContainerError: context canceled.
  • Status reason: DockerTimeoutError: Could not transition to started; timed out after waiting 3m0s

You may also see a message that looks like this (GUID values scrubbed):

Status reason    CannotPullContainer

Error: error pulling image configuration: Get https://prod-us-east-1-starport-layer-bucket.s3.amazonaws.com/61d6a6b7-a74d-4472-96e4-710ec9a7a96b/d834cc02-771d-4c2b-b7bb-7da5334eea15?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-A
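
If the task has already stopped, the same reason can also be pulled back with the CLI. A quick sketch, assuming a cluster named my-cluster and a placeholder task ARN:

    $ aws ecs list-tasks --cluster my-cluster --desired-status STOPPED
    $ aws ecs describe-tasks --cluster my-cluster --tasks <task-arn> \
        --query 'tasks[].[stoppedReason, containers[].reason]'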

Solution / Configuration

  • First off, you will need to use AmazonProvidedDNS (Route 53 for internal DNS). There will be complications if you run your own DNS, which can be simplified by using the Route 53 Resolver service. Using Route 53 ensures that the DNS records for these AWS services resolve to addresses inside your VPC, which allows us to meet our compliance goal of restricting outbound internet access

  • Next, be sure to enable the appropriate VPC endpoints (interface and gateway) in your region and VPC (a rough CLI sketch of these steps appears after this list):

    • com.amazonaws.us-east-1.ecr.api
    • com.amazonaws.us-east-1.ecr.dkr
    • com.amazonaws.us-east-1.s3
    • com.amazonaws.us-east-1.logs (assuming you are using CloudWatch Logs)
    • Any other endpoints your service will use

    Be sure to check the Enable Private DNS Name checkbox for each interface endpoint!

  • While configuring the VPC Endpoints, be sure to configure the following settings:

    • Security groups on each VPC endpoint that allow inbound HTTPS (443) traffic from all tasks/services that will use the endpoint. This can be done via an inbound IP address range OR by assigning multiple security groups to the VPC endpoint
    • Ensure that the endpoint is deployed in the correct subnets where your tasks/services will be running. If the endpoint is not deployed into the subnet where your task is running, the task will fail to start with terse messaging
    • I can't stress this enough: ensure the VPC endpoint's ENI is deployed into the subnet/availability zone where your Fargate tasks are deployed. The endpoints can be configured to span all of your subnets and AZs if necessary
  • Now for the fun / non-obvious part: in the security group assigned to your Fargate task, you need to allow outbound/egress traffic to the S3 gateway endpoint in your VPC. To do this, we need to identify the S3 prefix list in your region (identifiers scrubbed):

    $ aws ec2 describe-prefix-lists --region us-east-1
    {
        "PrefixLists": [
            {
                "Cidrs": [
                    "9.8.7.0/17",
                    "10.9.8.0/15"
                ],
                "PrefixListId": "pl-55443322",
                "PrefixListName": "com.amazonaws.us-east-1.s3"
            },
            {
                "Cidrs": [
                    "11.10.9.0/22",
                    "12.11.10.0/20"
                ],
                "PrefixListId": "pl-11223344",
                "PrefixListName": "com.amazonaws.us-east-1.dynamodb"
            }
        ]
    }

    From the above output you would take the PrefixListId pl-55443322 and use it as the destination in an outbound rule on your task's security group, like so:
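
    A minimal AWS CLI sketch of that rule follows; sg-0123456789abcdef0 is a placeholder for your Fargate task's security group ID:

    # Allow outbound HTTPS from the task's security group to the S3 prefix list
    $ aws ec2 authorize-security-group-egress \
        --group-id sg-0123456789abcdef0 \
        --ip-permissions 'IpProtocol=tcp,FromPort=443,ToPort=443,PrefixListIds=[{PrefixListId=pl-55443322}]'

    In the console this is an outbound HTTPS (443) rule with the prefix list ID as the destination.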
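
For reference, the DNS and endpoint configuration described above can also be sketched with the AWS CLI. This is only a rough outline; the vpc-, subnet-, rtb-, and sg- identifiers below are placeholders for your own resources (sg-22222222 stands in for the endpoint security group, sg-0123456789abcdef0 for the task security group):

    # Ensure the VPC uses AmazonProvidedDNS name resolution
    $ aws ec2 modify-vpc-attribute --vpc-id vpc-11111111 --enable-dns-support
    $ aws ec2 modify-vpc-attribute --vpc-id vpc-11111111 --enable-dns-hostnames

    # Interface endpoint for the ECR API, with private DNS enabled, placed in the
    # same subnets/AZs as the Fargate tasks. Repeat for
    # com.amazonaws.us-east-1.ecr.dkr and com.amazonaws.us-east-1.logs.
    $ aws ec2 create-vpc-endpoint \
        --vpc-id vpc-11111111 \
        --vpc-endpoint-type Interface \
        --service-name com.amazonaws.us-east-1.ecr.api \
        --subnet-ids subnet-aaaaaaaa subnet-bbbbbbbb \
        --security-group-ids sg-22222222 \
        --private-dns-enabled

    # S3 is a gateway endpoint attached to the route tables used by those subnets
    $ aws ec2 create-vpc-endpoint \
        --vpc-id vpc-11111111 \
        --vpc-endpoint-type Gateway \
        --service-name com.amazonaws.us-east-1.s3 \
        --route-table-ids rtb-cccccccc

    # The endpoint security group must allow inbound HTTPS from the task's security group
    $ aws ec2 authorize-security-group-ingress \
        --group-id sg-22222222 \
        --protocol tcp --port 443 \
        --source-group sg-0123456789abcdef0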

Concepts / Conclusion

AWS Elastic Container Registry (ECR) leverages S3 on the back end. While Amazon provides VPC endpoints to communicate with ECR (API and DKR), these operate purely at the 'API level'. Our assumption was that ECR could be treated as a unit and any back-end dependencies would be handled on the AWS side, which unfortunately proved not to be the case. In order for Fargate to pull a container image, it has to have a channel open to S3 to pull the image and start the task. While there is documentation that describes this, it was hard for us to identify, and we ended up spinning our wheels for a while.

As we began rolling out this deployment concept across other environments, we discovered that we would get the same types of error messages whenever the Fargate task required the use of other VPC endpoints. In our case, the task would fail to start when the CloudWatch Logs endpoint was not configured correctly (subnets, inbound security group settings).

The key takeaway for us is to do better POCs of AWS services prior to implementing them in production; we were not anticipating this snag, and it impacted our timetable to deploy.