Deploying the Nvidia Triton Inference Server on Amazon ECS

Step-by-step guide to Triton deployment on ECS using CDK

NVIDIA Triton Inference Server is open-source software that ML teams can use to deploy their models. It supports model formats from TensorFlow, PyTorch, ONNX, and other popular frameworks, and provides a set of features to manage them.

Concurrent model execution and dynamic batching are two Triton features you may find particularly interesting, as they allow you to run multiple models on the same GPU resources and increase inference throughput.

You can host Triton in multiple ways on AWS. In this post, I show how you can deploy it on Amazon ECS, using the AWS CDK.

Image by author

We will deploy two image classification models in Triton, using S3 as the model repository and a Triton base image from the NGC catalog. It is an example Triton setup that you can adjust to your needs.

Visiting the Nvidia Triton documentation, the ECS workshop, and the CDK workshop could be a good start if those things sound new to you.

Walkthrough overview

We will deploy the inference server in three steps:

  • We will first download the models from the Triton examples, upload them to an S3 bucket, and prepare the container image for deployment.
  • Then, we will deploy the Inference Server on ECS with the AWS CDK.
  • Finally, I will show how you can test your inference server with a client script.

Below is the architecture overview for our multi-model inference server:

Architecture overview: Triton Inference Server on ECS

Prerequisites

To go through this example, make sure you have the following:

  1. Familiarity with Nvidia Triton.
  2. An AWS account where the server will be deployed.
  3. AWS CDK installed and configured. Make sure to have the credentials and permissions to deploy the stack into your account.
  4. Docker to build the Triton container image and push it to ECR.
  5. This GitHub repository cloned into your environment to follow the steps.

Step 1: Preparing model repository and server image

The first ingredient we need for inference is the models themselves. Thankfully, Nvidia provides a fetch_models.sh script you can run to download example models into your environment. I will use the inception_graphdef and densenet_onnx models here; feel free to replace them with your own.
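
If you have not fetched the example models yet, a few commands along these lines will do it (the script location is taken from the Triton quickstart and may change between releases):

git clone https://github.com/triton-inference-server/server.git
cd server/docs/examples
./fetch_models.sh
# the example models land in the model_repository/ folder next to the script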

Using S3 as model repository

Nvidia Triton uses model repositories to store and version models. The directories and files that compose a model repository must follow a required layout to be read by the server. Triton supports Amazon S3 as a source for model repositories.

You can create an S3 bucket and upload the models to it, as I did below:

Image by author: My S3 bucket after uploading the inception_graphdef and densenet_onnx models
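
If you prefer scripting the upload over using the console, an aws s3 sync along these lines works too (the bucket name is a placeholder, and the local path assumes you cloned the Triton server repository and ran fetch_models.sh as above):

# Triton expects the layout <model-name>/config.pbtxt plus
# <model-name>/<version>/<model file>, e.g. densenet_onnx/1/model.onnx
aws s3 mb s3://<your-model-bucket>
aws s3 sync server/docs/examples/model_repository/ s3://<your-model-bucket>/model_repository/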

Preparing the Triton Inference Server image

The second ingredient we need is the Triton Inference Server itself. We will use a container base image from the NGC catalog with version 21.05 of Triton, and simply add the AWS CLI to the image so we can run AWS commands from it.

The Dockerfile for the Triton server image just needs to extend this base image and add the AWS CLI.
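
A minimal sketch of what it can look like (the pip-based CLI install and the run.sh entrypoint are assumptions based on the description in this post):

# Triton 21.05 base image from the NGC catalog
FROM nvcr.io/nvidia/tritonserver:21.05-py3

# add the AWS CLI so the container can pull models from S3
RUN pip3 install awscli

# startup script that syncs the model repository and launches tritonserver
COPY run.sh /opt/run.sh
RUN chmod +x /opt/run.sh
CMD ["/opt/run.sh"]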

We also add a run.sh script, which is in charge of starting tritonserver in the container.

This script downloads the models from S3 into the /tmp folder so Triton can use them. aws s3 sync helps with this and uses the IAM role of the task when running on ECS.
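
A minimal sketch of such a run.sh could look like this (the MODEL_REPOSITORY environment variable and the exact /tmp path are assumptions matching the description above):

#!/bin/bash
# copy the model repository from S3 to local disk, using the IAM role of the ECS task
aws s3 sync ${MODEL_REPOSITORY} /tmp/model_repository

# start Triton against the local copy of the repository
tritonserver --model-repository=/tmp/model_repository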

Note: Triton could point to the S3 bucket directly if you set AWS credentials in the container, but I have seen issues on GitHub mentioning this as a challenge. This may evolve over time, so feel free to propose other options in the comments.

Step 2: Deploying your inference server with CDK

Here we launch our Triton Inference Server on ECS. We will deploy this stack with CDK, as we did in some of my previous posts.

The main CDK stack defining our infrastructure is in the GitHub repository you cloned earlier. Make sure you add your S3 model repository location to it.

The stack creates an IAM role, a VPC, and an ApplicationLoadBalancedEc2Service to host the Triton server.
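
As a rough sketch, the core of such a stack can look like the following with the aws_ecs_patterns module (resource names, sizes, and the bucket placeholder are illustrative, not the exact stack):

from aws_cdk import core, aws_ec2 as ec2, aws_ecs as ecs, aws_ecs_patterns as ecs_patterns, aws_iam as iam

class TritonEcsStack(core.Stack):
    def __init__(self, scope: core.Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # networking and an ECS cluster backed by a GPU instance
        vpc = ec2.Vpc(self, "TritonVpc", max_azs=2)
        cluster = ecs.Cluster(self, "TritonCluster", vpc=vpc)
        cluster.add_capacity(
            "GpuCapacity",
            instance_type=ec2.InstanceType("g4dn.2xlarge"),
            machine_image=ecs.EcsOptimizedImage.amazon_linux2(ecs.AmiHardwareType.GPU),
            desired_capacity=1,
        )

        # task definition running the Triton container built from the local Dockerfile
        task_definition = ecs.Ec2TaskDefinition(self, "TritonTask")
        container = task_definition.add_container(
            "triton",
            image=ecs.ContainerImage.from_asset("docker"),  # folder containing the Dockerfile and run.sh
            gpu_count=1,
            memory_limit_mib=16000,
            environment={"MODEL_REPOSITORY": "s3://<your-model-bucket>/model_repository"},
        )
        container.add_port_mappings(ecs.PortMapping(container_port=8000))

        # load-balanced ECS service in front of the Triton task
        service = ecs_patterns.ApplicationLoadBalancedEc2Service(
            self, "TritonService",
            cluster=cluster,
            task_definition=task_definition,
            public_load_balancer=True,
        )

        # allow the task to read the model repository from S3
        task_definition.task_role.add_managed_policy(
            iam.ManagedPolicy.from_aws_managed_policy_name("AmazonS3ReadOnlyAccess")
        )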

Here are a few considerations you may have when deploying your server:

  • We use a g4dn.2xlarge with one GPU to colocate the two image classification models. If you use a multi-GPU instance, you can set gpu=n and let Triton do the scaling across GPUs.
  • Similarly, this example ECS service autoscales based on CPU utilization (see the sketch after this list). If your model inference is mostly GPU-bound, you may want to adjust the scaling policy.
  • Here, we use port 8000 to listen for HTTP requests, but feel free to also expose 8001 for gRPC and 8002 for metrics as needed.
  • In this illustrative example, the load balancer is launched in a public subnet and is internet-facing. For security purposes, you may want to provision an internal load balancer in your VPC private subnets, where there is no direct connectivity to the outside world.
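
Continuing from the stack sketch above, a CPU-based task scaling policy can be expressed like this in CDK (capacity bounds and target value are illustrative):

# scale the number of Triton tasks on CPU utilization; swap in another metric
# (e.g. a custom GPU utilization metric) if your workload is GPU-bound
scaling = service.service.auto_scale_task_count(min_capacity=1, max_capacity=4)
scaling.scale_on_cpu_utilization(
    "CpuScaling",
    target_utilization_percent=60,
)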

Deploying your inference service

You can execute the following commands to install CDK and make sure you have the right dependencies to deploy the service:

python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt
npm install -g aws-cdk@1.102.0

Once this is installed, you can execute the following commands to deploy the inference service into your account:

ACCOUNT_ID=$(aws sts get-caller-identity --query Account | tr -d '"')
AWS_REGION=$(aws configure get region)
cdk bootstrap aws://${ACCOUNT_ID}/${AWS_REGION}
cdk deploy --all --require-approval never

The first two commands get your account ID and current AWS region using the AWS CLI on your computer.

We then use cdk bootstrap and cdk deploy to build the container image locally, push it to ECR, and deploy the stack. This will take a few minutes to complete.

Image by author: Once complete, you can copy your load balancer URI from the Outputs tab of the stack
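
If you prefer the command line to the console, you can also read the stack outputs with the AWS CLI (the stack name is a placeholder):

aws cloudformation describe-stacks --stack-name <your-stack-name> --query "Stacks[0].Outputs"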

Step 3: Testing your inference service

We are now ready to serve image classification predictions with Triton!

For this, Nvidia provides example C++ and Python client libraries to talk with Triton. We will follow the image classification example and use image_client.py to send images to Triton.

You can execute the following commands to install the tritonclient library and make sure you have the right dependencies:

# install triton client and dependencies
# Note: Currently pip install is only available on Linux
pip install pillow
pip install nvidia-pyindex
pip install tritonclient[all]
pip install attrdict
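
Once the client library is installed, you can also query the server directly from Python. Below is a minimal liveness check (pass your load balancer host as the URL, without the http:// scheme):

# quick liveness / readiness check against the deployed Triton server
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="<YOUR LOAD BALANCER URI>")
print(client.is_server_live())                 # True once the server is up
print(client.is_model_ready("densenet_onnx"))  # True once the model is loaded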

You can send images to your Triton server with commands such as this one:

python image_client.py -u <YOUR LOAD BALANCER URI> -m densenet_onnx -s INCEPTION images/car.jpg

Image by author: I test this from a SageMaker Notebook Instance terminal

Conclusion

With Nvidia Triton, you can run multiple models, or multiple instances of a model, simultaneously on the same GPU resources. It is a good way to make sure you get high resource utilization when hosting your models for inference.

In this post, we deployed a Triton Inference Server with two image classification models on Amazon ECS, in three simple steps.

To go further, you can also visit this interesting post benchmarking Triton performance for transformer models.

ML Solutions Architect. Making MLOps easy, one post at a time.