Deploying a Multi-Model Inference Service With AWS Lambda, Synchronous Express Workflows, Amazon API Gateway, and CDK

Step-by-step guide to serverless inference with a DAG on AWS

Sofian Hamiti
The Startup


I recently published a post explaining the core concepts of deploying an ML model in a serverless inference service using AWS Lambda, Amazon API Gateway, and the AWS CDK.

Photo by Jon Tyson on Unsplash

For some use cases you and your ML team may need to implement a more complex inference workflow where predictions come from multiple models and are orchestrated with a DAG. On AWS, Step Functions Synchronous Express Workflows allow you to easily build that orchestration layer for your real-time inference services.

In this post, I will show how you can create a multi-model serverless inference service with AWS Lambda, Step Functions Synchronous Express Workflows, and Amazon API Gateway. For illustrative purposes, our service will serve house price predictions using 3 models trained on the Boston Housing Dataset with Scikit-Learn. We will use the AWS CDK to deploy the inference stack into an AWS account.

Walkthrough overview

We will deploy the inference service in 3 steps:

  • We will first train 3 models with scikit-learn, upload the binaries to an S3 bucket, and prepare the 3 Lambda functions for inference.
  • Then we will define a state machine to orchestrate the inference workflow.
  • Finally, I will show how you can deploy the inference service into your account with the AWS CDK.

Below is the architecture overview for our multi-model inference service:

Architecture overview of the multi-model inference service

Prerequisites

To go through this example, make sure you have the following:

  1. Some familiarity with AWS Lambda, API Gateway, Step Functions, and the AWS CDK. The Lambda tutorials, Getting Started with API Gateway, Create a Serverless Workflow, and the CDK workshop are good starting points if those sound new to you.
  2. An AWS account where the service will be deployed
  3. AWS CDK installed and configured. Make sure to have the credentials and permissions to deploy the stack into your account
  4. Docker to build the Lambda container images and push them to ECR
  5. Python to train the models and define the CDK stack
  6. This GitHub repository cloned into your environment to follow the steps
git clone https://github.com/SofianHamiti/aws-lambda-multi-model-express-workflow.git

Step 1: Creating the inference Lambda Functions

Train the models and upload the binaries to S3

The first ingredients you need in your ML inference service are the models themselves. Here we train a Linear Regression, a Random Forest, and a Support Vector model. For your convenience, I have prepared 3 notebooks in the train_models folder; you can run them in your environment to create the models.

Below is the train_random_forest notebook as an example:
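A minimal sketch of what that notebook does, assuming scikit-learn's load_boston (removed in scikit-learn 1.2) and joblib for serializing the model:

import joblib
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Load the Boston Housing dataset and split it into train/test sets
data = load_boston()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Train the Random Forest model on the train set
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(f'Test R^2: {model.score(X_test, y_test):.3f}')

# Save the model binary locally as a .pkl file
joblib.dump(model, 'random_forest.pkl')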

Each notebook uses the Boston Housing dataset from scikit-learn, splits it into train/test sets, trains a model with the train set, and saves the binary to your local folder as a .pkl file.

You will need to upload the files into an S3 bucket of your choice, as shown below:
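For example, with the AWS CLI (the bucket name, prefix, and file names below are placeholders to adapt to your own setup):

aws s3 cp linear_regression.pkl s3://your-model-bucket/models/linear_regression.pkl
aws s3 cp random_forest.pkl s3://your-model-bucket/models/random_forest.pkl
aws s3 cp svm.pkl s3://your-model-bucket/models/svm.pkl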

The model S3 URIs will be used in the Lambda functions for inference.

Create the inference Lambda handlers

An AWS Lambda function handler is the method in your function code that processes requests, in our case sent by the State Machine. When invoked, Lambda runs the handler containing our inference logic and returns a response with model predictions.

In our case we have 3 handlers, one per model; you can find the code in the predict.py files under the lambda_images folder.

The following predict.py file contains the inference handler for the Random Forest model:
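The exact code is in the repository; as a minimal sketch, assuming joblib-serialized models and a payload with a "data" key holding a list of feature vectors, it could look like this:

import os
import boto3
import joblib

# BUCKET and KEY are declared as Lambda environment variables in step 3
BUCKET = os.environ['BUCKET']
KEY = os.environ['KEY']

# Copy the model from S3 into /tmp and load it once, at cold start,
# so warm invocations reuse the model already in memory
s3 = boto3.client('s3')
s3.download_file(BUCKET, KEY, '/tmp/model.pkl')
model = joblib.load('/tmp/model.pkl')

def handler(event, context):
    # The State Machine passes the request payload to the function;
    # 'data' is assumed to hold a list of feature vectors
    prediction = model.predict(event['data']).tolist()
    return {
        'model': 'random_forest',
        'prediction': prediction
    }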

This script will be copied to the Lambda container image at build time. You can also copy the model binary into the container image, depending on how tightly coupled you need the inference code and model to be.

In our case, we use boto3 to copy the model from S3 into the /tmp folder. The BUCKET and KEY variables contain the S3 bucket and key of the model and will be declared in step 3 as Lambda environment variables. Implementing it this way allows you to swap between different model versions without changing the container images themselves.

The rest of the code in predict.py reads data from the request event, predicts house prices using the loaded model, and returns a response with the predictions.

The first execution of your Lambda functions might take up to a few seconds to complete, depending on the size of your model and the complexity of your environment. This is called a cold start, and you can visit the following blog post if you need to reduce it: Predictable start-up times with Provisioned Concurrency

Package your inference handlers in containers

We can now package our code, runtime, and dependencies in container images. We use an AWS-provided base image for Python, based on Amazon Linux.

You can find the 3 Dockerfiles for our Lambda functions in the lambda_images folder.

Below is the Dockerfile for the Random Forest model:
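As a rough sketch, assuming the AWS-provided Python base image from the public ECR gallery and a requirements.txt listing scikit-learn, joblib, and boto3:

# Start from the AWS-provided base image for Python
FROM public.ecr.aws/lambda/python:3.8

# Install the inference dependencies
COPY requirements.txt ./
RUN pip install -r requirements.txt

# Copy the inference handler into the image
COPY predict.py ./

# Tell Lambda which handler to run when the function is invoked
CMD [ "predict.handler" ]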

We use CMD [ "predict.handler" ] to tell Lambda to use the handler function in predict.py when invoked.

You do not need to build and push the container image to Amazon ECR. AWS CDK will do this automatically in step 3.

Deploy your Lambda functions with a CDK nested stack

We will use CDK and nested stacks in step 3 to deploy the components of our service. You can find lambda_stack.py in the stacks folder. Breaking the stack down into nested stacks allows you to add new models and update the DAG with minimal effort.
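A condensed sketch of what one function definition in that nested stack could look like, assuming CDK v1 constructs and placeholder resource names:

from aws_cdk import core
from aws_cdk import aws_lambda as _lambda

class LambdaStack(core.NestedStack):
    def __init__(self, scope, id, **kwargs):
        super().__init__(scope, id, **kwargs)
        self.functions = {}
        # One container-based function per model; shown here for the Random Forest
        self.functions['RandomForest'] = _lambda.DockerImageFunction(
            self, 'RandomForestFunction',
            code=_lambda.DockerImageCode.from_image_asset('lambda_images/random_forest'),
            memory_size=1024,
            timeout=core.Duration.seconds(30),
            environment={
                'BUCKET': 'your-model-bucket',      # placeholder: your S3 bucket
                'KEY': 'models/random_forest.pkl'   # placeholder: your model key
            }
        )
        # The function role also needs read access to the model object in S3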

Step 2: Using a State Machine Express workflow to orchestrate the inference

AWS Step Functions Express Workflows use fast, in-memory processing for high-event-rate workloads of up to 100,000 events per second. They can also be invoked in response to HTTP requests via API Gateway, and support synchronous requests. See New Synchronous Express Workflows for AWS Step Functions for more details.

They are ideal for high-volume, low-latency event-processing workloads, and allow us to build the orchestration logic of our real-time inference service.

Orchestrate your inference workflow with State Machines

In our example service we want to send data to the API, and receive a combined list of predictions from the 3 models. We will use the Parallel state from Step Functions to concurrently call the 3 inference Lambda functions before sending back the response to the client.

You can find the State Machine nested stack in state_machine_stack.py, and the API Gateway stack in api_stack.py.
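As an illustrative sketch, assuming CDK v1 Step Functions constructs and a dictionary of inference Lambda functions passed in from the Lambda nested stack, the Parallel state could be defined like this:

from aws_cdk import core
from aws_cdk import aws_stepfunctions as sfn
from aws_cdk import aws_stepfunctions_tasks as tasks

class StateMachineStack(core.NestedStack):
    def __init__(self, scope, id, functions, **kwargs):
        super().__init__(scope, id, **kwargs)
        # Fan out to the inference functions in parallel; their outputs are
        # gathered into a single array returned to the caller (scatter-gather)
        parallel = sfn.Parallel(self, 'MultiModelPrediction')
        for name, function in functions.items():
            parallel.branch(
                tasks.LambdaInvoke(
                    self, f'Predict{name}',
                    lambda_function=function,
                    payload_response_only=True
                )
            )
        self.state_machine = sfn.StateMachine(
            self, 'InferenceStateMachine',
            definition=parallel,
            state_machine_type=sfn.StateMachineType.EXPRESS
        )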

Below is a visualization of our inference graph; feel free to adjust the workflow based on your needs:

Multi-model prediction workflow with scatter-gather pattern

Step 3: Deploying your inference service with CDK

We will deploy our inference service in the same way as we did in my previous post.

Below is the main CDK stack defining our inference service infrastructure:
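The stack in the repository is the reference; a condensed sketch of how the nested stacks could be wired together, with illustrative module and attribute names:

from aws_cdk import core
from stacks.lambda_stack import LambdaStack
from stacks.state_machine_stack import StateMachineStack
from stacks.api_stack import ApiStack

class InferenceStack(core.Stack):
    def __init__(self, scope, id, **kwargs):
        super().__init__(scope, id, **kwargs)
        # Nested stack with the 3 container-based inference functions
        lambdas = LambdaStack(self, 'Lambdas')
        # Nested stack with the Express state machine orchestrating them
        workflow = StateMachineStack(self, 'StateMachine', functions=lambdas.functions)
        # Nested stack exposing the state machine through API Gateway
        ApiStack(self, 'Api', state_machine=workflow.state_machine)

app = core.App()
InferenceStack(app, 'multi-model-inference-service')
app.synth()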

This stack will create the Lambda functions, the State Machine, the API Gateway, and their respective IAM roles.

When testing the API, we will send POST requests with data from a client. API Gateway will forward the requests to the State Machine orchestrating the Lambda functions and return the response to our client.

You can execute the following commands to install CDK and make sure you have the right dependencies to deploy the service:

npm install -g aws-cdk@1.89.0
python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt

Once this is installed, you can execute the following commands to deploy the inference service into your account:

ACCOUNT_ID=$(aws sts get-caller-identity --query Account | tr -d '"')
AWS_REGION=$(aws configure get region)
cdk bootstrap aws://${ACCOUNT_ID}/${AWS_REGION}
cdk deploy --require-approval never

The first 2 commands will get your account ID and current AWS region using the AWS CLI on your computer.

We then use cdk bootstrap and cdk deploy to build the container images locally, push them to ECR, and deploy the inference service stack. This will take a few minutes to complete.

You can follow the deployment in the AWS CloudFormation console.

Testing your API

When your stack is created, you can navigate to the API Gateway service console and copy your API URL.

We use the $default stage here just for testing purposes. See Publishing REST APIs for customers to invoke for guidance on publishing an API.

The following is a test data point from the Boston Housing dataset you can add to your request body:
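For example, the first record of the dataset has the following 13 feature values; the "data" key matches the payload shape assumed in the handler sketch above, and the repository's actual request format may differ:

{
  "data": [[0.00632, 18.0, 2.31, 0.0, 0.538, 6.575, 65.2, 4.09, 1.0, 296.0, 15.3, 396.9, 4.98]]
}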

You can use a tool like Postman to test the inference API from your computer.
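If you prefer the command line, an equivalent curl request against your API URL (placeholder below) would look like this:

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"data": [[0.00632, 18.0, 2.31, 0.0, 0.538, 6.575, 65.2, 4.09, 1.0, 296.0, 15.3, 396.9, 4.98]]}' \
  https://your-api-id.execute-api.your-region.amazonaws.com/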
