5 Simple Steps to MLOps with GitHub Actions, MLflow, and SageMaker Pipelines

Kick-start your path to production with a project template

Sofian Hamiti
Towards Data Science


Earlier this year, I published a step-by-step guide to automating an end-to-end ML lifecycle with built-in SageMaker MLOps project templates and MLflow. It brought workflow orchestration, model registry, and CI/CD under one umbrella to reduce the effort of running end-to-end MLOps projects.

Photo by NASA on Unsplash

In this post, we will go a step further and define an MLOps project template based on GitHub, GitHub Actions, MLflow, and SageMaker Pipelines that you can reuse across multiple projects to accelerate your ML delivery.

We will take an example model trained using Random Forest on the California House Prices dataset, and automate its end-to-end lifecycle until deployment into a real-time inference service.

Walkthrough overview

We will tackle this in 5 steps:

  • We will first set up a development environment with an IDE and an MLflow tracking server, and connect GitHub Actions to your AWS account.
  • Second, I will show how you can experiment and collaborate easily with your team members.
  • Third, we will package code into containers and run it in scalable SageMaker jobs.
  • Then, we will automate your model build workflow with a SageMaker Pipeline and schedule it to run once a week.
  • Finally, we will deploy a real-time inference service in your account with a GitHub Actions-based CI/CD pipeline.

Note: There can be more to MLOps than deploying inference services (e.g., data labeling, data versioning, model monitoring), and this template should give you enough structure to tailor it to your needs.

Prerequisites

To go through this example, make sure you have the following:

  1. Visited Introducing Amazon SageMaker Pipelines if SageMaker Pipelines sound new to you.
  2. Familiarity with MLOps with MLFlow and Amazon SageMaker Pipelines.
  3. Familiarity with GitHub Actions. Github Actions — Everything You Need to Know to Get Started could be a good start if it sounds new to you.
  4. This GitHub repository cloned into your environment.

Step 1: Setting up your project environment

Image by author: Architecture overview for the project.

We will use the following components in the project:

  • SageMaker for container-based jobs, model hosting, and ML pipelines.
  • MLflow for experiment tracking and model registry.
  • API Gateway for exposing our inference endpoint behind an API.
  • GitHub as the code repository, with GitHub Actions for CI/CD and ML pipeline scheduling.

If you work in an enterprise, this setup may be done for you by IT admins.

Working from your favourite IDE

For productivity, make sure you work from an IDE you are comfortable with. Here, I host VS Code on a SageMaker Notebook Instance and will also use SageMaker Studio to visualize the ML pipeline.

Image by author

See Host code-server on Amazon SageMaker for install instructions.

Setting up a central MLflow tracking server

We need a central MLflow tracking server to collaborate on experiments and register models. If you don’t have one, you can follow the instructions in this blog post to deploy the open source version of MLflow on AWS Fargate.

Image by author: Once deployed, make sure you keep the load balancer URI somewhere. We will use it in our project so code, jobs, and pipelines can talk to MLflow.

You can also swap MLflow for native SageMaker options, Weights & Biases, or any other tool of your choice.

Connecting GitHub Actions to your AWS account

Next, we will use OpenID Connect (OIDC) to allow GitHub Actions workflows to access resources in your account, without needing to store AWS credentials as long-lived GitHub secrets. See Configuring OpenID Connect in Amazon Web Services for instructions.

Image by author: We add the GitHub OIDC identity provider to IAM and configure an IAM role that will trust it.

You can set up a github-actions role with the following trust relationships:
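
If you prefer scripting this setup, below is a minimal boto3 sketch of such a trust policy. The account ID, repository filter, and role name are placeholder assumptions; you can equally paste the JSON policy directly into the IAM console.

import json
import boto3

iam = boto3.client("iam")

# Placeholders: replace with your AWS account ID and GitHub org/repo.
ACCOUNT_ID = "111122223333"
GITHUB_REPO = "my-org/my-mlops-repo"

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Let GitHub Actions workflows assume the role through the OIDC provider.
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{ACCOUNT_ID}:oidc-provider/token.actions.githubusercontent.com"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringLike": {
                    "token.actions.githubusercontent.com:sub": f"repo:{GITHUB_REPO}:*"
                }
            },
        },
        {
            # Let SageMaker, Lambda, and API Gateway assume the same role.
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "sagemaker.amazonaws.com",
                    "lambda.amazonaws.com",
                    "apigateway.amazonaws.com",
                ]
            },
            "Action": "sts:AssumeRole",
        },
    ],
}

iam.create_role(
    RoleName="github-actions",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)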

We add SageMaker as a principal so we can run jobs and pipelines directly from GitHub workflows, and do the same for Lambda and API Gateway.

For illustrative purposes, I attached the AdministratorAccess managed policy to the role. Make sure you tighten permissions in your environment.

Setting up GitHub repo secrets for the project

Finally, we store the AWS account ID, region name, and github-actions role ARN as secrets in the GitHub repo. This can be sensitive information and will be used securely by your GitHub workflows. See Encrypted secrets for details.

Image by author: Make sure the secret names map to the names used in your workflows.

We are now ready to go!

Step 2: Experimenting and collaborating in your project

You can find the experiment folder in the repo with example notebooks and scripts. It is typically the place where you start the project and try to figure out approaches to your ML problem.

Below is the main notebook showing how to train a model with Random Forest on the California House Prices dataset, and do basic prediction:
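
If you just want the gist of the notebook, its core boils down to something like the sketch below (the train/test split and hyperparameters are illustrative):

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Load the California House Prices dataset as a DataFrame.
data = fetch_california_housing(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Train a Random Forest regressor with illustrative hyperparameters.
model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
model.fit(X_train, y_train)

# Do a basic prediction and evaluate it.
predictions = model.predict(X_test)
print("RMSE:", mean_squared_error(y_test, predictions, squared=False))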

It is a simple example to follow along in our end-to-end project, and you can run pip install -r requirements.txt to work with the same dependencies as your team members.

This experimental phase of the ML project can be fairly unstructured, and you can decide with your team how you want to organize the sub-folders. Whether you use notebooks or Python scripts is also totally up to you.

You can save local data and files in the data and model folders. I have added them to .gitignore so you don’t end up pushing big files to GitHub.

Structuring your repo for easy collaboration

You can structure your repo any way you want. Just keep in mind that ease of use and reproducibility are key for productivity in your project. So here I have put the whole project in a single repo and tried to find the balance between python project conventions and MLOps needs.

You can find below the folder structure with descriptions:

Tracking experiments with MLflow

You can track experiment runs with MLflow, whether you run code in your IDE or in SageMaker Jobs. Here, I log runs under the housing experiment.
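
As a minimal sketch, logging a run from the training code above could look like this, reusing the train/test split from the earlier snippet. The tracking URI placeholder is the load balancer URI from Step 1, and the parameter and metric names are illustrative:

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# X_train, X_test, y_train, y_test come from the earlier train/test split.
TRACKING_URI = "http://<your-mlflow-load-balancer-uri>"  # from Step 1

mlflow.set_tracking_uri(TRACKING_URI)
mlflow.set_experiment("housing")

with mlflow.start_run():
    params = {"n_estimators": 100, "max_depth": 10}
    model = RandomForestRegressor(**params, random_state=42)
    model.fit(X_train, y_train)

    rmse = mean_squared_error(y_test, model.predict(X_test), squared=False)

    # Log parameters, metrics, and the model artifact to the tracking server.
    mlflow.log_params(params)
    mlflow.log_metric("rmse", rmse)
    mlflow.sklearn.log_model(model, artifact_path="model")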

Image by author

You can also find example labs in this repo for reference.

Step 3: Moving from local compute to container-based jobs in SageMaker

Running code locally can work in early project stages. However, at some point you will want to package dependencies into reproducible Docker images, and use SageMaker to run scalable, container-based jobs. I recommend reading A lift and shift approach for getting started with Amazon SageMaker if this sounds new to you.

Breaking down the workflow into jobs

You can break down your project workflow into steps. We split ours into two: we run data processing in SageMaker Processing jobs, and model training in SageMaker Training jobs.

Image by author: The processing jobs run and output a CSV file to S3. The file's S3 location is then used when launching the training jobs.

Building containers and pushing them to ECR

The Dockerfiles for our jobs are in the docker folder, and you can run the following shell command to build the images and push them to ECR:

sh scripts/build_and_push.sh <ecr-repo-name> <dockerfile-folder>

Using configuration files in the project

To prevent hardcoding, we need a place to hold our jobs’ parameters. These can include container image URIs, the MLflow tracking server URI, entry point script locations, instance types, and hyperparameters used by your code running in SageMaker jobs.

We will use model_build.yaml for this. Its YAML structure makes it easy to extend and maintain over time. Make sure to add your MLflow server URI and freshly pushed container image URIs to the config before running jobs.

Running containerized jobs in SageMaker

You are now ready to execute run_job.py and run your code in SageMaker jobs. It will read the config and use code from src/model_build to launch the Processing and Training jobs.

SageMaker will inject prepare.py and train.py at run time into their respective containers, and use them as entry point.
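
For orientation, here is a condensed sketch of what run_job.py could look like with the SageMaker Python SDK. The config keys, S3 locations, and script paths are assumptions based on the repo layout described above; the actual script in the repo is the reference.

import yaml
from sagemaker.estimator import Estimator
from sagemaker.processing import ProcessingOutput, ScriptProcessor

# Load job parameters (image URIs, instance types, MLflow URI, ...) from the config.
with open("model_build.yaml") as f:  # assumed config location
    config = yaml.safe_load(f)

role = config["role_arn"]  # assumed key names throughout

# Processing job: prepare.py is injected into the processing container as entry point.
processor = ScriptProcessor(
    image_uri=config["processing"]["image_uri"],
    command=["python3"],
    role=role,
    instance_count=1,
    instance_type=config["processing"]["instance_type"],
    env={"MLFLOW_TRACKING_URI": config["mlflow_tracking_uri"]},
)
processor.run(
    code="src/model_build/prepare.py",
    outputs=[ProcessingOutput(source="/opt/ml/processing/output", destination=config["data_s3_uri"])],
)

# Training job: train.py is injected into the training container as entry point.
estimator = Estimator(
    image_uri=config["training"]["image_uri"],
    role=role,
    instance_count=1,
    instance_type=config["training"]["instance_type"],
    entry_point="src/model_build/train.py",
    hyperparameters=config["training"]["hyperparameters"],
    environment={"MLFLOW_TRACKING_URI": config["mlflow_tracking_uri"]},
)
estimator.fit({"train": config["data_s3_uri"]})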

Step 4: Automating your model building

So you have successfully experimented locally and run your workflow steps as container-based jobs in SageMaker. Now you can automate this process. Let’s call it the model_build process, as it relates to everything happening before model versions are registered in MLflow.

We want to automate the container image building, tie our ML workflow steps into a pipeline, and automate the pipeline creation into SageMaker. We will also schedule the pipeline executions.

Building container images automatically with GitHub workflows

In the previous step, we pushed container images to ECR with a script, and ran them in SageMaker jobs. Moving into automation, we use this GitHub workflow to handle the process for us.

Image by author

The workflow looks at Dockerfiles in the docker folder and triggers when changes occur on the repo main branch. Under the hood it uses a composite GitHub action that takes care of logging in to ECR, building the images, and pushing them.

The workflow also tags the container images based on the GitHub commit to ensure traceability and reproducibility of your ML workflow steps.

Tying our ML workflow steps into a pipeline in SageMaker

Next, we define a pipeline in SageMaker to run our workflow steps. You can find our pipeline in the src/model_build folder. It basically runs the processing step, gets its output data location, and triggers a training step. As with the jobs, the pipeline executions use parameters defined in our model_build.yaml.
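
For reference, a stripped-down version of such a pipeline definition could look like the sketch below. The step names, parameters, and the way the processing output feeds the training step are simplified assumptions; the definition in src/model_build is the source of truth.

from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.processing import ProcessingOutput, ScriptProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

role = "<execution-role-arn>"  # placeholder

processor = ScriptProcessor(
    image_uri="<processing-image-uri>",
    command=["python3"],
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
)
step_process = ProcessingStep(
    name="PrepareData",
    processor=processor,
    code="src/model_build/prepare.py",
    outputs=[ProcessingOutput(output_name="train", source="/opt/ml/processing/output")],
)

estimator = Estimator(
    image_uri="<training-image-uri>",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
)
step_train = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={
        # The processing step output location is passed to the training step.
        "train": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri
        )
    },
)

pipeline = Pipeline(name="housing-model-build", steps=[step_process, step_train])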

I have added scripts/submit_pipeline.py to the repo to help you create or update the pipeline in SageMaker on demand. It can also help you debug and run the pipeline when needed.
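
A minimal version of such a helper only needs a couple of SDK calls. The sketch below assumes a get_pipeline helper returning the Pipeline object, while the --execute flag mirrors the repo script:

import argparse

from pipeline import get_pipeline  # assumed helper returning the Pipeline object

parser = argparse.ArgumentParser()
parser.add_argument("--execute", action="store_true", help="start an execution after the upsert")
parser.add_argument("--role-arn", required=True)
args = parser.parse_args()

pipeline = get_pipeline()

# Create the pipeline if it does not exist, or update its definition if it does.
pipeline.upsert(role_arn=args.role_arn)

if args.execute:
    execution = pipeline.start()
    print(f"Started pipeline execution: {execution.arn}")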

Image by author: You can see your updated pipeline in SageMaker Studio and can run executions with the submit_pipeline.py --execute command.

Once happy with the pipeline, we automate its management using the update-pipeline GitHub workflow. It looks for pipeline changes in the main branch and runs submit_pipeline to create or update it.

Scheduling our SageMaker Pipeline with GitHub Actions

We can apply different triggers to the pipeline, and here we will schedule its executions with the schedule-pipeline GitHub workflow. It uses a cron expression (0 12 * * 5) to run the pipeline at 12:00 UTC on Fridays.

This basic scheduling example can work for some of your use cases and feel free to adjust pipeline triggers as you see fit. You may also want to point the model_build configuration to a place where new data comes in.

Image by author

After each pipeline execution you will see a new model version appear in MLflow. Those are model versions we want to deploy into production.

Step 5: Deploying your inference service into production

Now that we have model versions regularly coming into the model registry, we can deploy them into production. This is the model_deploy process.

Our real-time inference service

We will build a real-time inference service for our project. For this, we want to get model artifacts from the model registry, build an MLflow inference container, and deploy them into a SageMaker endpoint. We will expose our endpoint via a Lambda function and API that a client can call for predictions.
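
The Lambda function essentially just forwards the request payload to the SageMaker endpoint and returns the prediction. A minimal sketch of such a handler could look like this (the environment variable name and payload handling are assumptions):

import json
import os

import boto3

# Endpoint name passed to the function as an environment variable (assumed).
ENDPOINT_NAME = os.environ["SAGEMAKER_ENDPOINT_NAME"]
runtime = boto3.client("sagemaker-runtime")

def lambda_handler(event, context):
    # With API Gateway proxy integration, the request payload sits in event["body"].
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=event["body"],
    )
    prediction = response["Body"].read().decode("utf-8")
    return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}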

Image by author

In case you need to run predictions in batch, you can build an ML inference pipeline using the same approach as we took for model_build.

Pushing the inference container image to ECR

Alongside the ML model, we need a container image to handle the inference in our SageMaker Endpoint. Let’s push the one provided by MLflow into ECR.

I have added the build-mlflow-image GitHub workflow to automate this; it runs the mlflow sagemaker build-and-push-container command to build the image and push it to ECR.

Defining our API stack with CDK

We use CDK to deploy our inference infrastructure and define our stack in the model_deploy folder. app.py is our main stack file. You will see it read the model_deploy config and create the SageMaker endpoint, a Lambda function acting as request proxy, and an API using API Gateway.

Make sure you update your model_deploy config with container image and MLflow tracking server URIs before deploying.
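
To give you an idea of what app.py assembles, here is a rough CDK sketch. Resource names, config keys, and the instance type are illustrative assumptions, and the real stack in the repo carries more detail.

import aws_cdk as cdk
from aws_cdk import (
    aws_apigateway as apigw,
    aws_lambda as _lambda,
    aws_sagemaker as sagemaker,
)
from constructs import Construct


class InferenceStack(cdk.Stack):
    def __init__(self, scope: Construct, construct_id: str, *, config: dict, **kwargs):
        super().__init__(scope, construct_id, **kwargs)

        # SageMaker model pointing at the MLflow inference image and model.tar.gz in S3.
        model = sagemaker.CfnModel(
            self, "Model",
            execution_role_arn=config["sagemaker_role_arn"],
            primary_container=sagemaker.CfnModel.ContainerDefinitionProperty(
                image=config["inference_image_uri"],
                model_data_url=config["model_data_s3_uri"],
            ),
        )

        endpoint_config = sagemaker.CfnEndpointConfig(
            self, "EndpointConfig",
            production_variants=[
                sagemaker.CfnEndpointConfig.ProductionVariantProperty(
                    model_name=model.attr_model_name,
                    variant_name="AllTraffic",
                    initial_instance_count=1,
                    initial_variant_weight=1.0,
                    instance_type="ml.m5.large",
                )
            ],
        )

        endpoint = sagemaker.CfnEndpoint(
            self, "Endpoint",
            endpoint_config_name=endpoint_config.attr_endpoint_config_name,
        )

        # Lambda function acting as a request proxy in front of the endpoint.
        proxy_fn = _lambda.Function(
            self, "ProxyFunction",
            runtime=_lambda.Runtime.PYTHON_3_9,
            handler="handler.lambda_handler",
            code=_lambda.Code.from_asset("lambda"),
            environment={"SAGEMAKER_ENDPOINT_NAME": endpoint.attr_endpoint_name},
        )

        # REST API that forwards requests to the Lambda function.
        api = apigw.LambdaRestApi(self, "InferenceApi", handler=proxy_fn)
        cdk.CfnOutput(self, "ApiUrl", value=api.url)


app = cdk.App()
InferenceStack(app, "model-deploy", config={
    "sagemaker_role_arn": "<role-arn>",
    "inference_image_uri": "<mlflow-inference-image-uri>",
    "model_data_s3_uri": "s3://<bucket>/model.tar.gz",
})
app.synth()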

Deploying into production with a multi-stage CI/CD pipeline

We use a trunk-based approach to deploy our inference API into production. Essentially, we use a multi-stage GitHub workflow hooked to the repo main branch to build, test, and deploy our inference service.

Image by author

The CI/CD workflow is defined in deploy-inference and has 4 steps:

  • build reads a chosen model version binary from MLflow (defined in config) and uploads its model.tar.gz to S3. This is done by mlflow_handler, which also saves the model S3 location in AWS SSM for use in later CI/CD stages. The MLflow handler also transitions the model into Staging in the model registry (see the sketch after this list).
  • deploy-staging deploys the CDK stack into staging so we can run tests on the API before going into production. The job uses a composite GitHub action I have built for deploying CDK templates into AWS.
  • test-api does basic testing of the inference service in Staging. It sends an example payload to the API and checks if the response status is OK. If OK, the MLflow handler will transition the model into Production in the model registry. Feel free to add more tests as you see fit.
  • deploy-prod deploys the CDK stack into production.

Images by author: Model stages are transitioned in MLflow as the model progresses through the pipeline, and archived when a new version comes in.
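
The stage transitions handled by mlflow_handler boil down to a call like the sketch below, where the registered model name, version, and tracking URI are illustrative:

from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://<your-mlflow-load-balancer-uri>")

# Promote the chosen model version and archive whatever was previously in that stage.
client.transition_model_version_stage(
    name="housing",                  # registered model name (assumed)
    version="3",                     # version picked up by the CI/CD run (illustrative)
    stage="Staging",                 # then "Production" once test-api passes
    archive_existing_versions=True,
)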

Using your inference service

When your service is successfully deployed into production, you can navigate to the AWS CloudFormation console, look at the stack Outputs, and copy your API URL.

Image by author

You are now ready to call your inference API and can use the following example data point in the request body:
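
As an illustration, a request with the first row of the California Housing dataset could look like the sketch below. The exact JSON shape expected by the MLflow inference container depends on your model flavor and MLflow version, so treat the payload layout as an assumption:

import requests

# API URL copied from the CloudFormation stack outputs (placeholder).
API_URL = "https://<api-id>.execute-api.<region>.amazonaws.com/prod/"

# Feature order: MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude
payload = {"instances": [[8.3252, 41.0, 6.9841, 1.0238, 322.0, 2.5556, 37.88, -122.23]]}

response = requests.post(API_URL, json=payload)
print(response.status_code, response.json())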

You can use tools like Postman to test the inference API from your computer:

Image by author: The API will return the predicted house price value in a few milliseconds.

Conclusion

In this post, I have shared with you an MLOps project template putting experiment tracking, workflow orchestration, model registry, and CI/CD under one umbrella. Its key goal is to reduce the effort of running end-to-end MLOps projects and accelerate your delivery.

It uses GitHub, GitHub Actions, MLflow, and SageMaker Pipelines, and you can reuse it across multiple projects.

To go further in your learning, you can visit Awesome SageMaker and find, in a single place, all the relevant and up-to-date resources needed for working with SageMaker.
