Industrializing an ML platform with Amazon SageMaker Studio
Steps and considerations when rolling out Studio in an enterprise
Often in large enterprises, ML platform admins need to balance governance and compliance requirements with the need for ML teams to quickly access working environments and the scaffolding to operationalize their solutions.
In SageMaker terms, this translates into accessing secure, well-governed working environments with Studio, and provisioning templated MLOps projects with Pipelines.
SageMaker provides both out of the box, and in this post I will share how an ML platform team can organize, standardize, and expedite their provisioning.
Walkthrough overview
We will tackle this in three steps:
- We will first set up multi-account foundations and self-service provisioning so ML teams can access approved Studio environments in minutes.
- Then, we will see how you can manage Studio on a day-to-day basis.
- Finally, I will show how you can enable templated MLOps projects with SageMaker Pipelines.
Prerequisites
To follow this post, make sure you:
- Understand that, like DevOps, MLOps is a combination of culture, practices, and tooling, rather than tooling alone. Make sure your enterprise has a well-defined operating model and governance processes for ML; MLOps project templates should just be the technical implementation of those.
- Are familiar with SageMaker Studio’s architecture.
- Have read Setting up secure, well-governed machine learning environments on AWS. Adopting a multi-account strategy will play the most important role in how you meet governance, security, and operational requirements in your ML platform.
- Have familiarity with SageMaker Pipelines and MLOps project templates.
Step 1: Allowing end-users to access Studio
First, we want ML teams to be able to self-provision Studio notebooks so, within minutes, they can start working in approved environments.
This will be the V1 of your ML platform, a sprint 1 your team can work on.
Setting up Studio domains in Project Dev accounts
In this post, I shared example scenarios and corresponding account patterns you can adopt to enable ML projects.
When a new ML project arises, you can create a set of accounts dedicated to it, and locate the project team Studio domain in the Project Dev account.
From their Studio domain in Project Dev account, the team can connect to team-wide repositories in the Team Shared Services account and share assets such as code, containers, and ML models. As accounts in the Workloads OU implement corporate guardrails, the ML project team can securely access data from the Data Lake account via a secure gateway. Engineered features developed during this project can be promoted to the feature store in the Data account for later use.
Enabling self-service provisioning of approved Studio resources
You can enable self-service provisioning of approved Studio Domains and User Profiles using AWS Service Catalog. The project team can access Service Catalog portfolios either from the Team Shared Services account or from the Project Dev account.
See Enabling self-service provisioning of Amazon SageMaker Studio resources and learn how to create a Service Catalog portfolio for Studio.
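As a sketch of what self-service provisioning looks like from the API side, the snippet below builds the parameters for Service Catalog's provision_product call. The product and artifact IDs, and the UserProfileName parameter, are placeholders for whatever your portfolio's product actually defines.

```python
def build_provision_request(product_id: str, artifact_id: str, user: str) -> dict:
    """Build the parameters for Service Catalog's provision_product call,
    provisioning an approved Studio user profile for a team member.
    The IDs and the UserProfileName parameter are placeholders for
    whatever your portfolio's product actually defines."""
    return {
        "ProductId": product_id,
        "ProvisioningArtifactId": artifact_id,
        "ProvisionedProductName": f"studio-user-{user}",
        "ProvisioningParameters": [
            {"Key": "UserProfileName", "Value": user},
        ],
    }

# In real use (requires AWS credentials and a portfolio shared with the account):
# import boto3
# boto3.client("servicecatalog").provision_product(
#     **build_provision_request("prod-xxxx", "pa-xxxx", "alice"))
```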
Key considerations for your Studio setup
At the domain level:
- Networking for users’ notebooks is managed at the Studio domain level. See Securing Amazon SageMaker Studio connectivity using a private VPC for more details.
- A default IAM execution role needs to be assigned to the domain. You can choose to use this default role for all of its user profiles.
- MLOps project templates are Service Catalog products. They become accessible in Studio, and visible in the AWS Service Catalog console, after you enable the required permissions. You can run a Lambda function to enable them, as in this example.
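The Lambda function mentioned above can be sketched as follows. boto3's SageMaker client exposes enable_sagemaker_servicecatalog_portfolio for this; the client is passed in as a parameter here to keep the function testable. This is a simplified sketch, not the exact code from the linked example.

```python
def enable_mlops_templates(sagemaker_client) -> str:
    """Enable the SageMaker-provided Service Catalog portfolio of MLOps
    project templates for this account, then return its status.
    In real use, pass boto3.client("sagemaker")."""
    sagemaker_client.enable_sagemaker_servicecatalog_portfolio()
    status = sagemaker_client.get_sagemaker_servicecatalog_portfolio_status()
    return status["Status"]

# In real use (requires AWS credentials and the relevant IAM permissions):
# import boto3
# enable_mlops_templates(boto3.client("sagemaker"))
```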
At the user profile level:
- Authentication in Studio can be done via either SSO or IAM-based methods. Enterprises may have an existing identity provider that federates users when they access the AWS console. If this is your case, you can assign a Studio user profile to each federated identity, using IAM. See the Assigning the policy to Studio users section in Configuring Amazon SageMaker Studio for teams and groups with complete resource isolation for more details.
- IAM execution roles can also be assigned to each user profile. When using Studio, a user assumes the role mapped to their user profile, which overrides the domain default execution role. This can be used for fine-grained access control within the ML project team, for example to separate what data scientists and ML engineers can do in the project.
- Cost tracking can be done at the account level, as each project has dedicated accounts. AWS has built-in support to consolidate and report costs across your entire set of accounts in the ML platform. If you need costs tracked at the user level, you can apply cost allocation tags to user profiles, for example tags composed of user ID and project ID/cost centre.
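A minimal sketch of such tagging is shown below, assuming illustrative tag keys (your naming convention will differ). Tags are applied to the user profile ARN with the SageMaker add_tags API.

```python
def build_cost_allocation_tags(user_id: str, project_id: str, cost_centre: str) -> list:
    """Cost-allocation tags for a Studio user profile.
    The tag keys here are illustrative, not a prescribed convention."""
    return [
        {"Key": "studio-user", "Value": user_id},
        {"Key": "project-id", "Value": project_id},
        {"Key": "cost-centre", "Value": cost_centre},
    ]

# In real use (requires AWS credentials; profile_arn is the user profile's ARN):
# import boto3
# boto3.client("sagemaker").add_tags(
#     ResourceArn=profile_arn,
#     Tags=build_cost_allocation_tags("jdoe", "fraud-detection", "cc-1234"))
```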
You should now have multi-account foundations and self-service provisioning of approved Studio environments in your ML platform. Your ML teams can now access those environments and work in notebooks using pre-built images.
Step 2: Managing Studio on a day-to-day basis
Here we dive into common customizations ML platform teams may apply to their Studio environments.
This can be a sprint 2, but bear in mind customizations may represent a continuous effort for your platform team, as project teams’ needs evolve over time.
Managing users’ permissions with IAM policies
When a user opens Studio, they assume the execution role associated with their user profile, so permissions need to be controlled on that role.
You can keep permission management simple and use the same IAM role for all user profiles working on a project. This can work since projects are already isolated at the account level. Alternatively, you can use multiple IAM roles if ML teams have personas requiring different permissions.
SageMaker provides service-specific resources, actions, and condition context keys for use with IAM. Also see this page for managing permissions to other AWS services.
At a minimum, your ML teams will want to:
- Be able to launch SageMaker Processing, Training, Hyperparameter Tuning, and Autopilot jobs.
- Have read access to the SageMaker console page to monitor job status.
- Have read access to Amazon CloudWatch Logs: anything their containers send to stdout or stderr is sent there, and they will need it to debug jobs and endpoints.
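As a starting point, a policy along these lines covers the needs above. This is a simplified sketch, not a least-privilege policy: scope down Resource and add condition keys to match your governance model.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowLaunchingJobs",
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateProcessingJob",
        "sagemaker:CreateTrainingJob",
        "sagemaker:CreateHyperParameterTuningJob",
        "sagemaker:CreateAutoMLJob"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowMonitoringAndDebugging",
      "Effect": "Allow",
      "Action": [
        "sagemaker:Describe*",
        "sagemaker:List*",
        "logs:DescribeLogStreams",
        "logs:GetLogEvents",
        "logs:FilterLogEvents"
      ],
      "Resource": "*"
    }
  ]
}
```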
Managing IAM permissions at scale
At some point, you may have hundreds of ML teams working on your platform and need a way to scale permissions management. Automating the process with CI/CD pipelines can help.
See Validate IAM policies in CloudFormation templates using IAM Access Analyzer for more details.
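Such a pipeline can call IAM Access Analyzer's validate_policy API on every policy before deployment. The helper below is a sketch with the client injected as a parameter; in real use you would pass boto3.client("accessanalyzer").

```python
def lint_policy(analyzer_client, policy_json: str) -> list:
    """Return IAM Access Analyzer findings for an identity policy.
    An empty list means no issues were flagged. A CI/CD pipeline can
    fail the build when findings are returned (sketch only)."""
    response = analyzer_client.validate_policy(
        policyDocument=policy_json,
        policyType="IDENTITY_POLICY",
    )
    return response["findings"]

# In real use (requires AWS credentials):
# import boto3
# findings = lint_policy(boto3.client("accessanalyzer"), policy_json)
# assert not findings, f"Policy validation failed: {findings}"
```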
Restricting instance types in Studio
You may want to restrict the instance types ML teams can use in Studio to keep costs under control, for example allowing only certain CPU instance types by default, and GPU ones only when required.
For this, you can adjust user profiles' IAM permissions. In particular, it's the sagemaker:CreateApp permission you need to restrict on the execution role, and sagemaker:InstanceTypes is the condition key to use.
Here is an example policy denying all SageMaker instance types other than t3 instances:
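The sketch below is one way to write it. Note that the system instance type is also allowed, so the default JupyterServer app is not blocked:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyNonT3StudioApps",
      "Effect": "Deny",
      "Action": "sagemaker:CreateApp",
      "Resource": "*",
      "Condition": {
        "ForAnyValue:StringNotLike": {
          "sagemaker:InstanceTypes": [
            "ml.t3.*",
            "system"
          ]
        }
      }
    }
  ]
}
```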
Using Lifecycle Configurations to tailor Studio environments
Studio supports lifecycle configurations. They provide a way to automatically and repeatably apply customizations to Studio environments, including:
- Installing custom packages
- Configuring auto-shutdown of inactive notebook apps
- Setting up Git configuration
See Customize Amazon SageMaker Studio using Lifecycle Configurations for more details.
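As a sketch, lifecycle configuration scripts are registered base64-encoded via the create_studio_lifecycle_config API. The package installed here is purely illustrative.

```python
import base64


def encode_lifecycle_script(script: str) -> str:
    """Studio lifecycle configurations take their shell script as
    base64-encoded text, so encode it before registering."""
    return base64.b64encode(script.encode("utf-8")).decode("utf-8")


# Illustrative script: install an extra package in every new kernel app.
INSTALL_SCRIPT = """#!/bin/bash
set -eux
pip install --quiet great_expectations
"""

# In real use (requires AWS credentials), register the config and attach it
# to the domain or user profile:
# import boto3
# boto3.client("sagemaker").create_studio_lifecycle_config(
#     StudioLifecycleConfigName="install-packages",
#     StudioLifecycleConfigContent=encode_lifecycle_script(INSTALL_SCRIPT),
#     StudioLifecycleConfigAppType="KernelGateway",
# )
```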
Automating the Setup of SageMaker Studio Custom Images
Your ML teams may need additional libraries to run their notebooks, beyond the ones provided in the pre-built images. Studio allows you to create containers with their favourite libraries and attach them as custom images to their domain.
You can implement simple continuous delivery for custom images to automate that process. See Automating the Setup of SageMaker Studio Custom Images for more details.
You can even package this as a Service Catalog product so it is accessible on demand.
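Once an image and its app image config exist, attaching them to a domain is an update_domain call. The helper below builds the relevant DefaultUserSettings fragment; the image and config names are placeholders.

```python
def build_custom_image_settings(image_name: str, config_name: str) -> dict:
    """DefaultUserSettings fragment attaching a custom kernel image to a
    Studio domain. The image and its app image config must already exist;
    names here are placeholders."""
    return {
        "KernelGatewayAppSettings": {
            "CustomImages": [
                {
                    "ImageName": image_name,
                    "AppImageConfigName": config_name,
                }
            ]
        }
    }

# In real use (requires AWS credentials; domain ID is a placeholder):
# import boto3
# boto3.client("sagemaker").update_domain(
#     DomainId="d-xxxx",
#     DefaultUserSettings=build_custom_image_settings("team-image", "team-image-config"))
```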
Adding more products in your Service Catalog portfolios
Over time, you can add more products in Service Catalog and enable ML teams to do more in their projects.
Below are example products your teams may need:
- Amazon Cloud9 environments, as an alternative IDE to Jupyter.
- Container image builder pipelines: a code repo plus a CI pipeline for building Docker containers used in SageMaker jobs. The pipeline can enforce linting and security scans on the images.
- Feature Groups in SageMaker Feature Store.
- Amazon EMR clusters for processing data with Spark from Studio.
- Amazon EKS clusters
- Anything defined via infrastructure-as-code :)
Step 3: Enabling MLOps projects with SageMaker Pipelines
ML teams may need scaffolding for workflow orchestration, model registry, and CI/CD to reduce the effort of running end-to-end MLOps projects. SageMaker Pipelines addresses this need.
In sprint 3, you can package MLOps project templates and allow ML teams to self-provision them via Service Catalog.
Building MLOps project templates for cross-account deployments
For MLOps, you can create Project PreProd and Project Prod accounts alongside the Project Dev accounts that ML teams already have access to. This provides a high level of resource and security isolation between the operational stages of their deployments. See account pattern 3 here, and Build a Secure Enterprise Machine Learning Platform on AWS for more details.
You can create custom project templates from scratch or modify the ones provided by SageMaker.
Your MLOps project templates need to be in a Service Catalog portfolio shared with the Team Shared Services/Automation account. When a template is launched, it creates code repos, cross-account CI/CD pipelines, and artifact stores in this account. ML teams can access those from the Project Dev account using cross-account IAM permissions.
Using third party tooling in your MLOps templates
Your enterprise may have standardized tooling for code repos and CI/CD.
Below are example integrations you can do in your MLOps project templates:
- For CodeCommit, CodeBuild, and CodePipeline, you can simply reuse the abalone example and make it work cross-account.
- Third-party source control with GitHub and Jenkins.
- GitLab and GitLab Pipelines.
- SageMaker has a native model registry, but you can also use MLflow if needed. See MLOps with MLFlow and Amazon SageMaker Pipelines.
You can even build templates that solve recurring ML use cases.
Conclusion
In this post, I have shared how an ML platform team can enable distributed ML teams to deliver their projects with SageMaker Studio, and key considerations in doing so.
The approach is based on multi-account foundations and AWS Service Catalog to allow self-service provisioning of approved Studio environments and MLOps project templates.
To go further, learn how AstraZeneca has built an industrialized ML platform on AWS to streamline pharmaceutical drug discovery, clinical trials, and patient safety for hundreds of scientists.