Labeling data with Label Studio on SageMaker

Step-by-step guide to deploying Label Studio on a Notebook Instance

Sofian Hamiti
Geek Culture


Joint post with Phil Meakins.

A few years ago, in my data scientist days, I worked on a computer vision project in which my team spent weeks tweaking NN layers and hyperparameters to get better performance. It turned out that simply improving the quality of the dataset labels was the key to good models. Does that sound familiar?

Photo by USGS on Unsplash

In the SageMaker world you can use Ground Truth and Ground Truth Plus to create high-quality training datasets.

In this post, I will show how you can set up Label Studio on a SageMaker Notebook Instance and access it in your web browser.

Screenshot from the Label Studio website

We will run Label Studio on a Notebook Instance, access its interface in the same web browser, and label images from the Hot Dog - Not Hot Dog dataset.

Walkthrough overview

We will tackle this in 3 steps:

  • We will first see what Label Studio is, why running it on SageMaker is a good idea, and introduce our example dataset.
  • Then, we will set up Label Studio in our Notebook Instance.
  • Finally, we will label a few images from the Hot Dog - Not Hot Dog dataset.

Prerequisites

To go through this example, make sure you have the following:

  1. Access to a SageMaker Notebook Instance. A small one should be enough for labeling data.
  2. Familiarity with Label Studio. You can visit Introducing Label Studio, a Swiss Army knife of data labeling if it sounds new to you.
  3. The Hot Dog - Not Hot Dog dataset (or any of your choice).

Step 1: When open source meets managed cloud environments

Label Studio is a popular open source data labeling tool

It lets you label data types like audio, text, images, videos, and time series with a simple UI and export to various model formats. For example, I have met people who label large medical images with it.

You can install it via pip or run it in Docker, and access it through a web browser. In our case, we will use the Docker option.

The following intro video shows how to work with the tool:

Using SageMaker as managed hosting for Label Studio

Hosting Label Studio on a SageMaker Notebook Instance is a more convenient option than DIY'ing the setup on EC2. Notebook Instances give you access to a managed ML environment based on Jupyter. SageMaker manages the creation of the underlying instances and the authorization proxy, so you can focus on labeling data.

We will host the Label Studio container on the Notebook Instance and link the container storage to the instance EBS volume. The Notebook Instance comes with a proxy that allows us to access Label Studio in the browser.

“It only does hot dogs? No. and not hot dog”

I’m a fan of Silicon Valley, so we will use images from Hot Dog - Not Hot Dog as an example. It’s a simple image classification dataset, but feel free to use any dataset of your choosing.

Image from post

Step 2: Setting up Label Studio on a Notebook Instance

We will set up Label Studio in 3 parts:

Access the instance and launch a Terminal

You can launch a terminal from the Jupyter or JupyterLab interface of your Notebook Instance. If you use Jupyter, the button should be in the top-right corner:

Image by author

Run Label Studio in Docker

Before launching Label Studio with Docker, we need to copy the base of the instance URL.

Image by author: copy the base URL of your Instance. It should look like https://your-notebook-instance-name.notebook.your-region.sagemaker.aws

We use this base URL in the --host parameter of the Label Studio command, so the proxy knows how to redirect the web pages correctly.

Then, paste the base URL into the following command and run it in the terminal:

It will pull the image from Docker Hub, run the Label Studio container, and attach the /home/ec2-user/SageMaker/mydata folder on the instance EBS volume to the container storage. This will allow you to persist the labeled data.

The command should take a few seconds to run, and that is all for this part.
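If you want to script this step, here is a minimal Python sketch that assembles a docker run command of the kind described above and prints it so you can paste it into the terminal. Only the --host parameter, port 8080, and the /home/ec2-user/SageMaker/mydata folder come from this post. The image name heartexlabs/label-studio:latest, the container path /label-studio/data, and the /proxy/8080 suffix appended to the host are my assumptions, so check them against the Label Studio documentation before relying on them.

```python
import shlex

# Assumptions, not confirmed by this post: image name and container data path.
IMAGE = "heartexlabs/label-studio:latest"
CONTAINER_DATA_DIR = "/label-studio/data"


def build_docker_command(base_url,
                         host_dir="/home/ec2-user/SageMaker/mydata",
                         port=8080,
                         proxy_path="/proxy/8080"):
    """Return the docker run command as a list of arguments.

    base_url is the base URL of your Notebook Instance.
    proxy_path is an assumption about what the Jupyter proxy expects.
    """
    host = base_url.rstrip("/") + proxy_path
    return [
        "docker", "run", "-it",
        "-p", f"{port}:{port}",                    # expose the Label Studio port
        "-v", f"{host_dir}:{CONTAINER_DATA_DIR}",  # persist data on the EBS volume
        IMAGE,
        "label-studio", "--host", host,
    ]


cmd = build_docker_command(
    "https://your-notebook-instance-name.notebook.your-region.sagemaker.aws"
)
print(shlex.join(cmd))
```

The function only builds and prints the command. It does not run Docker, so you can inspect the result before executing it yourself.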

Image by author: You should see the following after running the docker command.

Access Label Studio on your web browser

Now copy the URL of your terminal tab, replace /terminal/1 with /proxy/8080/, and paste it into a new browser tab:

Images by author: On the left is my original terminal URL. I replaced /terminal/1 with /proxy/8080/ and pasted it in a new browser tab.
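The same URL change can be sketched in a few lines of Python. The helper name terminal_to_proxy_url is hypothetical, and it assumes the terminal URL contains a /terminal segment as in the screenshot; a JupyterLab URL may look different.

```python
from urllib.parse import urlsplit, urlunsplit


def terminal_to_proxy_url(terminal_url, port=8080):
    """Turn a Jupyter terminal URL into the proxy URL for the given port."""
    parts = urlsplit(terminal_url)
    if "/terminal" not in parts.path:
        raise ValueError("expected a URL containing /terminal")
    # Keep whatever precedes /terminal and point at the proxy path instead.
    prefix = parts.path.split("/terminal", 1)[0]
    return urlunsplit(
        (parts.scheme, parts.netloc, f"{prefix}/proxy/{port}/", "", "")
    )


proxy_url = terminal_to_proxy_url(
    "https://your-notebook-instance-name.notebook.your-region.sagemaker.aws/terminal/1"
)
print(proxy_url)
```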

And voila.

Image by author: Label Studio will load and you can log in with your account

Step 3: Labeling data with Label Studio

Now for the fun part!

You can download the dataset onto your laptop, create a new labeling project, and upload a few images into Label Studio.

Images by author: I use the Image Classification template and create 2 classes for hot dog/not hot dog

Note: If your data is in S3, you can use the AWS CLI or other means to copy it onto the instance EBS volume.
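As a rough illustration of the AWS CLI route, the sketch below builds an aws s3 sync command and prints it. The bucket name my-example-bucket, the prefix, and the images subfolder are hypothetical placeholders, and the Notebook Instance role needs read access to the bucket for the command to work.

```python
import shlex


def build_s3_sync_command(s3_uri,
                          local_dir="/home/ec2-user/SageMaker/mydata/images"):
    """Return an `aws s3 sync` command that copies an S3 prefix to local_dir.

    local_dir defaults to a hypothetical subfolder on the instance EBS volume.
    """
    if not s3_uri.startswith("s3://"):
        raise ValueError("s3_uri must start with s3://")
    return ["aws", "s3", "sync", s3_uri, local_dir]


# Hypothetical bucket and prefix; replace with your own.
sync_cmd = build_s3_sync_command("s3://my-example-bucket/hotdog-not-hotdog/")
print(shlex.join(sync_cmd))
```

You can paste the printed command into the Notebook Instance terminal, or pass the list to subprocess.run if you prefer to stay in Python.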

And then you can start labeling and submit your work.

Images by author

The labeled results will be persisted in the /home/ec2-user/SageMaker/mydata folder of your Notebook Instance.

Image by author
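Once you have labeled some images, you may want a quick look at the class balance. The sketch below assumes you exported the project as JSON from the Label Studio UI and that the export follows the usual layout for a classification template: a list of tasks, each with annotations whose results hold a choices value. The sample records and file names are made up for illustration and are not output from this walkthrough.

```python
import json
from collections import Counter


def count_choices(tasks):
    """Count how often each class label appears across all annotations."""
    counts = Counter()
    for task in tasks:
        for annotation in task.get("annotations", []):
            for result in annotation.get("result", []):
                for choice in result.get("value", {}).get("choices", []):
                    counts[choice] += 1
    return counts


# Hypothetical sample mimicking the assumed export structure.
sample_export = json.loads("""
[
  {"id": 1, "data": {"image": "img_001.jpg"},
   "annotations": [{"result": [{"type": "choices",
                                "value": {"choices": ["Hot Dog"]}}]}]},
  {"id": 2, "data": {"image": "img_002.jpg"},
   "annotations": [{"result": [{"type": "choices",
                                "value": {"choices": ["Not Hot Dog"]}}]}]},
  {"id": 3, "data": {"image": "img_003.jpg"},
   "annotations": [{"result": [{"type": "choices",
                                "value": {"choices": ["Hot Dog"]}}]}]}
]
""")

label_counts = count_choices(sample_export)
print(dict(label_counts))
```

With a real export, you would replace the sample with json.load on the file you downloaded.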

Conclusion

High-quality labeled data is of utmost importance in an ML project. Alongside using Ground Truth, you can also host open source labeling tools on SageMaker.

In this post, I showed how you can set up and run Label Studio in a SageMaker Notebook Instance.

To go further, you can also read Hosting VS Code on SageMaker Studio and learn how to host VS Code in your cloud ML environment.
