Basic pipeline for AI model training and inference on AWS

Overview: In this lesson, we will deploy an AI model on SageMaker using a customized image. The deployment includes creating a training job and an endpoint server for handling predictions.

Note: The main focus of this lesson is to analyze the setup and workflow. Detailed step-by-step instructions on the console will be omitted.

Prerequisite

Basic Knowledge of AI

To easily follow the steps in this lesson, you need fundamental knowledge of AI, models, how they are created, and the processes of training and prediction.

Basic Knowledge of RESTful API and Hosting Servers (Optional)

This lab involves some basic knowledge of RESTful APIs, including methods like GET and POST, as well as hosting on an endpoint server. Understanding these concepts will help you better grasp the workflow. However, the hosting setup in this lesson is pre-configured, so you won’t need to focus on it.

Docker

Docker is a software platform that enables the creation, deployment, and management of applications within containers. The Customized Image used in this lesson will be built with Docker, so having a basic understanding of Docker will help you follow the steps smoothly.

If you haven’t installed Docker yet, you can download it from the Docker homepage.

Source Code

You can download the source code used in this lesson here.

Architecture

Before diving into the detailed steps, let’s first take a look at the AWS architecture and the internal structure of the image.

AWS Architecture

AWS architecture

Overview of Workflow:

After creating a Docker image on your local machine, it will be pushed to ECR (Elastic Container Registry), which manages container images. Then, an S3 bucket will be created to store the model and dataset, as well as the output of the training process.

Next, a Lambda function will be responsible for creating training jobs and inference endpoints using the model stored in S3 and the image stored in ECR.

Internal Architecture of the Image

Stack

Nginx is a web server that handles incoming HTTP requests and manages container traffic.
Gunicorn is a pre-forking WSGI server that runs multiple instances of the application and distributes the load between them.
Flask is a lightweight Python web framework used to configure request handling. It responds to requests sent to /ping and /invocations without requiring extensive configuration.

This is the internal architecture of a container image designed to handle prediction (inference) requests via HTTP. (For training, this setup is not required.)
It follows a typical Python-based architecture for SageMaker endpoints. You do not need to understand the internal workings of this stack, and you do not need to install or modify it: the provided code handles this without requiring changes.

Detailed Architecture

Before diving into the details, let’s take a look at the code organization of this sample project:

|--program
|       |--nginx.conf
|       |--predictor.py
|       |--serve
|       |--train
|       `--wsgi.py
|
|--Dockerfile
`--requirements.txt

The Dockerfile defines how the Docker image is built, and requirements.txt lists the Python libraries to install. The files inside the program directory define how the container (server) operates. Among them, nginx.conf, wsgi.py, and serve are pre-built and do not require modifications. The train file defines how the model is trained, while predictor.py specifies how the model serves predictions (handling HTTP requests and responses).

Dockerfile Analysis

FROM python:3.11.9-slim

RUN apt-get -y update && apt-get install -y --no-install-recommends \
         nginx \
         ca-certificates \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt /opt/program/requirements.txt
WORKDIR /opt/program/

RUN pip install --upgrade pip \
    && pip install --no-cache-dir -r requirements.txt

ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
ENV PATH="/opt/program:${PATH}"

#------------------------------------------------
# RUN mkdir -p /opt/ml/input/data/dataset
# RUN mkdir -p /opt/ml/input/data/model
# RUN mkdir -p /opt/ml/input/config
# RUN mkdir -p /opt/ml/output/data
# RUN mkdir -p /opt/ml/model

# COPY iris.csv /opt/ml/input/data/dataset/iris.csv
# COPY decision-tree-model.pkl /opt/ml/model/decision-tree-model.pkl
# COPY model.tar.gz /opt/ml/input/data/model/model.tar.gz
# COPY hyperparameters.json /opt/ml/input/config/hyperparameters.json
#------------------------------------------------

COPY program /opt/program

FROM python:3.11.9-slim: This is a pre-built image that includes Python. We pull it as a base to build our own customized image.
- The slim version is a lightweight variant that leaves out many common packages, which keeps the image small.
RUN apt-get -y update && apt-get install -y --no-install-recommends nginx ca-certificates && rm -rf /var/lib/apt/lists/*
- RUN apt-get -y update: Downloads the latest package lists.
- apt-get install -y --no-install-recommends nginx ca-certificates:
  - nginx: Installs the web server.
  - ca-certificates: Installs the CA certificate bundle used to verify SSL/TLS certificates.
  - --no-install-recommends: Installs only the essential dependencies, reducing image size.
- rm -rf /var/lib/apt/lists/*: Removes cached package lists after installation to save space.
COPY requirements.txt /opt/program/requirements.txt: Copies requirements.txt to /opt/program/, listing all required Python dependencies.
WORKDIR /opt/program/: Sets the working directory inside the container. Equivalent to running cd /opt/program.
RUN pip install --upgrade pip && pip install --no-cache-dir -r requirements.txt
- Upgrades pip to the latest version.
- Installs dependencies listed in requirements.txt. The --no-cache-dir flag prevents caching, reducing image size.
Environment Variables
- ENV PYTHONUNBUFFERED=TRUE: Outputs logs immediately instead of buffering them.
- ENV PYTHONDONTWRITEBYTECODE=TRUE: Prevents Python from generating .pyc bytecode files, reducing container size.
- ENV PATH="/opt/program:${PATH}": Adds /opt/program to the PATH environment variable for easier access to scripts.
COPY program /opt/program: Copies the entire program directory into /opt/program inside the image. This directory contains the code and configuration files needed for the container’s functionality.

With this setup, we now understand how our Docker image is structured. Next, we will explore the installed dependencies and their significance.

Analysis of requirements.txt

As we know, this file lists the necessary dependencies for the project. Specifically:

requests
flask
gunicorn
awscli

numpy
scipy
scikit-learn
joblib
pandas

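requests, flask, gunicorn: These libraries are used for hosting and HTTP communication. flask and gunicorn serve the API, while requests is an HTTP client library.
- They are essential for handling incoming and outgoing requests correctly.
awscli: Installs the AWS command-line interface in the image.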
numpy, scipy, scikit-learn, joblib, pandas: These are machine learning libraries used for model creation, training, and inference.
- The specific libraries needed depend on the chosen model and algorithm.
- For example, in this case, they are used to support Scikit-Learn’s Decision Tree Classifier.
- This part should be customized based on your specific needs.

With this, we have completed building the image and installing all necessary dependencies. Next, let’s explore the structure and configuration of this image.

Analysis of nginx.conf, wsgi.py, and serve

These files are pre-configured to handle server operations and do not affect the training or prediction process.

It is recommended not to modify these files.

Train Analysis

File structure defined by SageMaker

[DIR] /opt/ml
  [DIR] output
    [DIR] metrics
      [DIR] sagemaker
    [DIR] data
    [DIR] profiler
      [DIR] framework
  [DIR] model
  [DIR] input
    [DIR] config
      [FILE] init-config.json
      [FILE] metric-definition-regex.json
      [FILE] inputdataconfig.json
      [FILE] resourceconfig.json
      [FILE] hyperparameters.json
    [DIR] data
      [DIR] <channel name>
        [FILE] <file>
        ...
      [DIR] <channel name>
        [FILE] <file>
        ...
      [FILE] dataset-manifest
      [FILE] model-manifest
  [DIR] sagemaker
    [DIR] ssm
    [DIR] warmpoolcache

Let’s analyze each section of the train file

prefix = '/opt/ml/'

input_path = os.path.join(prefix, 'input/data')
output_path = os.path.join(prefix, 'output/data')
model_path = os.path.join(prefix, 'model')
param_path = os.path.join(prefix, 'input/config/hyperparameters.json')

dataset_channel_name='dataset'
dataset_training_path = os.path.join(input_path, dataset_channel_name)

model_channel_name='model'
model_training_path = os.path.join(input_path, model_channel_name)

Let’s understand the path variables defined here (we will analyze how they are used later):

prefix = '/opt/ml/': Defines the base path. SageMaker places the files it provides to the training container under this directory, so its layout is fixed by SageMaker rather than chosen by the user.
input_path = os.path.join(prefix, 'input/data'): Path to the input directories provided via channels.
output_path = os.path.join(prefix, 'output/data'): Location for additional training output. After training, SageMaker compresses this folder into output.tar.gz and uploads it to S3 as an output of the training job. Save any extra artifacts here (usually logs); in this lab, that is failure.log.
model_path = os.path.join(prefix, 'model'): Location for saving the model after training. If you want to persist the trained model, save the relevant files here. SageMaker automatically packages the entire directory into a .tar.gz file and uploads it to S3 as the output of the training job.
param_path = os.path.join(prefix, 'input/config/hyperparameters.json'): Location of hyperparameters for training. The definition of these parameters will be discussed later.
dataset_channel_name='dataset': During training, we primarily focus on model and data. This directory contains the dataset, and we will analyze how it is provided later.
model_channel_name='model': Similarly, this is where the model (if needed) is stored for continuing training from a previously saved state.
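To see how these variables fit together, the paths can be collected in one place. The following is a minimal sketch rather than part of the train file: the function name sagemaker_paths and the dictionary keys are made up for illustration, and posixpath is used instead of os.path only so that the result looks the same on any machine (inside the Linux container the two behave identically).

```python
import posixpath


def sagemaker_paths(prefix="/opt/ml/", dataset_channel="dataset", model_channel="model"):
    """Return the locations the train script reads from and writes to.

    The function name and the dictionary keys are illustrative assumptions;
    the path values mirror the variables defined in the train file.
    """
    input_path = posixpath.join(prefix, "input/data")
    return {
        # One sub-directory per input channel
        "dataset": posixpath.join(input_path, dataset_channel),
        "pretrained_model": posixpath.join(input_path, model_channel),
        # Hyperparameters passed to the training job
        "hyperparameters": posixpath.join(prefix, "input/config/hyperparameters.json"),
        # Packaged as model.tar.gz after training
        "model": posixpath.join(prefix, "model"),
        # Packaged as output.tar.gz after training
        "output": posixpath.join(prefix, "output/data"),
    }


if __name__ == "__main__":
    for name, path in sagemaker_paths().items():
        print(f"{name:17} {path}")
```

Each channel name simply becomes a sub-directory of input/data, which is why the channel names used in the training job request must match dataset_channel_name and model_channel_name in the script.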

def list_files_recursive(base_path, indent_level=0):
    try:
        for item in os.listdir(base_path):
            item_path = os.path.join(base_path, item)
            if os.path.isdir(item_path):
                print(' ' * indent_level * 2 + f"[DIR] {item}")
                # Recursively list subdirectories
                list_files_recursive(item_path, indent_level + 1)
            else:
                print(' ' * indent_level * 2 + f"[FILE] {item}")
    except FileNotFoundError as e:
        print(' ' * indent_level * 2 + f"FileNotFoundError: {e}")
    except PermissionError as e:
        print(' ' * indent_level * 2 + f"PermissionError: {e}")
    except Exception as e:
        print(' ' * indent_level * 2 + f"Exception: {e}")

This function simply lists the file structure from a given path. While not necessary for actual training, it is useful for testing and debugging.

def extract_tar_gz(file_path, extract_path='.'):
    try:
        with tarfile.open(file_path, 'r:gz') as tar:
            tar.extractall(path=extract_path)
            print(f"Extracted files to {extract_path}")
    except tarfile.TarError as e:
        print(f"TarError: {e}")
    except FileNotFoundError:
        print(f"FileNotFoundError: The file {file_path} does not exist.")
    except PermissionError:
        print(f"PermissionError: You do not have permission to access {file_path}.")
    except Exception as e:
        print(f"Exception: {e}")

As mentioned earlier, if training starts from a pre-existing model, it must be provided at model_training_path, usually as model.tar.gz. Since model.tar.gz is SageMaker’s standard training output format, it makes sense to supply the starting model in the same format. This function extracts the archive.

extract_tar_gz(model_training_path + '/model.tar.gz', extract_path=model_training_path): Extracts model.tar.gz in model_training_path and stores it in the same directory.
list_files_recursive('/opt'): Lists files from /opt (useful for debugging).
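If you need to prepare a model.tar.gz yourself, for example for local testing, a round trip like the one below may help. It is a sketch under assumptions: pack_model, unpack_model and round_trip are hypothetical helper names, and the "model" is just placeholder bytes rather than a real pickled classifier.

```python
import os
import tarfile
import tempfile


def pack_model(model_file, archive_path):
    """Create a gzip-compressed tar archive with the model file at its root."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(model_file, arcname=os.path.basename(model_file))


def unpack_model(archive_path, extract_path):
    """Extract the archive and return the sorted names of its entries."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(path=extract_path)
        return sorted(tar.getnames())


def round_trip():
    """Pack a placeholder model file, unpack it elsewhere, and report the result."""
    with tempfile.TemporaryDirectory() as workdir:
        model_file = os.path.join(workdir, "decision-tree-model.pkl")
        with open(model_file, "wb") as f:
            f.write(b"placeholder bytes, not a real model")

        archive = os.path.join(workdir, "model.tar.gz")
        pack_model(model_file, archive)

        target = os.path.join(workdir, "extracted")
        os.makedirs(target)
        names = unpack_model(archive, target)
        extracted = os.path.exists(os.path.join(target, "decision-tree-model.pkl"))
        return names, extracted


if __name__ == "__main__":
    print(round_trip())
```

Storing the file at the root of the archive (arcname without directories) means it lands directly in the extraction directory, which is where the train script looks for decision-tree-model.pkl.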

with open(param_path, 'r') as tc:
    trainingParams = json.load(tc)

Reads the file containing training parameters and stores it in trainingParams (if available).

input_files = [ os.path.join(dataset_training_path, file) for file in os.listdir(dataset_training_path) ]: Retrieves dataset files from dataset_training_path.
raw_data = [ pd.read_csv(file, header=None) for file in input_files if file.endswith(".csv")]: Reads CSV files from the previous step.
max_leaf_nodes = trainingParams.get('max_leaf_nodes', None): Retrieves the max_leaf_nodes parameter (if provided).
clf = load(model_training_path + '/decision-tree-model.pkl'): Loads the model that extract_tar_gz unpacked earlier, so that training can continue from it.
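One detail worth keeping in mind: SageMaker hyperparameters are passed as strings, so a value such as max_leaf_nodes arrives in hyperparameters.json as "5" rather than 5. The excerpt above does not show whether train converts it, so the following is only a sketch of one way to do that; read_max_leaf_nodes is a hypothetical helper name.

```python
import json


def read_max_leaf_nodes(params):
    """Return max_leaf_nodes as an int, or None when it is not set.

    `params` is the dictionary loaded from hyperparameters.json, where
    every value is a string.
    """
    value = params.get("max_leaf_nodes")
    if value is None or value == "":
        return None
    return int(value)


if __name__ == "__main__":
    # Example file content; the value 5 is made up for illustration.
    training_params = json.loads('{"max_leaf_nodes": "5"}')
    print(read_max_leaf_nodes(training_params))
```

Returning None when the key is absent keeps the classifier's default behaviour, which matches the .get('max_leaf_nodes', None) call in the train file.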

with open(os.path.join(model_path, 'decision-tree-model.pkl'), 'wb') as out:
    pickle.dump(clf, out)

Saves the model to model_path. SageMaker will later compress this directory into model.tar.gz and upload it to an S3 bucket.

except Exception as e:
    # Write out an error file. Because output_path is /opt/ml/output/data, this file
    # ends up in output.tar.gz. SageMaker fills the FailureReason of the
    # DescribeTrainingJob result only from the file /opt/ml/output/failure.
    trc = traceback.format_exc()
    with open(os.path.join(output_path, 'failure.log'), 'w') as s:
        s.write('Exception during training: ' + str(e) + '\n' + trc)
    # Printing this causes the exception to be in the training job logs, as well.
    print('Exception during training: ' + str(e) + '\n' + trc, file=sys.stderr)

If an error occurs, it is written to failure.log in output_path. SageMaker then compresses that folder into output.tar.gz and stores it in the S3 output location.
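If you also want the error to appear as the FailureReason of the training job, SageMaker expects a file named failure directly under /opt/ml/output, and the script should exit with a non-zero code. The sketch below shows that variant; it is not taken from the train file, and write_failure is a hypothetical helper name.

```python
import os
import tempfile
import traceback


def write_failure(exc, output_dir="/opt/ml/output"):
    """Write a description of the exception to <output_dir>/failure and return the path."""
    failure_path = os.path.join(output_dir, "failure")
    details = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    with open(failure_path, "w") as f:
        f.write("Exception during training: " + str(exc) + "\n" + details)
    return failure_path


if __name__ == "__main__":
    # Local demonstration in a temporary directory instead of /opt/ml/output.
    demo_dir = tempfile.mkdtemp()
    try:
        raise ValueError("dataset is empty")
    except ValueError as e:
        print(open(write_failure(e, demo_dir)).read())
```

In a real train script, the except block would call write_failure(e) and then sys.exit(255) (any non-zero code) so that the job is marked as failed.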

Analysis of predictor.py

This file defines how prediction requests are received and processed through an endpoint. The ScoringService class follows the singleton design pattern: the model is loaded once, on first use, and the same instance is reused for every subsequent request.

This file should be modified to suit your needs.

@classmethod
def get_model(cls):
    """Get the model object for this instance, loading it if it's not already loaded."""
    if cls.model is None:
        with open(os.path.join(model_path, "decision-tree-model.pkl"), "rb") as inp:
            cls.model = pickle.load(inp)
    return cls.model

This function loads the model if it has not been initialized and retrieves it if it has already been initialized.

Note: Unlike in train, the model loaded for prediction is automatically extracted. When provided as model.tar.gz, SageMaker will automatically extract it and place it in the model_path. In contrast, during train, model.tar.gz remains unchanged.
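The load-once behaviour can be checked in isolation. The sketch below mimics get_model with a stand-in class; LazyModel, load_count and model_file are names invented for this example, and a small dictionary is pickled in place of a trained classifier.

```python
import os
import pickle
import tempfile


class LazyModel:
    """Holds one shared model object that is loaded on first use."""

    model = None       # shared across all callers in this process
    load_count = 0     # counts how often the file was actually read
    model_file = None  # must be set before the first get_model() call

    @classmethod
    def get_model(cls):
        if cls.model is None:
            with open(cls.model_file, "rb") as inp:
                cls.model = pickle.load(inp)
            cls.load_count += 1
        return cls.model


def demo():
    """Call get_model twice and report whether the same object came back."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "model.pkl")
        with open(path, "wb") as out:
            pickle.dump({"kind": "stand-in for a trained model"}, out)

        LazyModel.model_file = path
        first = LazyModel.get_model()
        second = LazyModel.get_model()
        return first is second, LazyModel.load_count


if __name__ == "__main__":
    print(demo())
```

Note that the instance is shared per process. Since Gunicorn is a pre-forking server, each worker process would typically hold its own copy of the model.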

@classmethod
def predict(cls, input):
    """For the input, perform predictions and return them.

    Args:
        input (a pandas dataframe): The data on which to perform the predictions. There will be
            one prediction per row in the dataframe."""
    clf = cls.get_model()
    return clf.predict(input)

Implements the predict function for the model.

app = flask.Flask(__name__): Initializes a Flask app to serve predictions.

@app.route("/ping", methods=["GET"])
def ping():
    """Determine if the container is working and healthy. In this sample container, we declare
    it healthy if we can load the model successfully."""
    health = ScoringService.get_model() is not None  # You can insert a health check here

    status = 200 if health else 404
    return flask.Response(response="\n", status=status, mimetype="application/json")

Defines a GET method at /ping to check if the container is operational.

Note: This method is essential because SageMaker continuously pings the container to check its status. If this method is missing, it will cause errors.

@app.route("/invocations", methods=["POST"])
def transformation():
    """Perform inference on a single batch of data. This sample server accepts CSV data, converts
    it to a pandas dataframe, and then converts the predictions back to CSV (one prediction per line)."""
    data = None

    # Convert from CSV to pandas
    if flask.request.content_type == "text/csv":
        data = flask.request.data.decode("utf-8")
        s = io.StringIO(data)
        data = pd.read_csv(s, header=None)
    else:
        return flask.Response(
            response="This predictor only supports CSV data", status=415, mimetype="text/plain"
        )

    print("Invoked with {} records".format(data.shape[0]))

    # Perform the prediction
    predictions = ScoringService.predict(data)

    # Convert from numpy back to CSV
    out = io.StringIO()
    pd.DataFrame({"results": predictions}).to_csv(out, header=False, index=False)
    result = out.getvalue()

    return flask.Response(response=result, status=200, mimetype="text/csv")

This is the main function that defines how requests are handled. You should customize it to your needs.

Note: Do not modify @app.route("/invocations", methods=["POST"]), but you should customize the function def transformation():.

Summary: This file provides the two routes an endpoint needs: GET /ping for health checks and POST /invocations for predictions.
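When customizing transformation(), it can help to test the CSV handling separately from Flask. The sketch below does the same conversion with the standard library instead of pandas; parse_csv_payload and format_predictions are hypothetical helper names, and the assumption that every cell is numeric fits the iris features but may not fit your data.

```python
import csv
import io


def parse_csv_payload(body):
    """Turn a text/csv request body into a list of rows of floats.

    Assumes there is no header row and that every cell is numeric.
    """
    reader = csv.reader(io.StringIO(body))
    return [[float(cell) for cell in row] for row in reader if row]


def format_predictions(predictions):
    """Write one prediction per line, the shape the /invocations handler returns."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for prediction in predictions:
        writer.writerow([prediction])
    return out.getvalue()


if __name__ == "__main__":
    rows = parse_csv_payload("1,2,3,4\n5.1,3.5,1.4,0.2\n")
    print(rows)
    # The labels below are placeholders, not real model output.
    print(format_predictions(["label-a", "label-b"]))
```

In the Flask handler, the request body arrives as bytes, so it would be decoded with .decode("utf-8") before being passed to a function like parse_csv_payload.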

Analysis of ping_test.py and post_test.py

These two files are used for local testing to check if the image is working correctly before pushing it to ECR. They can be customized as needed.

ping_test.py

import requests

def ping_service(url):
    try:
        response = requests.get(url)
        print(f"Status Code: {response.status_code}")
        print(f"Response Text: {response.text}")

        # Check response format before parsing JSON
        try:
            response_json = response.json()
            print(f"Response JSON: {response_json}")
        except requests.exceptions.JSONDecodeError:
            print("Response is not in JSON format")
    
    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")

ping_service("http://localhost:8080/ping")

This script sends a GET request to http://localhost:8080/ping. Port 8080 is used because, in the provided nginx configuration, the program listens on port 8080.

post_test.py

import requests

url = "http://localhost:8080/invocations"

data = "1,2,3,4"

# Send POST request
response = requests.post(url, data=data, headers={'Content-Type': 'text/csv'})

# Print results
if response.status_code == 200:
    try:
        # Check and parse JSON data
        print(response.json())
    except requests.exceptions.JSONDecodeError:
        print("Received non-JSON response")
        print("Response text:", response.text)
else:
    print(f"Request failed with status code {response.status_code}")
    print("Response text:", response.text)

Similarly, this script performs invocation. It should be customized according to the specific project requirements.

Build Image Locally and Test

In this section, we will build the image locally, create an environment that simulates how it runs when deployed on SageMaker, and perform testing.

Prepare

To simulate the environment that SageMaker provides, we need to adjust the Dockerfile.

FROM python:3.11.9-slim

RUN apt-get -y update && apt-get install -y --no-install-recommends \
         nginx \
         ca-certificates \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt /opt/program/requirements.txt
WORKDIR /opt/program/
    
RUN pip install --upgrade pip \
    && pip install --no-cache-dir -r requirements.txt
    
ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
ENV PATH="/opt/program:${PATH}"

#------------------------------------------------
RUN mkdir -p /opt/ml/input/data/dataset
RUN mkdir -p /opt/ml/input/data/model
RUN mkdir -p /opt/ml/input/config
RUN mkdir -p /opt/ml/output/data
RUN mkdir -p /opt/ml/model

COPY iris.csv /opt/ml/input/data/dataset/iris.csv
COPY decision-tree-model.pkl /opt/ml/model/decision-tree-model.pkl
COPY model.tar.gz /opt/ml/input/data/model/model.tar.gz
COPY hyperparameters.json /opt/ml/input/config/hyperparameters.json
#------------------------------------------------

COPY program /opt/program

Here, we create the directories and files that SageMaker would otherwise provide at run time.

Build Image and Test

Build Image

build image

Step 1: Navigate to the source code directory.
Step 2: Run the command docker build -t decision-tree .

List Images

list image

Run the command: docker images to list all images. You can see that the image named decision-tree has been created.

Run Container and Train

run container and train

Run the command: docker run <image> train

Explanation: SageMaker starts a training container by running docker run <image> train, so the train script must be executable and found on the PATH. This is why the Dockerfile sets ENV PATH="/opt/program:${PATH}".

Run Container and Serve

Run the command: docker run -d -p 8080:8080 <image> serve

Explanation: Similarly, SageMaker starts an inference container by running docker run <image> serve. However, since we need to test the ping and prediction from our local machine, we add the following options:

-d: Runs the container in the background (detached mode), so the terminal stays free for running ping_test.py and post_test.py on our local machine.
-p 8080:8080: Maps port 8080 from the container to port 8080 on our local machine. Since the program inside the container is configured to run on port 8080, this mapping allows us to easily test ping_test.py and post_test.py.
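Because the container starts in the background, the server may need a moment before it answers. A small polling helper can wait for /ping to return 200 before the test scripts run. This is a sketch using only the standard library; ping_ok and wait_until are hypothetical names, and the timeout values are arbitrary.

```python
import time
import urllib.request


def ping_ok(url="http://localhost:8080/ping"):
    """Return True if the health check answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, timeouts and HTTP error statuses.
        return False


def wait_until(probe, timeout_s=30.0, interval_s=0.5):
    """Call probe() repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


if __name__ == "__main__":
    if wait_until(ping_ok):
        print("Container is healthy; ping_test.py and post_test.py can be run now.")
    else:
        print("Container did not become healthy in time; check the container logs.")
```

Passing the probe as a function keeps the waiting logic independent of HTTP, so the same helper could wait for other conditions as well.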

Test Ping and Post

test ping and post

Run the commands: python ./ping_test.py and python ./post_test.py. If the results are as expected, the setup is successful.

Create Resources on AWS

After testing locally, we will begin deploying to AWS, starting with creating the necessary resources.

Create an ECR Repository

Before proceeding, we need to revert the Dockerfile by commenting out the lines between the two #------ markers again, so that the image we push does not contain the local test files.

#------------------------------------------------
# RUN mkdir -p /opt/ml/input/data/dataset
# RUN mkdir -p /opt/ml/input/data/model
# RUN mkdir -p /opt/ml/input/config
# RUN mkdir -p /opt/ml/output/data
# RUN mkdir -p /opt/ml/model

# COPY iris.csv /opt/ml/input/data/dataset/iris.csv
# COPY decision-tree-model.pkl /opt/ml/model/decision-tree-model.pkl
# COPY model.tar.gz /opt/ml/input/data/model/model.tar.gz
# COPY hyperparameters.json /opt/ml/input/config/hyperparameters.json
#------------------------------------------------

Step 1: Search for ECR

Step 2: Create a repository

create repository

Step 3: Enter the repository name

enter repository’s name

Successfully created:

repository created successfully

Step 4: Click on “View push commands”

click view push commands

Step 5: Execute the steps shown in “View push commands”

view push commands

Step 6: Successfully pushed the image

push image successfully

push image successfully

Create S3 Bucket

Step 1: Search for S3

Step 2: Click “Create bucket”

Step 3: Enter the bucket name

enter bucket name

Successfully created:

bucket created successfully

Step 4: Open the newly created bucket and click “Create folder”

click create folder

Step 5: Enter the name input and click “Create folder”

enter input folder

Step 6: Similarly, create a new folder named output and click “Create folder”

enter output folder

Step 7: Similarly, create a new folder named train-script and click “Create folder”

enter script folder

Step 8: Inside the input folder, create two additional folders: dataset and model

create dataset folder

Step 9: Open the dataset folder and click the Upload button

click upload

Step 10: Select the iris.csv file and upload it

upload training set

Step 11: Similarly, go to the model folder and upload the file model.tar.gz

upload model

Create policy and role

Create Policy

Granting Permission to Create a Training Job

Step 1: Search for the IAM service

Step 2: Navigate to the Policies tab and click Create policy

create policy

Step 3: In the Service section, select SageMaker, then switch to the JSON tab and paste the following snippet:

choose SageMaker service

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "logs:CreateLogGroup",
            "Resource": "arn:aws:logs:{region}:{account-id}:*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:{region}:{account-id}:log-group:/aws/sagemaker/TrainingJobs:*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::customized-sagemaker-image-decision-tree-bucket"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::customized-sagemaker-image-decision-tree-bucket/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": "ecr:GetAuthorizationToken",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ecr:BatchCheckLayerAvailability",
                "ecr:BatchGetImage",
                "ecr:GetDownloadUrlForLayer"
            ],
            "Resource": "arn:aws:ecr:{region}:{account-id}:repository/decision-tree"
        }
    ]
}

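Explanation:
This policy is attached to the execution role that the training job runs under (the job itself is created by the Lambda function later in this lesson). It allows SageMaker to perform the following actions:

Write logs to CloudWatch Logs
Retrieve objects (dataset, model) from S3
Upload objects (trained model) to S3
Pull images from ECR

Make sure to replace {region} and {account-id} with your own values. If your bucket or ECR repository has a different name, update those ARNs as well.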
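If you prefer not to edit the placeholders by hand, they can be filled in with a few lines of Python before pasting the result into the console. The sketch below is an illustration only: render_policy is a hypothetical helper, the template is shortened to a single statement from the policy above, and us-east-1 and 123456789012 are dummy values.

```python
import json
import re

# Shortened template: one statement copied from the policy above.
POLICY_TEMPLATE = """
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ecr:BatchCheckLayerAvailability",
                "ecr:BatchGetImage",
                "ecr:GetDownloadUrlForLayer"
            ],
            "Resource": "arn:aws:ecr:{region}:{account-id}:repository/decision-tree"
        }
    ]
}
"""


def render_policy(template, region, account_id):
    """Fill in {region} and {account-id} and return the policy as a dict."""
    text = template.replace("{region}", region).replace("{account-id}", account_id)

    # Any remaining {placeholder} means the template expects a value we did not supply.
    leftovers = re.findall(r"\{[a-z-]+\}", text)
    if leftovers:
        raise ValueError(f"Unfilled placeholders: {leftovers}")

    # json.loads raises an error if the result is not valid JSON.
    return json.loads(text)


if __name__ == "__main__":
    policy = render_policy(POLICY_TEMPLATE, region="us-east-1", account_id="123456789012")
    print(json.dumps(policy, indent=4))
```

Parsing the rendered text as JSON catches copy-and-paste mistakes, such as a missing comma, before the policy editor does.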

Step 4: Enter the policy name and click Create policy

enter policy name

Successfully created:

policy created successfully

Granting Permissions to Create an Endpoint
Following the same steps as above, create a policy named create-decision-tree-sagemaker-endpoint-policy with the following JSON content:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "logs:CreateLogGroup",
            "Resource": "arn:aws:logs:{region}:{account-id}:*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:{region}:{account-id}:log-group:/aws/sagemaker/Endpoints:*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::customized-sagemaker-image-decision-tree-bucket"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::customized-sagemaker-image-decision-tree-bucket/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": "ecr:GetAuthorizationToken",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ecr:BatchCheckLayerAvailability",
                "ecr:BatchGetImage",
                "ecr:GetDownloadUrlForLayer"
            ],
            "Resource": "arn:aws:ecr:{region}:{account-id}:repository/decision-tree"
        }
    ]
}

Create Role

Create a Role for Training Jobs

Step 1: Go to the Roles tab and select Create role

search IAM

Step 2: Select the SageMaker service and click Next

choose role service

Step 3: Click Next again

add permission

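Step 4: Enter the role name (the Lambda code later in this lesson refers to this role as create-decision-tree-sagemaker-training-job-role)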

enter role name

Ensure the Trust policy looks like this:

trust policy
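For reference, a trust policy that lets the SageMaker service assume a role generally has the shape printed by the sketch below. This is an assumption about what the screenshot shows, based on how service roles for SageMaker are normally defined; sagemaker_trust_policy is a hypothetical helper name.

```python
import json


def sagemaker_trust_policy():
    """Return a trust policy that allows the SageMaker service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The service principal for SageMaker
                "Principal": {"Service": "sagemaker.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(sagemaker_trust_policy(), indent=4))
```

The trust policy controls who may use the role, while the permissions policy attached in the following steps controls what the role may do.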

Click Create role

Step 5: Once created successfully, search for and click on the newly created role

Step 6: Select the default policy and click Remove

remove policy

Step 7: Click Add permissions, then Attach policy, and enter the name of the policy we created earlier

attach policy

Successfully attached:

policy attached successfully

Create a Role for the Endpoint

Follow the same steps above to create a role named create-decision-tree-sagemaker-endpoint-role, remove the default policy, and attach the policy create-decision-tree-sagemaker-endpoint-policy.

Create Lambda Functions

We will create Lambda functions to perform the following tasks: creating a training job, creating an inference endpoint, and invoking a prediction endpoint.

Note: For the Lambda functions created in this section, it is recommended to set the General configuration Timeout to more than one minute.
To do this, follow these steps:

Access the Lambda function.
Go to the Configuration tab.
Select General configuration.
Click Edit, increase the Timeout duration, and then Save.

search IAM

Create Lambda Function: Training Job Creation

In this section, we will create a Lambda function responsible for creating a training job.

Step 1: Search for the Lambda service

Step 2: Click the Create function button

create function

Step 3: Enter the function name create-training-job-decision-tree-function, select the runtime as Python 3.12, and click Create function

enter function name

Successfully created:

create success

Step 4: Go to the Configuration tab, select Permissions, and click on Role name

open function role

The following page will open:

role opened successfully

Step 5: Under Permissions policies, click the first policy. It opens in a new tab, where you click Edit:

edit function policy

Step 6: Paste the following content into the Policy editor

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Effect": "Allow",
			"Action": "logs:CreateLogGroup",
			"Resource": "arn:aws:logs:{region}:{account-id}:*"
		},
		{
			"Effect": "Allow",
			"Action": [
				"logs:CreateLogStream",
				"logs:PutLogEvents"
			],
			"Resource": [
				"arn:aws:logs:{region}:{account-id}:log-group:/aws/lambda/create-training-job-decision-tree-function:*"
			]
		},
		{
			"Sid": "Statement1",
			"Effect": "Allow",
			"Action": [
				"sagemaker:CreateTrainingJob"
			],
			"Resource": [
				"arn:aws:sagemaker:{region}:{account-id}:training-job/*"
			]
		},
		{
			"Effect": "Allow",
			"Action": "iam:PassRole",
			"Resource": "arn:aws:iam::{account-id}:role/create-decision-tree-sagemaker-training-job-role"
		}
	]
}

Explanation: In this Lambda function, we have basic permissions such as writing logs to CloudWatch. Additionally, we need the CreateTrainingJob permission to create a training job and PassRole because we will pass a role into the create_training_job method. This role is the one we previously created to execute the training job.

Step 7: Click Next

review and save

Step 8: Click Save changes

Successfully updated:

update policy success

Step 9: Go back to the create-training-job-decision-tree-function function interface

return to function

Step 10: Switch to the Code tab and paste:

import json
import os
import boto3
from datetime import datetime

def lambda_handler(event, context):
    
    sagemaker = boto3.client('sagemaker')
    
    # Define the training job settings

    estimator = {
        'training_job_name': 'decision-tree-000',
        'image_uri': '{account-id}.dkr.ecr.{region}.amazonaws.com/decision-tree:latest',
        'role': 'arn:aws:iam::{account-id}:role/create-decision-tree-sagemaker-training-job-role',
        'instance_count': 1,
        'instance_type': 'ml.m5.large',
        'input_data_path': 's3://customized-sagemaker-image-decision-tree-bucket/input/dataset/',
        'input_model_path': 's3://customized-sagemaker-image-decision-tree-bucket/input/model/',
        'output_path': 's3://customized-sagemaker-image-decision-tree-bucket/output',
        'volumn_size_in_GB': 30,
        'max_runtime_in_second': 3600
    }
    
    print(estimator)
    
    # Start the training job
    try:
        training_job_name = f"{estimator['training_job_name']}"
        response = sagemaker.create_training_job(
            TrainingJobName=training_job_name,
            AlgorithmSpecification={
                'TrainingImage': estimator['image_uri'],
                'TrainingInputMode': 'File'
            },
            RoleArn=estimator['role'],
            InputDataConfig=[
                {
                    'ChannelName': 'dataset',
                    'DataSource': {
                        'S3DataSource': {
                            'S3DataType': 'S3Prefix',
                            'S3Uri': estimator['input_data_path'],
                            'S3DataDistributionType': 'ShardedByS3Key'
                        }
                    }
                },
                {
                    'ChannelName': 'model',
                    'DataSource': {
                        'S3DataSource': {
                            'S3DataType': 'S3Prefix',
                            'S3Uri': estimator['input_model_path'],
                            'S3DataDistributionType': 'FullyReplicated'
                        }
                    }
                },
            ],
            OutputDataConfig={
                'S3OutputPath': estimator['output_path']
            },
            ResourceConfig={
                'InstanceType': estimator['instance_type'],
                'InstanceCount': estimator['instance_count'],
                'VolumeSizeInGB': estimator['volumn_size_in_GB']
            },
            StoppingCondition={
                'MaxRuntimeInSeconds': estimator['max_runtime_in_second']
            }
        )
        print(response)
        
        return {
            'status': 'InProgress'
        }
    
    except Exception as e:
        print(e)
        return {
            'status': json.dumps({'error': str(e)})
        }

Explanation of the create_training_job function:

TrainingJobName=training_job_name: Sets the name for the training job.

AlgorithmSpecification={
                'TrainingImage': estimator['image_uri'],
                'TrainingInputMode': 'File'
            }

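Passes the image URI, which points to the image we pushed to ECR. TrainingInputMode 'File' means SageMaker downloads the input data into the container before training starts.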

RoleArn=estimator['role']: The IAM role that SageMaker assumes to run the training job, for example to read the input data from S3 and pull the image from ECR.

InputDataConfig=[
                {
                    'ChannelName': 'dataset',
                    'DataSource': {
                        'S3DataSource': {
                            'S3DataType': 'S3Prefix',
                            'S3Uri': estimator['input_data_path'],
                            'S3DataDistributionType': 'ShardedByS3Key'
                        }
                    }
                },
                {
                    'ChannelName': 'model',
                    'DataSource': {
                        'S3DataSource': {
                            'S3DataType': 'S3Prefix',
                            'S3Uri': estimator['input_model_path'],
                            'S3DataDistributionType': 'FullyReplicated'
                        }
                    }
                },
            ]

Pass in two channels and name them dataset and model (these names correspond to the paths in the image: /opt/ml/input/data/dataset and /opt/ml/input/data/model).
For each S3Uri, all files with the specified prefix will be copied and mapped to the corresponding channel in the image.
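To make the mapping concrete, here is a small sketch that builds an equivalent InputDataConfig from a dictionary and prints the directory where each channel appears inside the container. The helper names channel_dir and build_input_data_config are illustrative assumptions; the bucket paths are the ones used in this lesson. For simplicity the sketch uses FullyReplicated for both channels, whereas the lesson's code uses ShardedByS3Key for dataset; with a single training instance the two behave the same.

```python
# Directory under which SageMaker places each input channel in File mode
INPUT_ROOT = "/opt/ml/input/data"

def channel_dir(channel_name: str) -> str:
    """Path inside the training container where a channel's files appear."""
    return f"{INPUT_ROOT}/{channel_name}"

def build_input_data_config(channels: dict) -> list:
    """Turn {channel name: S3 prefix} into the InputDataConfig list."""
    return [
        {
            "ChannelName": name,
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3_uri,
                    # Every training instance receives a full copy of the data
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
        }
        for name, s3_uri in channels.items()
    ]

bucket = "s3://customized-sagemaker-image-decision-tree-bucket"
config = build_input_data_config({
    "dataset": f"{bucket}/input/dataset/",
    "model": f"{bucket}/input/model/",
})
for channel in config:
    print(channel["ChannelName"], "->", channel_dir(channel["ChannelName"]))
```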

OutputDataConfig={
                'S3OutputPath': estimator['output_path']
            }

Defines the S3 location where the training outputs are stored. These are the contents of /opt/ml/model and /opt/ml/output/data in the container. Note that SageMaker compresses them into .tar.gz archives before uploading.
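As a rough sketch of where to look afterwards: SageMaker normally writes the archives into a folder named after the training job, inside S3OutputPath. The helper below is hypothetical and assumes that standard layout (S3OutputPath, then the job name, then output/model.tar.gz), so verify the path in your own bucket.

```python
def expected_model_artifact_uri(output_path: str, training_job_name: str) -> str:
    """Assumed location of the model archive for a finished training job."""
    # rstrip avoids a double slash when output_path ends with "/"
    return f"{output_path.rstrip('/')}/{training_job_name}/output/model.tar.gz"

uri = expected_model_artifact_uri(
    "s3://customized-sagemaker-image-decision-tree-bucket/output",
    "decision-tree-000",
)
print(uri)
```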

ResourceConfig={
                'InstanceType': estimator['instance_type'],
                'InstanceCount': estimator['instance_count'],
                'VolumeSizeInGB': estimator['volumn_size_in_GB']
            },
            StoppingCondition={
                'MaxRuntimeInSeconds': estimator['max_runtime_in_second']
            }

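Defines the instance that runs the training job (instance type, instance count, and storage volume size in GB), along with the maximum runtime in seconds after which SageMaker stops the job.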
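One practical point about TrainingJobName: SageMaker requires training job names to be unique within a Region and account, so invoking the function a second time with the same fixed name fails. A minimal sketch of one way around this, assuming a timestamp suffix is acceptable (the helper make_training_job_name is hypothetical and not part of the lesson's code):

```python
from datetime import datetime, timezone

def make_training_job_name(prefix: str, now: datetime) -> str:
    """Append a timestamp so that each invocation gets a distinct name."""
    name = f"{prefix}-{now.strftime('%Y%m%d-%H%M%S')}"
    # SageMaker limits job names to 63 characters (letters, digits, hyphens)
    if len(name) > 63:
        raise ValueError(f"training job name too long: {name}")
    return name

# Inside lambda_handler this could replace the fixed name:
job_name = make_training_job_name("decision-tree", datetime.now(timezone.utc))
print(job_name)
```

If you adopt this, the job appears in the console under the generated name rather than the fixed one.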

Create Lambda Function: Endpoint Creation

In this section, we create a second Lambda function, this time for creating an endpoint. The steps are the same as in section 5.4.1:

Create a Lambda function named create-decision-tree-endpoint-function.
Attach the following policy to this function’s role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "logs:CreateLogGroup",
            "Resource": "arn:aws:logs:{region}:{account-id}:*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:{region}:{account-id}:log-group:/aws/lambda/create-decision-tree-endpoint-function:*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::{account-id}:role/create-decision-tree-sagemaker-endpoint-role"
        },
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateModel",
                "sagemaker:DescribeModel",
                "sagemaker:ListModels",
                "sagemaker:CreateEndpointConfig",
                "sagemaker:DescribeEndpointConfig",
                "sagemaker:ListEndpointConfigs",
                "sagemaker:CreateEndpoint",
                "sagemaker:DescribeEndpoint",
                "sagemaker:ListEndpoints"
            ],
            "Resource": [
                "arn:aws:sagemaker:{region}:{account-id}:endpoint/*",
                "arn:aws:sagemaker:{region}:{account-id}:endpoint-config/*",
                "arn:aws:sagemaker:{region}:{account-id}:model/*"
            ]
        }
    ]
}

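Explanation: To create an endpoint, we first need to create a model and an endpoint configuration, so the policy grants permissions on all three resource types. PassRole is needed because we pass the execution role to the create_model method.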

The code content is as follows:

import json
import os
import boto3

def lambda_handler(event, context):
    # Create the model, the endpoint configuration, and then the endpoint
    sagemaker = boto3.client('sagemaker')
    try:
        name = 'decision-tree-endpoint-000'
        image = '{account-id}.dkr.ecr.{region}.amazonaws.com/decision-tree:latest'
        model_data_url = 's3://customized-sagemaker-image-decision-tree-bucket/input/model/model.tar.gz'
        execution_role_arn = 'arn:aws:iam::{account-id}:role/create-decision-tree-sagemaker-endpoint-role'
        instance_count = 1
        instance_type = 'ml.m5.large'
        
        model_name = name
        endpoint_config_name = name
        inference_endpoint_name = name
        
        
        
        response = sagemaker.create_model(
            ModelName=model_name,
            PrimaryContainer={
                'Image': image,
                'ModelDataUrl': model_data_url
            },
            ExecutionRoleArn=execution_role_arn,
        )
        
        response = sagemaker.create_endpoint_config(
            EndpointConfigName=endpoint_config_name,
            ProductionVariants=[
                {
                    'VariantName': 'AllTraffic',
                    'ModelName': model_name,
                    'InitialInstanceCount': instance_count,
                    'InstanceType': instance_type,
                },
            ],
        )
        
        response = sagemaker.create_endpoint(
            EndpointName=inference_endpoint_name,
            EndpointConfigName=endpoint_config_name,
        )
        
        print('[INFO] CREATED SUCCESSFULLY!')
        print(response)
        
        return {
            'status': 'Success',
        }
    
    except Exception as e:
        print('[ERROR] FAILED TO CREATE ENDPOINT!')
        print(e)
        return {
            'status': 'Error'
        }

Create Lambda Function: Invoking Prediction to Endpoint

In this section, we will create a Lambda function with the functionality of sending a request to invoke a prediction at the endpoint.

The steps are the same as for the previous Lambda functions:

Create a Lambda function named test.
In the permissions section, this function only needs the following policy:

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Effect": "Allow",
			"Action": "logs:CreateLogGroup",
			"Resource": "arn:aws:logs:{region}:{account-id}:*"
		},
		{
			"Effect": "Allow",
			"Action": [
				"logs:CreateLogStream",
				"logs:PutLogEvents"
			],
			"Resource": [
				"arn:aws:logs:{region}:{account-id}:log-group:/aws/lambda/test:*"
			]
		},
		{
			"Sid": "VisualEditor0",
			"Effect": "Allow",
			"Action": "sagemaker:InvokeEndpoint",
			"Resource": [
			    "arn:aws:sagemaker:{region}:{account-id}:endpoint/*"
			    ]
		}
	]
}

Code:

import json
import boto3

# Client for the SageMaker runtime API, which serves inference requests
runtime = boto3.client('sagemaker-runtime')

def lambda_handler(event, context):
    # Send one CSV row to the endpoint and read back the prediction
    response = runtime.invoke_endpoint(EndpointName='decision-tree-endpoint-000',
                                       ContentType='text/csv',
                                       Body='1,2,3,4')
    response_body = response['Body'].read().decode('utf-8')
    print(response_body)
    preds = {"Prediction": response_body}
    print(preds)
    response_dict = {
        "statusCode": 200,
        "body": json.dumps(preds)
    }
    return response_dict

After that, click Deploy, then click Test and confirm that the function runs successfully.
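The request body above is a hard-coded CSV row. As a small sketch of how that row could be built from a list of feature values instead (to_csv_row is a hypothetical helper, and the four values are placeholders just like the '1,2,3,4' in the lesson's code):

```python
import csv
import io

def to_csv_row(features) -> str:
    """Serialize one sample as a single CSV line without a trailing newline."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="").writerow(features)
    return buffer.getvalue()

body = to_csv_row([1, 2, 3, 4])
print(body)
```

The resulting string can be passed as Body to invoke_endpoint together with ContentType='text/csv'.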

Test

In this section, we will test whether the components we have created are functioning correctly.

Test Training Job Creation

Testing the training job creation function from the Lambda function

First, navigate to the Lambda function create-training-job-decision-tree-function.

Click Deploy, create a test event, then click Test. If successful, you will see:

search IAM

Check the process by searching for the Amazon SageMaker service.

In the Training section, select Training jobs.

search IAM

You will see decision-tree-000 in the InProgress state, meaning that training is under way. Wait for it to transition to Completed.
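If you prefer to check the status from code rather than the console, the following sketch polls describe_training_job until the job reaches a final status. It assumes it is run with credentials that allow sagemaker:DescribeTrainingJob (the Lambda policy in this lesson does not grant it). The function name wait_for_training_job and its parameters are illustrative.

```python
import time

# Statuses after which a training job no longer changes
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}

def wait_for_training_job(client, job_name, poll_seconds=30, max_polls=120):
    """Poll until the job reaches a terminal status and return that status."""
    for _ in range(max_polls):
        status = client.describe_training_job(
            TrainingJobName=job_name)["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_name} did not finish after {max_polls} polls")

# Usage (requires AWS credentials, so it is left commented out here):
# import boto3
# print(wait_for_training_job(boto3.client("sagemaker"), "decision-tree-000"))
```

The client is passed in as a parameter so the function can be exercised with a stub object before pointing it at a real account.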

search IAM

Click on decision-tree-000, and you will see:

search IAM

That means the process was successful!

Now, let’s check the output.

Open S3, go to the bucket customized-sagemaker-image-decision-tree-bucket, and navigate to the output folder.

search IAM

You will see the output for decision-tree-000 was successfully created.

Click inside, and you will find the model.tar.gz file (which contains the trained model as defined in our code).

search IAM

Test Endpoint Creation

First, navigate to the Lambda function named create-decision-tree-endpoint-function.

In the Code section, click Deploy, create a test event, and then click Test.

If the endpoint creation is successful, you will see:

search IAM

Now, go to the Amazon SageMaker service and search for Inference → Endpoints:

View the list of endpoints:

search IAM

Click on the endpoint, and you will see that it is in the Creating state.

search IAM

After a while, the endpoint status will change to InService, indicating success. Now, we can proceed to test the function that invokes predictions on this endpoint.

search IAM

Test Invoke Prediction Function

After creating the endpoint, in this section, we will test the function that invokes predictions on that endpoint.

First, navigate to the Lambda function named test, click Deploy, create a test event, and then click Test. If successful, you will see:

search IAM

Bonus

By this point, we have successfully created an image for training and prediction. However, there is a challenge: if we want to modify the code in the train file, we would need to rebuild the image and push it to ECR. This process is extremely time-consuming, especially in real-world scenarios where images and models can be quite large.

To address this, we can implement a few modifications to separate the image from the execution file (specifically, the train file).

You might recall that in requirements.txt, we included awscli. The purpose of installing awscli is to allow interaction with S3 resources that store the train file or similar resources.

Thus, we can create a file named train.py with the same content as the original train file, except for removing the first line #!/usr/bin/env python. This file is then uploaded to the customized-sagemaker-image-decision-tree-bucket, inside the train-script/ folder.

search IAM

Next, we update the Code in the Lambda function create-training-job-decision-tree-function, specifically in the create_training_job section.

AlgorithmSpecification={
                'TrainingImage': estimator['image_uri'],
                'TrainingInputMode': 'File',
                'ContainerEntrypoint': [
                    '/bin/bash',
                    '-c',
                    'aws s3 cp {}/{} /opt/ml/code/{} && python /opt/ml/code/{}'.format(
                        's3://customized-sagemaker-image-decision-tree-bucket/train-script',
                        'train.py',
                        'train.py',
                        'train.py'
                        )
                ]
            }

With this change, ContainerEntrypoint overrides the image's default train executable when the training job starts: the container first copies train.py from S3 and then runs it.
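The entrypoint command is easier to read and reuse if it is built by a small helper. This is only a sketch: build_entrypoint is a hypothetical name, and it assumes, as in the snippet above, that the image provides bash, the AWS CLI, and python.

```python
def build_entrypoint(script_s3_prefix: str, script_name: str,
                     code_dir: str = "/opt/ml/code") -> list:
    """Return a ContainerEntrypoint that downloads a script from S3 and runs it."""
    local_path = f"{code_dir}/{script_name}"
    command = (
        f"aws s3 cp {script_s3_prefix}/{script_name} {local_path}"
        f" && python {local_path}"
    )
    # bash -c runs both steps; "&&" skips python if the download fails
    return ["/bin/bash", "-c", command]

entrypoint = build_entrypoint(
    "s3://customized-sagemaker-image-decision-tree-bucket/train-script",
    "train.py",
)
print(entrypoint[2])
```

The returned list can be assigned to 'ContainerEntrypoint' in AlgorithmSpecification. Since the download runs inside the training container, the training job's execution role presumably also needs read access to the train-script/ prefix.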

Clear Resources

After completing the lab, we will delete the resources.

S3
Go to S3, find the bucket named customized-sagemaker-image-decision-tree-bucket, and click Delete.
ECR
Go to ECR, find the repository named decision-tree, and click Delete.
SageMaker
Go to SageMaker, navigate to Inference → Endpoints. If any endpoint is in the InService state, select it, click Actions, then Delete.
The remaining resources, such as Lambda functions, IAM roles, and policies, do not incur charges, so we can keep them for reference.
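The SageMaker part of the cleanup can also be scripted. The sketch below assumes that the model, endpoint configuration, and endpoint all share one name, as in this lesson's code, and that the caller has the corresponding delete permissions (the Lambda policies in this lesson do not grant them). The function name delete_inference_resources is illustrative.

```python
def delete_inference_resources(client, name: str) -> None:
    """Delete the endpoint first, then the config and model it was built from."""
    client.delete_endpoint(EndpointName=name)
    client.delete_endpoint_config(EndpointConfigName=name)
    client.delete_model(ModelName=name)

# Usage (requires AWS credentials, so it is left commented out here):
# import boto3
# delete_inference_resources(boto3.client("sagemaker"), "decision-tree-endpoint-000")
```

Deleting the endpoint is the step that stops the hosting instance; removing the endpoint configuration and model afterwards simply keeps the account tidy.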