Oven logo

Oven

SageMaker Studio

SageMaker Studio is an open source library for interacting with Amazon SageMaker Unified Studio resources. With the library, you can access these resources such as domains, projects, connections, and databases, all in one place with minimal code.

Table of Contents

  1. Installation
  2. Usage
    1. Setting Up Credentials and ClientConfig
      1. Using ClientConfig
    2. Domain 3. Domain Properties
    3. Project
      1. Properties
        1. IAM Role ARN
        2. KMS Key ARN
        3. MLflow Tracking Server ARN
        4. S3 Path
      2. Connections
        1. Connection Data
        2. Secrets
      3. Catalogs
      4. Databases and Tables
        1. Databases
        2. Tables
    4. Execution APIs
      1. Local Execution APIs
        1. StartExecution API
        2. GetExecution API
        3. ListExecutions API
        4. StopExecution API
      2. Remote Execution APIs
        1. StartExecution API
        2. GetExecution API
        3. ListExecutions API
        4. StopExecution API

1) Installation

The SageMaker Studio is built to PyPI, and the latest version of the library can be installed using the following command:

pip install sagemaker-studio

Supported Python Versions

SageMaker Studio supports Python versions 3.9 and newer.

Licensing

SageMaker Studio is licensed under the Apache 2.0 License. It is copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. The license is available at: http://aws.amazon.com/apache2.0/

2) Usage

Setting Up Credentials and ClientConfig

If SageMaker Studio is being used within Amazon SageMaker Unified Studio JupyterLab, the library will automatically pull your latest credentials from the environment.

If you are using the library elsewhere, or if you want to use different credentials within the SageMaker Unified Studio JupyterLab, you will need to first retrieve your SageMaker Unified Studio credentials and make them available in the environment through either:

  1. Storing them within an AWS named profile. If using a profile name other than default, you will need to supply the profile name by:
    1. Supplying it during initialization of the SageMaker Studio ClientConfig object
    2. Setting the AWS profile name as an environment variable (e.g. export AWS_PROFILE="my_profile_name")
  2. Initializing a boto3 Session object and supplying it when initializing a SageMaker Studio ClientConfig object
AWS Named Profile

To use the AWS named profile, you can update your AWS config file with your profile name and any other settings you would like to use:

[my_profile_name]
region = us-east-1

Your credentials file should have the credentials stored for your profile:

[my_profile_name]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
aws_session_token=IQoJb3JpZ2luX2IQoJb3JpZ2luX2IQoJb3JpZ2luX2IQoJb3JpZ2luX2IQoJb3JpZVERYLONGSTRINGEXAMPLE

Finally, you can pass in the profile when initializing the ClientConfig object.

from sagemaker_studio import ClientConfig

conf = ClientConfig(profile_name="my_profile_name")

You can also set the profile name as an environment variable:

export AWS_PROFILE="my_profile_name"
Boto3 Session

To use a boto3 Session object for credentials, you will need to initialize the Session and supply it to ClientConfig.

from boto3 import Session
from sagemaker_studio import ClientConfig

my_session = Session(...)
conf = ClientConfig(session=my_session)

Using ClientConfig

If using ClientConfig for supplying credentials or changing the AWS region name, the ClientConfig object will need to be supplied when initializing any further SageMaker Studio objects, such as Domain or Project. If using non prod endpoint for an AWS service, it can also be supplied in the ClientConfig. Note: In sagemaker space, datazone endpoint is by default fetched from the metadata json file.

from sagemaker_studio import ClientConfig, Project

conf = ClientConfig(region="eu-west-1")
proj = Project(config=conf)

Domain

Domain can be initialized as follows.

from sagemaker_studio import Domain

dom = Domain()

If you are not using the SageMaker Studio within SageMaker Unified Studio Jupyter Lab, you will need to provide the ID of the domain you want to use.

dom = Domain(id="123456")

Domain Properties

A Domain object has several string properties that can provide information about the domain that you are using.

dom.id
dom.root_domain_unit_id
dom.name
dom.domain_execution_role
dom.status
dom.portal_url

Project

Project can be initialized as follows.

from sagemaker_studio import Project

proj = Project()

If you are not using the SageMaker Studio within the SageMaker Unified Studio Jupyter Lab, you will need to provide either the ID or name of the project you would like to use and the domain ID of the project.

proj = Project(name="my_proj_name", domain_id="123456")

Project Properties

A Project object has several string properties that can provide information about the project that you are using.

proj.id
proj.name
proj.domain_id,
proj.project_status,
proj.domain_unit_id,
proj.project_profile_id
proj.user_id
IAM Role ARN

To retrieve the project IAM role ARN, you can retrieve the iam_role field. This gets the IAM role ARN of the default IAM connection within your project.

proj.iam_role
KMS Key ARN

If you are using a KMS key within your project, you can retrieve the kms_key_arn field.

proj.kms_key_arn

MLflow Tracking Server ARN

If you are using an MLflow tracking server within your project, you can retrieve the mlflow_tracking_server_arn field.

Usage

proj.mlflow_tracking_server_arn
S3 Path

One of the properties of a Project is s3. You can access various S3 paths that exist within your project.

# S3 path of project root directory
proj.s3.root

# S3 path of datalake consumer Glue DB directory (requires DataLake environment)
proj.s3.datalake_consumer_glue_db

# S3 path of Athena workgroup directory (requires DataLake environment)
proj.s3.datalake_athena_workgroup

# S3 path of workflows output directory (requires Workflows environment)
proj.s3.workflow_output_directory

# S3 path of workflows temp storage directory (requires Workflows environment)
proj.s3.workflow_temp_storage

# S3 path of EMR EC2 log destination directory (requires EMR EC2 environment)
proj.s3.emr_ec2_log_destination

# S3 path of EMR EC2 log bootstrap directory (requires EMR EC2 environment)
proj.s3.emr_ec2_certificates

# S3 path of EMR EC2 log bootstrap directory (requires EMR EC2 environment)
proj.s3.emr_ec2_log_bootstrap
Other Environment S3 Paths

You can also access the S3 path of a different environment by providing an environment ID.

proj.s3.environment_path(environment_id="env_1234")

Connections

You can retrieve a list of connections for a project, or you can retrieve a single connection by providing its name.

proj_connections: List[Connection] = proj.connections
proj_redshift_conn = proj.connection("my_redshift_connection_name")

Each Connection object has several properties that can provide information about the connection.

proj_redshift_conn.name
proj_redshift_conn.id
proj_redshift_conn.physical_endpoints[0].host
proj_redshift_conn.iam_role
Connection Data

To retrieve all properties of a Connection, you can access the data field to get a ConnectionData object. ConnectionData fields can be accessed using the dot notation (e.g. conn_data.top_level_field). For retrieving further nested data within ConnectionData, you can access it as a dictionary. (e.g. conn_data.top_level_field["nested_field"]).

conn_data: ConnectionData = proj_redshift_conn.data
red_temp_dir = conn_data.redshiftTempDir
lineage_sync = conn_data.lineageSync
lineage_job_id = lineage_sync["lineageJobId"]
spark_conn = proj.connection("my_spark_glue_connection_name")
id = spark_conn.id
env_id = spark_conn.environment_id
glue_conn = spark_conn.data.glue_connection_name
workers = spark_conn.data.number_of_workers
glue_version = spark_conn.data.glue_version

Catalogs

If your Connection is of the LAKEHOUSE or IAM type, you can retrieve a list of catalogs, or a single catalog by providing its id.

conn_catalogs: List[Catalog] = proj.connection("project.iam").catalogs
my_catalog: Catalog = proj.connection("project.iam").catalog("1234567890:catalog1/sub_catalog")
proj.connection("project.default_lakehouse").catalogs

Each Catalog object has several properties that can provide information about the catalog.

my_catalog.name
my_catalog.id
my_catalog.type
my_catalog.spark_catalog_name
my_catalog.resource_arn

Secrets

Retrieve the secret (username, password, other connection related metadata) for the connection using this property.

snowflake_connection: Connection = proj.connection("project.snowflake")
secret = snowflake_connection.secret

Secret can be a dictionary containing credentials, or a single string depending on the connection type.

Databases and Tables

Databases

Within a catalog, you can retrieve a list of databases, or a single database by providing its name.

my_catalog: Catalog

catalog_dbs: List[Database] = my_catalog.databases
my_db: Database = my_catalog.database("my_db")

Each Database object has several properties that can provide information about the database.

my_db.name
my_db.catalog_id
my_db.location_uri
my_db.project_id
my_db.domain_id
Tables

You can also retrieve either a list of tables or a specific table within a Database.

my_db_tables: List[Table] = my_db.tables
my_table: Table = my_db.table("my_table")

Each Table object has several properties that can provide information about the table.

my_table.name
my_table.database_name
my_table.catalog_id
my_table.location

You can also retrieve a list of the columns within a table. Column contains the column name and the data type of the column.

my_table_columns: List[Column] = my_table.columns
col_0: Column = my_table_columns[0]
col_0.name
col_0.type

Execution APIs

Execution APIs provide you the ability to start an execution to run a notebook headlessly either within the same user space or on remote compute.

Local Execution APIs

Use these APIs to start/stop/get/list executions within the user's space.

StartExecution

You can start a notebook execution headlessly within the same user space.

from sagemaker_studio.sagemaker_studio_api import SageMakerStudioAPI
from sagemaker_studio import ClientConfig

config = ClientConfig(overrides={
            "execution": {
                "local": True,
            }
        })
sagemaker_studio_api = SageMakerStudioAPI(config)

result = sagemaker_studio_api.execution_client.start_execution(
    execution_name="my-execution",
    input_config={"notebook_config": {
        "input_path": "src/folder2/test.ipynb"}},
    execution_type="NOTEBOOK",
    output_config={"notebook_config": {
        "output_formats": ["NOTEBOOK", "HTML"]
    }}
)
print(result)
GetExecution

You can retrieve details about a local execution using the GetExecution API.

from sagemaker_studio.sagemaker_studio_api import SageMakerStudioAPI
from sagemaker_studio import ClientConfig

config = ClientConfig(region="us-west-2", overrides={
            "execution": {
                "local": True,
            }
        })
sagemaker_studio_api = SageMakerStudioAPI(config)

get_response = sagemaker_studio_api.execution_client.get_execution(execution_id="asdf-3b998be2-02dd-42af-8802-593d48d04daa")
print(get_response)
ListExecutions

You can use ListExecutions API to list all the executions that ran in the user's space.

from sagemaker_studio.sagemaker_studio_api import SageMakerStudioAPI
from sagemaker_studio import ClientConfig

config = ClientConfig(region="us-west-2", overrides={
            "execution": {
                "local": True,
            }
        })
sagemaker_studio_api = SageMakerStudioAPI(config)

list_executions_response = sagemaker_studio_api.execution_client.list_executions(status="COMPLETED")
print(list_executions_response)
StopExecution

You can use StopExecution API to stop an execution that's running in the user space.

from sagemaker_studio.sagemaker_studio_api import SageMakerStudioAPI
from sagemaker_studio import ClientConfig

config = ClientConfig(region="us-west-2", overrides={
            "execution": {
                "local": True,
            }
        })
sagemaker_studio_api = SageMakerStudioAPI(config)

stop_response = sagemaker_studio_api.execution_client.stop_execution(execution_id="asdf-3b998be2-02dd-42af-8802-593d48d04daa")
print(stop_response)

Remote Execution APIs

Use these APIs to start/stop/get/list executions running on remote compute.

StartExecution

You can start a notebook execution headlessly on a remote compute specified in the StartExecution request.

from sagemaker_studio.sagemaker_studio_api import SageMakerStudioAPI
from sagemaker_studio import ClientConfig

config = ClientConfig(region="us-west-2")
sagemaker_studio_api = SageMakerStudioAPI(config)

result = sagemaker_studio_api.execution_client.start_execution(
    execution_name="my-execution",
    execution_type="NOTEBOOK",
    input_config={"notebook_config": {"input_path": "src/folder2/test.ipynb"}},
    output_config={"notebook_config": {"output_formats": ["NOTEBOOK", "HTML"]}},
    termination_condition={"max_runtime_in_seconds": 9000},
    compute={
        "instance_type": "ml.c5.xlarge",
        "image_details": {
            # provide either ecr_uri or (image_name and image_version)
            "image_name": "sagemaker-distribution-embargoed-prod",
            "image_version": "2.2",
            "ecr_uri": "ECR-registry-account.dkr.ecr.us-west-2.amazonaws.com/repository-name[:tag]",
        }
    }
)
print(result)
GetExecution

You can retrieve details about an execution running on remote compute using the GetExecution API.

from sagemaker_studio.sagemaker_studio_api import SageMakerStudioAPI
from sagemaker_studio import ClientConfig

config = ClientConfig(region="us-west-2")
sagemaker_studio_api = SageMakerStudioAPI(config)

get_response = sagemaker_studio_api.execution_client.get_execution(execution_id="asdf-3b998be2-02dd-42af-8802-593d48d04daa")
print(get_response)
ListExecutions

You can use ListExecutions API to list all the headless executions that ran on remote compute.

from sagemaker_studio.sagemaker_studio_api import SageMakerStudioAPI
from sagemaker_studio import ClientConfig

config = ClientConfig(region="us-west-2")
sagemaker_studio_api = SageMakerStudioAPI(config)

list_executions_response = sagemaker_studio_api.execution_client.list_executions(status="COMPLETED")
print(list_executions_response)
StopExecution

You can use StopExecution API to stop an execution that's running on remote compute.

from sagemaker_studio.sagemaker_studio_api import SageMakerStudioAPI
from sagemaker_studio import ClientConfig

config = ClientConfig(region="us-west-2")
sagemaker_studio_api = SageMakerStudioAPI(config)

stop_response = sagemaker_studio_api.execution_client.stop_execution(execution_id="asdf-3b998be2-02dd-42af-8802-593d48d04daa")
print(stop_response)