Model Hosting Container Standards - Python
A standardized Python framework for seamless integration between ML frameworks (TensorRT-LLM, vLLM) and Amazon SageMaker hosting.
Overview
This package simplifies model deployment by providing:
- Unified Handler System: Consistent /ping and /invocations endpoints across frameworks
- Flexible Configuration: Environment variables, decorators, or custom scripts
- Framework Agnostic: Works with vLLM, TensorRT-LLM, and other ML frameworks
- Production Ready: Comprehensive logging, error handling, and debugging tools
Quick Start
# Install
pip install model-hosting-container-standards
# Framework integration (e.g., in vLLM server code)
import model_hosting_container_standards.sagemaker as sagemaker_standards
from fastapi import Request, Response
import json
@sagemaker_standards.register_ping_handler
async def ping(raw_request: Request) -> Response:
"""Ping check. Endpoint required for SageMaker"""
return Response(
content='{"status": "healthy", "source": "vllm_default"}',
media_type="application/json",
)
@sagemaker_standards.register_invocation_handler
@sagemaker_standards.inject_adapter_id("model")
async def invocations(raw_request: Request) -> Response:
"""Model invocations endpoint with LoRA adapter injection"""
body_bytes = await raw_request.body()
body = json.loads(body_bytes.decode()) if body_bytes else {}
# Adapter ID injected by decorator from SageMakerLoRAApiHeader
adapter_id = body.get("model", "base-model")
# Your model inference logic here
response_data = {
"predictions": ["Generated text response"],
"adapter_id": adapter_id,
}
return Response(
content=json.dumps(response_data),
media_type="application/json",
)
# Customer customization (in model.py)
@sagemaker_standards.custom_ping_handler
async def custom_ping(raw_request: Request):
return Response(status_code=200, content="Custom OK")
# Or simple functions (automatically discovered by name)
async def custom_sagemaker_ping_handler():
    return {"status": "healthy"}
Installation
# Install from PyPI
pip install model-hosting-container-standards
# Install with Poetry (development)
poetry install
# Build wheel for distribution
poetry build
Requirements: Python >= 3.10, FastAPI >= 0.117.1
Usage Patterns
1. Framework Integration
For framework developers (e.g., vLLM, TensorRT-LLM), use register decorators to automatically set up routes:
import model_hosting_container_standards.sagemaker as sagemaker_standards
from fastapi import Request, Response
# Register decorators - automatically create /ping and /invocations routes
@sagemaker_standards.register_ping_handler
async def ping(request: Request) -> Response:
"""Framework ping handler with automatic routing."""
return Response(status_code=200, content="OK")
@sagemaker_standards.register_invocation_handler
async def invocations(request: Request) -> dict:
"""Framework invocation handler with automatic routing."""
body = await request.json()
# Process your model inference here
return {"result": "processed"}
# Optional: Add LoRA adapter support
@sagemaker_standards.register_invocation_handler
@sagemaker_standards.inject_adapter_id("model") # Replace mode
async def invocations_with_lora(request: Request) -> dict:
"""Invocation handler with LoRA adapter ID injection."""
body = await request.json()
adapter_id = body.get("model", "base-model") # Injected from header
# Use adapter_id for model inference
return {"result": f"processed with {adapter_id}"}
2. Customer Script Customization
For customers customizing model behavior, put this in your model artifact folder as model.py:
import model_hosting_container_standards.sagemaker as sagemaker_standards
from fastapi import Request
from fastapi.responses import Response
# Override decorators - immediately register handlers
@sagemaker_standards.custom_ping_handler
async def custom_ping(request: Request) -> Response:
"""Custom ping handler."""
return Response(status_code=200, content="OK")
@sagemaker_standards.custom_invocation_handler
async def custom_invoke(request: Request) -> dict:
"""Custom invocation handler."""
body = await request.json()
# Process your model inference here
return {"result": "processed"}
# Or use simple functions (automatically discovered)
async def custom_sagemaker_ping_handler():
"""Simple ping function - automatically discovered."""
return {"status": "healthy"}
async def custom_sagemaker_invocation_handler(request: Request):
"""Simple invoke function - automatically discovered."""
body = await request.json()
return {"result": "processed"}
3. Environment Variable Configuration
# Point to custom handlers in your code
export CUSTOM_FASTAPI_PING_HANDLER="model.py:my_ping_function"
export CUSTOM_FASTAPI_INVOCATION_HANDLER="model.py:my_invoke_function"
# Or use absolute paths
export CUSTOM_FASTAPI_PING_HANDLER="/opt/ml/model/handlers.py:ping"
# Or use a module path
export CUSTOM_FASTAPI_INVOCATION_HANDLER="model:my_invoke_function"  # `model` is an alias for $SAGEMAKER_MODEL_PATH/$CUSTOM_SCRIPT_FILENAME
export CUSTOM_FASTAPI_PING_HANDLER="vllm.entrypoints.openai.api_server:health"
4. Handler Resolution Priority
The system automatically resolves handlers in this order:
- Environment Variables (highest priority)
- Registry Decorators (@custom_ping_handler, @custom_invocation_handler - customer overrides)
- Function Discovery (functions in the custom script named custom_sagemaker_ping_handler, custom_sagemaker_invocation_handler)
- Framework Register Decorators (@register_ping_handler, @register_invocation_handler)
Key Differences:
- @register_ping_handler: Used by framework developers, automatically creates routes
- @custom_ping_handler: Used by customers to override framework behavior
- Function discovery: Simple functions automatically detected in customer scripts
Note: All handler detection and route setup happens automatically during bootstrap
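The fallback chain below is a purely illustrative sketch of this priority order (it is not the package's internal code); the registry and loader arguments are hypothetical stand-ins for the mechanisms described above:
import os
from typing import Callable, Dict, Optional

def resolve_ping_handler(
    env_loader: Callable[[str], Callable],
    customer_registry: Dict[str, Callable],
    discovered_functions: Dict[str, Callable],
    framework_registry: Dict[str, Callable],
) -> Optional[Callable]:
    """Illustrative fallback chain matching the documented priority order."""
    # 1. Environment variable (highest priority)
    handler_path = os.environ.get("CUSTOM_FASTAPI_PING_HANDLER")
    if handler_path:
        return env_loader(handler_path)
    # 2. Customer override registered with @custom_ping_handler
    if "ping" in customer_registry:
        return customer_registry["ping"]
    # 3. Function discovery: custom_sagemaker_ping_handler in the customer script
    if "custom_sagemaker_ping_handler" in discovered_functions:
        return discovered_functions["custom_sagemaker_ping_handler"]
    # 4. Framework default registered with @register_ping_handler
    return framework_registry.get("ping")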
Decorator Reference
Framework Decorators (for framework developers)
# Automatically create routes and register as framework defaults
@sagemaker_standards.register_ping_handler
@sagemaker_standards.register_invocation_handler
# LoRA adapter support
@sagemaker_standards.inject_adapter_id("model") # Replace mode (default)
@sagemaker_standards.inject_adapter_id("model", append=True, separator=":") # Append mode
Customer Decorators (for model customization)
# Override framework defaults (higher priority)
@sagemaker_standards.custom_ping_handler
@sagemaker_standards.custom_invocation_handler
# LoRA transform decorators
@sagemaker_standards.register_load_adapter_handler(request_shape={...}, response_shape={...})
@sagemaker_standards.register_unload_adapter_handler(request_shape={...}, response_shape={...})
# LoRA adapter injection modes
@sagemaker_standards.inject_adapter_id("model") # Replace mode (default)
@sagemaker_standards.inject_adapter_id("model", append=True, separator=":") # Append mode
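As a rough illustration of what the two injection modes do to the request body (plain dictionaries here, not the decorator's actual implementation), assuming the SageMaker LoRA header carries the adapter ID my-lora:
# Hypothetical request body and adapter ID (taken from the SageMaker LoRA API header).
original_body = {"model": "Qwen-7B", "prompt": "Hello"}
adapter_id = "my-lora"

# Replace mode: @inject_adapter_id("model") overwrites the field with the adapter ID.
replaced = {**original_body, "model": adapter_id}
assert replaced["model"] == "my-lora"

# Append mode: @inject_adapter_id("model", append=True, separator=":") keeps the
# original value and appends the adapter ID with the separator.
appended = {**original_body, "model": f"{original_body['model']}:{adapter_id}"}
assert appended["model"] == "Qwen-7B:my-lora"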
Framework Examples
vLLM Framework Integration
For vLLM framework developers, use register decorators to set up default handlers:
# In vLLM server code (e.g., vllm/entrypoints/openai/api_server.py)
import model_hosting_container_standards.sagemaker as sagemaker_standards
from fastapi import APIRouter, Depends, FastAPI, Request, Response
from http import HTTPStatus
import json
# ErrorResponse and validate_json_request are vLLM's own OpenAI-serving helpers;
# import them from wherever your framework defines them.
# Create router like real vLLM does
router = APIRouter()
@router.post("/ping", response_class=Response)
@router.get("/ping", response_class=Response)
@sagemaker_standards.register_ping_handler
async def ping(raw_request: Request) -> Response:
"""Default vLLM ping handler with automatic routing."""
return Response(
content='{"status": "healthy", "source": "vllm_default", "message": "Default ping from vLLM server"}',
media_type="application/json",
)
@router.post(
"/invocations",
dependencies=[Depends(validate_json_request)],
responses={
HTTPStatus.BAD_REQUEST.value: {"model": ErrorResponse},
HTTPStatus.UNSUPPORTED_MEDIA_TYPE.value: {"model": ErrorResponse},
HTTPStatus.INTERNAL_SERVER_ERROR.value: {"model": ErrorResponse},
},
)
@sagemaker_standards.register_invocation_handler
@sagemaker_standards.inject_adapter_id("model")
async def invocations(raw_request: Request) -> Response:
"""Default vLLM invocation handler with LoRA support."""
# Get request body safely
body_bytes = await raw_request.body()
try:
body = json.loads(body_bytes.decode()) if body_bytes else {}
except (json.JSONDecodeError, UnicodeDecodeError):
body = {}
# Adapter ID injected by decorator from SageMakerLoRAApiHeader
adapter_id = body.get("model", "base-model")
# Process with vLLM engine (your actual vLLM logic here)
# result = await vllm_engine.generate(body["prompt"], adapter_id=adapter_id)
response_data = {
"predictions": ["Generated text from vLLM"],
"source": "vllm_default",
"adapter_id": adapter_id,
"message": f"Response using adapter: {adapter_id}",
}
return Response(
content=json.dumps(response_data),
media_type="application/json",
)
# Alternative: append mode for model field
@sagemaker_standards.register_invocation_handler
@sagemaker_standards.inject_adapter_id("model", append=True, separator=":")
async def invocations_append_mode(raw_request: Request) -> Response:
"""vLLM invocation handler with adapter ID appending."""
body_bytes = await raw_request.body()
try:
body = json.loads(body_bytes.decode()) if body_bytes else {}
except (json.JSONDecodeError, UnicodeDecodeError):
body = {}
# If body has {"model": "Qwen-7B"} and header has "my-lora"
# Result will be {"model": "Qwen-7B:my-lora"}
model_with_adapter = body.get("model", "base-model")
response_data = {
"predictions": ["Generated text from vLLM"],
"model_used": model_with_adapter,
"message": f"Response using model: {model_with_adapter}",
}
return Response(
content=json.dumps(response_data),
media_type="application/json",
)
# Setup FastAPI app like real vLLM
app = FastAPI(title="vLLM Server", version="1.0.0")
app.include_router(router)
# Bootstrap SageMaker routes at the end (IMPORTANT!)
from model_hosting_container_standards.sagemaker.sagemaker_router import setup_ping_invoke_routes
setup_ping_invoke_routes(app)
Customer vLLM Customization
Customers can override vLLM's default behavior using customer scripts (model.py):
import model_hosting_container_standards.sagemaker as sagemaker_standards
from model_hosting_container_standards.logging_config import logger
from fastapi.responses import Response
from fastapi import Request, HTTPException
from http import HTTPStatus
import json
import pydantic
from vllm.entrypoints.openai.protocol import CompletionRequest
from vllm.entrypoints.openai.serving_completion import OpenAIServingCompletion
# Customer override decorators - higher priority than framework register decorators
@sagemaker_standards.custom_ping_handler
async def myping(raw_request: Request):
logger.info("Customer ping handler called")
return Response(status_code=200, content="Customer ping OK")
@sagemaker_standards.custom_invocation_handler
async def invocations(raw_request: Request):
"""Customer invocation handler for SageMaker."""
logger.info("Customer invocation handler called")
try:
body = await raw_request.json()
except json.JSONDecodeError as e:
raise HTTPException(
status_code=HTTPStatus.BAD_REQUEST.value,
detail=f"JSON decode error: {e}"
) from e
# Custom processing logic (custom_model_processing is your own function)
result = await custom_model_processing(body)
return result
# Or use simple functions (automatically discovered)
async def custom_sagemaker_ping_handler():
"""Simple ping function - automatically discovered."""
return {"status": "healthy", "custom": True}
async def custom_sagemaker_invocation_handler(request: Request):
"""Simple invoke function - automatically discovered."""
body = await request.json()
# Custom model logic
return {"result": "custom processing"}
logger.info("Customer handlers loaded - will override framework defaults")
Key Points:
- ✅ Framework Integration: Use @register_ping_handler for framework defaults
- ✅ Customer Overrides: Use @custom_ping_handler / @custom_invocation_handler or simple functions to customize
- ✅ Automatic Priority: Customer handlers automatically override framework defaults
- ✅ LoRA Support: Use @inject_adapter_id for adapter ID injection from headers
Adding Middleware to vLLM Integration
You can also add middleware to your vLLM integration:
import model_hosting_container_standards.sagemaker as sagemaker_standards
from model_hosting_container_standards.common.fastapi.middleware import custom_middleware, input_formatter, output_formatter
from model_hosting_container_standards.logging_config import logger
from fastapi import Request, Response
# Add throttling middleware
@custom_middleware("throttle")
async def rate_limit_middleware(request, call_next):
# Simple rate limiting example
client_ip = request.client.host
logger.info(f"Processing request from {client_ip}")
response = await call_next(request)
response.headers["X-Rate-Limited"] = "true"
return response
# Add request preprocessing
@input_formatter
async def preprocess_request(request):
# Log incoming requests
logger.info(f"Preprocessing request: {request.method} {request.url}")
return request
# Add response postprocessing
@output_formatter
async def postprocess_response(response):
# Add custom headers
response.headers["X-Processed-By"] = "model-hosting-standards"
return response
# Your existing handlers
@sagemaker_standards.custom_ping_handler
async def myping(raw_request: Request):
logger.info("Custom ping handler called")
return Response(status_code=201)
@sagemaker_standards.custom_invocation_handler
async def invocations(raw_request: Request):
# Your invocation logic here
pass
Example Commands
# Enable debug logging
SAGEMAKER_CONTAINER_LOG_LEVEL=DEBUG vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --dtype auto
# Custom ping handler from model.py
CUSTOM_FASTAPI_PING_HANDLER=model.py:myping vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --dtype auto
# Custom ping handler with absolute path
CUSTOM_FASTAPI_PING_HANDLER=/opt/ml/model/model.py:myping vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --dtype auto
# Use vLLM's built-in health endpoint as ping handler
CUSTOM_FASTAPI_PING_HANDLER=vllm.entrypoints.openai.api_server:health vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --dtype auto
# Add middleware via environment variables (file path)
CUSTOM_FASTAPI_MIDDLEWARE_THROTTLE=middleware.py:throttle_func vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --dtype auto
# Add middleware via module path
CUSTOM_FASTAPI_MIDDLEWARE_THROTTLE=my_middleware:RateLimitClass vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --dtype auto
# Combined middleware configuration
CUSTOM_FASTAPI_PING_HANDLER=model.py:myping \
CUSTOM_FASTAPI_MIDDLEWARE_THROTTLE=middleware_module:RateLimiter \
CUSTOM_PRE_PROCESS=processors:log_requests \
CUSTOM_POST_PROCESS=processors:add_headers \
vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --dtype auto
Handler Path Formats (see the sketch below):
- model.py:function_name - Relative path
- /opt/ml/model/handlers.py:ping - Absolute path
- vllm.entrypoints.openai.api_server:health - Module path
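All three formats follow the same target:function pattern. A minimal sketch of how such specs can be loaded (not the package's actual resolver; load_handler and the custom_handlers module name are hypothetical):
import importlib
import importlib.util
from typing import Callable

def load_handler(spec: str) -> Callable:
    """Illustrative loader for 'path/to/file.py:func' and 'package.module:func' specs."""
    target, func_name = spec.rsplit(":", 1)
    if target.endswith(".py"):
        # File path (relative or absolute): load the module directly from its location.
        module_spec = importlib.util.spec_from_file_location("custom_handlers", target)
        module = importlib.util.module_from_spec(module_spec)
        module_spec.loader.exec_module(module)
    else:
        # Module path: import it like any installed package.
        module = importlib.import_module(target)
    return getattr(module, func_name)
For example, load_handler("model.py:myping") would return the myping coroutine defined in model.py.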
Middleware Configuration
The package provides a flexible middleware system that supports both environment variable and decorator-based configuration.
Middleware Environment Variables
# Throttling middleware
export CUSTOM_FASTAPI_MIDDLEWARE_THROTTLE="throttle.py:rate_limit_middleware"
# Combined pre/post processing middleware
export CUSTOM_FASTAPI_MIDDLEWARE_PRE_POST_PROCESS="processing.py:combined_middleware"
# Using module paths (no file extension)
export CUSTOM_FASTAPI_MIDDLEWARE_THROTTLE="my_middleware_module:RateLimitMiddleware"
export CUSTOM_PRE_PROCESS="request_processors:log_and_validate"
# Separate pre/post processing (automatically combined)
export CUSTOM_PRE_PROCESS="preprocessing.py:pre_process_func"
export CUSTOM_POST_PROCESS="postprocessing.py:post_process_func"
Middleware Decorators
from model_hosting_container_standards.common.fastapi.middleware import (
custom_middleware,
input_formatter,
output_formatter,
)
# Register throttle middleware
@custom_middleware("throttle")
async def my_throttle_middleware(request, call_next):
# Rate limiting logic
response = await call_next(request)
return response
# Register combined pre/post middleware (function)
@custom_middleware("pre_post_process")
async def my_pre_post_middleware(request, call_next):
# Pre-processing
request = await pre_process(request)
# Call next middleware/handler
response = await call_next(request)
# Post-processing
response = await post_process(response)
return response
# Register middleware class
@custom_middleware("throttle")
class ThrottleMiddleware:
def __init__(self, app):
self.app = app
async def __call__(self, scope, receive, send):
# ASGI middleware implementation
# Rate limiting logic here
await self.app(scope, receive, send)
# Register input formatter (pre-processing only)
@input_formatter
async def pre_process(request):
# Modify request
return request
# Register output formatter (post-processing only)
@output_formatter
async def post_process(response):
# Modify response
return response
Middleware Priority
Environment Variables > Decorators
Environment variables always take priority over decorator-registered middleware:
# This decorator will be ignored if CUSTOM_FASTAPI_MIDDLEWARE_THROTTLE is set
@custom_middleware("throttle")
async def decorator_throttle(request, call_next):
return await call_next(request)
# Environment variable takes priority (can use module or file path)
# CUSTOM_FASTAPI_MIDDLEWARE_THROTTLE=throttle_module:ThrottleClass
# CUSTOM_FASTAPI_MIDDLEWARE_THROTTLE=env_throttle.py:env_throttle_func
Middleware Execution Order
Request → Throttle → Engine Middlewares → Pre/Post Process → Handler → Response
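To make the nesting concrete, here is a toy, package-independent sketch of how such a chain composes; the wrap helper and stage names are placeholders, not the package's internals:
import asyncio

async def handler(request):
    # Placeholder for the resolved /invocations handler.
    return f"handled {request}"

def wrap(name, next_call):
    """Wrap the next stage so each layer sees the request going in and the response coming out."""
    async def middleware(request):
        print(f"{name}: request in")
        response = await next_call(request)
        print(f"{name}: response out")
        return response
    return middleware

# Compose innermost-first so calls run in the documented order:
# Request -> Throttle -> Engine Middlewares -> Pre/Post Process -> Handler -> Response
chain = wrap("throttle", wrap("engine", wrap("pre/post process", handler)))
asyncio.run(chain("POST /invocations"))
Running it prints each stage entering in the documented order and unwinding in reverse on the way out.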
Configuration Reference
Environment Variables
from model_hosting_container_standards.common.fastapi.config import FastAPIEnvVars, FASTAPI_ENV_CONFIG
from model_hosting_container_standards.sagemaker import SageMakerEnvVars, SAGEMAKER_ENV_CONFIG
# FastAPI handler environment variables
FastAPIEnvVars.CUSTOM_FASTAPI_PING_HANDLER
FastAPIEnvVars.CUSTOM_FASTAPI_INVOCATION_HANDLER
# FastAPI middleware environment variables
FastAPIEnvVars.CUSTOM_FASTAPI_MIDDLEWARE_THROTTLE
FastAPIEnvVars.CUSTOM_FASTAPI_MIDDLEWARE_PRE_POST_PROCESS
FastAPIEnvVars.CUSTOM_PRE_PROCESS
FastAPIEnvVars.CUSTOM_POST_PROCESS
# SageMaker environment variables
SageMakerEnvVars.CUSTOM_SCRIPT_FILENAME
SageMakerEnvVars.SAGEMAKER_MODEL_PATH
Logging Control
The package provides centralized logging control using standard SageMaker environment variables.
By default, the package uses ERROR level logging, which effectively keeps it silent in production unless there are actual errors.
Log Level Configuration
# Set log level using SageMaker standard variable (recommended)
export SAGEMAKER_CONTAINER_LOG_LEVEL=DEBUG # or INFO, WARNING, ERROR (default)
# Alternative: Use generic LOG_LEVEL variable
export LOG_LEVEL=INFO # Falls back to this if SAGEMAKER_CONTAINER_LOG_LEVEL not set
Log Levels
- ERROR (default): Only errors are logged - effectively silent in normal operation
- WARNING: Errors and warnings
- INFO: Informational messages, warnings, and errors
- DEBUG: Detailed debug information including handler resolution
Log Format
All package logs use a consistent format:
[LEVEL] logger_name - filename:line: message
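If you want your own application's logs to match this layout, the same pattern can be reproduced with the standard logging module (this mirrors the documented format; it is not the package's internal configuration):
import logging

# Mirrors the documented "[LEVEL] logger_name - filename:line: message" layout.
formatter = logging.Formatter("[%(levelname)s] %(name)s - %(filename)s:%(lineno)d: %(message)s")

handler = logging.StreamHandler()
handler.setFormatter(formatter)

app_logger = logging.getLogger("my_app")
app_logger.addHandler(handler)
app_logger.setLevel(logging.INFO)
app_logger.info("model loaded")  # e.g. [INFO] my_app - serve.py:12: model loaded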
Examples
# Production: ERROR level by default (silent unless errors occur)
vllm serve model --dtype auto
# Development: Enable INFO level logging
SAGEMAKER_CONTAINER_LOG_LEVEL=INFO vllm serve model --dtype auto
# Debug mode: Enable detailed DEBUG logging
SAGEMAKER_CONTAINER_LOG_LEVEL=DEBUG vllm serve model --dtype auto
# Using alternative LOG_LEVEL variable
LOG_LEVEL=DEBUG vllm serve model --dtype auto
Note: These environment variables only control package logging. Your application's logging configuration is independent and unaffected.
Testing
Quick Endpoint Testing
# Start your service (example with vLLM)
vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --dtype auto
# Test ping
curl -i http://127.0.0.1:8000/ping
# Test invocation
curl -X POST http://localhost:8000/invocations \
-H "Content-Type: application/json" \
-d '{"prompt": "Hello!", "max_tokens": 50}'
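The same checks can be scripted in Python, for example with the requests library (assuming the server listens on 127.0.0.1:8000 and the payload fields match your handler):
import requests

BASE_URL = "http://127.0.0.1:8000"

# Ping should return 2xx once the server is up.
ping = requests.get(f"{BASE_URL}/ping", timeout=5)
print(ping.status_code, ping.text)

# Invocation with a simple prompt payload (adjust fields to match your handler).
payload = {"prompt": "Hello!", "max_tokens": 50}
resp = requests.post(f"{BASE_URL}/invocations", json=payload, timeout=30)
print(resp.status_code, resp.text)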
Development
Quick Development Setup
# Install dependencies and dev tools
make install
# Install pre-commit hooks (recommended)
make pre-commit-install
# Run all checks
make all
Development Commands
make install # Install dependencies
make format # Format code (black, isort)
make lint # Run linters (flake8, mypy)
make test # Run test suite
make all # Format, lint, and test
make clean # Clean build artifacts
Code Quality Tools
- Black (88 char line length) + isort for formatting
- flake8 + mypy for linting and type checking
- pytest for testing with coverage
- pre-commit hooks for automated checks
Architecture
Package Structure
model_hosting_container_standards/
├── common/ # Common utilities
│ ├── fastapi/ # FastAPI integration & env config
│ ├── custom_code_ref_resolver/ # Dynamic code loading
│ └── handler/ # Handler specifications & resolution
│ └── spec/ # Handler interface definitions
├── sagemaker/ # SageMaker decorators & handlers
│ ├── lora/ # LoRA adapter support
│ │ ├── models/ # LoRA request/response models
│ │ └── transforms/ # API transformation logic
│ └── sessions/ # Stateful session management
├── config.py # Configuration management
├── utils.py # Utility functions
└── logging_config.py # Centralized logging
Key Components
- Handler Registry: Central system for registering and resolving handlers
- Code Resolver: Dynamically loads handlers from customer code
- Environment Config: Manages configuration via environment variables
- Logging System: Comprehensive debug and operational logging
Contributing
When contributing to this project:
- Follow the established code quality standards
- Include comprehensive tests for new functionality
- Update documentation and type hints
- Run the full test suite before submitting changes
- Use the provided development tools and pre-commit hooks
License
TBD