    SGL Kernel

    Kernel Library for SGLang

    License: Apache-2.0

    SGL Kernel provides optimized compute primitives for the SGLang framework, enabling efficient inference for large language models and vision-language models through custom CUDA kernels for performance-critical operations.

    Installation

    Requires torch == 2.8.0

    # Latest version
    pip3 install sgl-kernel --upgrade
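
    A quick smoke test after installing (a minimal sketch; sgl_kernel only needs to import cleanly, and the torch version should match the requirement above):

    # smoke_test.py (hypothetical file name)
    import torch
    import sgl_kernel  # importing loads the compiled extension

    print("torch:", torch.__version__)  # expected: 2.8.0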
    

    Building from Source

    Requires

    • CMake ≥ 3.31
    • Python ≥ 3.10
    • scikit-build-core
    • ninja (optional)

    Use the Makefile to build sgl-kernel:

    make build
    

    Contribution

    Steps to add a new kernel:

    1. Implement the kernel in csrc
    2. Expose the interface in include/sgl_kernel_ops.h
    3. Create the torch extension in csrc/common_extension.cc
    4. Update CMakeLists.txt to include the new CUDA source
    5. Expose the Python interface in python (see the sketch after this list)
    6. Add tests and benchmarks
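
    For step 5, the Python wrapper typically dispatches to the op registered in csrc/common_extension.cc via torch.ops. A minimal sketch, assuming the ops live under the sgl_kernel namespace and reusing the bmm_fp8 schema shown under Development Tips below:

    import torch

    def bmm_fp8(A, B, D, A_scale, B_scale, workspace_buffer,
                cublas_handle, cuda_stream):
        # Dispatches to the CUDA binding; D is written in place,
        # as declared by the Tensor! annotation in the schema
        torch.ops.sgl_kernel.bmm_fp8(
            A, B, D, A_scale, B_scale, workspace_buffer,
            cublas_handle, cuda_stream,
        )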

    Development Tips

    1. When creating torch extensions, add the function definition with m.def and the device binding with m.impl:
    • How to write a schema: see the schema reference

      // We need def with schema here for torch.compile
      m.def(
       "bmm_fp8(Tensor A, Tensor B, Tensor! D, Tensor A_scale, Tensor B_scale, Tensor workspace_buffer, int "
       "cublas_handle, int cuda_stream) -> ()");
      m.impl("bmm_fp8", torch::kCUDA, &bmm_fp8);
      

    Adapting C++ Native Types for Torch Compatibility

    Third-party C++ libraries often use int and float, but PyTorch bindings require int64_t and double due to Python's type mapping.

    Use make_pytorch_shim from sgl_kernel_torch_shim.h to handle conversions automatically:

    
    // Add type conversion for int -> int64_t
    template <>
    struct pytorch_library_compatible_type<int> {
      using type = int64_t;
      static int convert_from_type(int64_t arg) {
        TORCH_CHECK(arg <= std::numeric_limits<int>::max(), "value too large");
        TORCH_CHECK(arg >= std::numeric_limits<int>::min(), "value too small");
        return arg;
      }
    };
    
    // Wrap your function
    m.impl("fwd", torch::kCUDA, make_pytorch_shim(&mha_fwd));
    

    Testing & Benchmarking

    1. Add pytest tests in tests/. If you need to skip a test, use @pytest.mark.skipif:
    # Illustrative condition: skip on GPUs below compute capability 10
    skip_condition = torch.cuda.get_device_capability() < (10, 0)

    @pytest.mark.skipif(
        skip_condition, reason="Nvfp4 requires compute capability of 10 or above."
    )
    
    2. Add benchmarks using the Triton benchmarking utilities in benchmark/

      We recommend using triton.testing.do_bench_cudagraph for kernel benchmarking (see the sketch after this list):

      Compared to triton.testing.do_bench, do_bench_cudagraph provides:

      • Reduced CPU overhead impact for more accurate kernel performance measurements
      • Incorporation of PDL (Programmatic Dependent Launch) effects into individual kernel results
      • More realistic performance data on PDL-supported architectures (SM >= 90)
    3. Run the test suite
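
    A minimal benchmarking sketch (the matmul workload and shapes are illustrative stand-ins for a real kernel):

    import torch
    import triton

    a = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)

    # Replays the op inside a CUDA graph, so CPU launch overhead is
    # excluded and PDL effects show up in the per-kernel timing
    ms = triton.testing.do_bench_cudagraph(lambda: torch.matmul(a, b))
    print(f"{ms:.3f} ms per call")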

    FAQ