SGL Kernel
Kernel Library for SGLang
SGL Kernel provides optimized compute primitives for the SGLang framework, enabling efficient inference for large language models and vision-language models through custom kernels for core operations.
Installation
Requires torch == 2.8.0

```bash
# Latest version
pip3 install sgl-kernel --upgrade
```
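To verify the install, a minimal sketch like the following imports the package and checks the environment (this is only an illustration; nothing here is part of sgl-kernel's API):

```python
import torch
import sgl_kernel  # noqa: F401  # fails here if the wheel did not install correctly

# sgl-kernel currently targets torch == 2.8.0 with a CUDA-enabled build.
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```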
Building from Source
Requires:
- CMake ≥3.31
- Python ≥3.10
- scikit-build-core
- ninja (optional)
Use the Makefile to build sgl-kernel:

```bash
make build
```
Contribution
Steps to add a new kernel:
- Implement the kernel in csrc
- Expose the interface in include/sgl_kernel_ops.h
- Create torch extension in csrc/common_extension.cc
- Update CMakeLists.txt to include the new CUDA source
- Expose the Python interface in python (see the sketch after this list)
- Add tests and benchmarks
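As an illustration of the Python-interface step, a thin wrapper in python/ typically forwards to the registered torch op. This is a minimal sketch only; the kernel name, signature, and the sgl_kernel op namespace are assumptions, not the library's actual layout:

```python
import torch

# Hypothetical wrapper (illustrative names): forwards to the C++/CUDA op that
# common_extension.cc registered via m.def/m.impl.
def my_new_kernel(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    torch.ops.sgl_kernel.my_new_kernel(x, y, out)  # op namespace assumed for illustration
    return out
```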
Development Tips
- When creating torch extensions, add the function definition with m.def, and the device binding with m.impl:
  - How to write the schema: see the Schema reference

```cpp
// We need def with schema here for torch.compile
m.def(
    "bmm_fp8(Tensor A, Tensor B, Tensor! D, Tensor A_scale, Tensor B_scale, Tensor workspace_buffer, int "
    "cublas_handle, int cuda_stream) -> ()");
m.impl("bmm_fp8", torch::kCUDA, &bmm_fp8);
```
Adapting C++ Native Types for Torch Compatibility
Third-party C++ libraries often use int and float, but PyTorch bindings require int64_t and double due to Python's type mapping.
Use make_pytorch_shim from sgl_kernel_torch_shim.h to handle conversions automatically:
```cpp
// Add type conversion for int -> int64_t
template <>
struct pytorch_library_compatible_type<int> {
  using type = int64_t;
  static int convert_from_type(int64_t arg) {
    TORCH_CHECK(arg <= std::numeric_limits<int>::max(), "value too large");
    TORCH_CHECK(arg >= std::numeric_limits<int>::min(), "value too small");
    return arg;
  }
};

// Wrap your function
m.impl("fwd", torch::kCUDA, make_pytorch_shim(&mha_fwd));
```
Testing & Benchmarking
- Add pytest tests in tests/. If you need to skip a test, use @pytest.mark.skipif:

```python
@pytest.mark.skipif(
    skip_condition, reason="Nvfp4 Requires compute capability of 10 or above."
)
```
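For example, a skip condition like the one above can be derived from the device's compute capability. This is a minimal sketch; the helper and test body are placeholders, not existing sgl-kernel tests:

```python
import pytest
import torch

# Hypothetical helper: NVFP4 kernels need compute capability 10.x or newer.
def _has_sm100_or_newer() -> bool:
    if not torch.cuda.is_available():
        return False
    major, _ = torch.cuda.get_device_capability()
    return major >= 10

@pytest.mark.skipif(
    not _has_sm100_or_newer(),
    reason="Nvfp4 Requires compute capability of 10 or above.",
)
def test_nvfp4_kernel_placeholder():
    # Placeholder body: call the custom kernel and compare against a PyTorch reference.
    assert True
```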
- Add benchmarks using Triton's benchmarking utilities in benchmark/

  We recommend using triton.testing.do_bench_cudagraph for kernel benchmarking (see the usage sketch below). Compared to triton.testing.do_bench, do_bench_cudagraph provides:
  - Reduced CPU overhead impact for more accurate kernel performance measurements
  - Incorporation of PDL (Programmatic Dependent Launch) effects into individual kernel results
  - More realistic performance data on PDL-supported architectures (SM >= 90)
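The following minimal sketch shows the intended usage of do_bench_cudagraph; the matmul is only a stand-in for an sgl-kernel op and is not part of the benchmark suite:

```python
import torch
import triton

a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
fn = lambda: torch.matmul(a, b)  # stand-in for the kernel under test

ms_eager = triton.testing.do_bench(fn)            # includes more CPU launch overhead
ms_graph = triton.testing.do_bench_cudagraph(fn)  # replays the op inside a CUDA graph
print(f"do_bench: {ms_eager:.3f} ms, do_bench_cudagraph: {ms_graph:.3f} ms")
```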
- Run the test suite
FAQ
- Q: Segmentation fault with CUDA 12.6
- A: Update ptxas to 12.8 (see the segment fault error reference).