sglang 0.4.0.post2
SGLang is yet another fast serving framework for large language models and vision language models.
pip install sglang
Requires Python: >=3.8
Dependencies
- requests
- tqdm
- numpy
- IPython
- setproctitle
- aiohttp ; extra == "runtime-common"
- decord ; extra == "runtime-common"
- fastapi ; extra == "runtime-common"
- hf_transfer ; extra == "runtime-common"
- huggingface_hub ; extra == "runtime-common"
- interegular ; extra == "runtime-common"
- modelscope ; extra == "runtime-common"
- orjson ; extra == "runtime-common"
- outlines<0.1.0,>=0.0.44 ; extra == "runtime-common"
- packaging ; extra == "runtime-common"
- pillow ; extra == "runtime-common"
- prometheus-client>=0.20.0 ; extra == "runtime-common"
- psutil ; extra == "runtime-common"
- pydantic ; extra == "runtime-common"
- python-multipart ; extra == "runtime-common"
- pyzmq>=25.1.2 ; extra == "runtime-common"
- torchao>=0.7.0 ; extra == "runtime-common"
- gemlite ; extra == "runtime-common"
- uvicorn ; extra == "runtime-common"
- uvloop ; extra == "runtime-common"
- xgrammar>=0.1.6 ; extra == "runtime-common"
- sglang[runtime_common] ; extra == "srt"
- torch ; extra == "srt"
- vllm<=0.6.4.post1,>=0.6.3.post1 ; extra == "srt"
- cuda-python ; extra == "srt"
- flashinfer==0.1.6 ; extra == "srt"
- sglang[runtime_common] ; extra == "srt-hip"
- torch ; extra == "srt-hip"
- vllm==0.6.3.dev13 ; extra == "srt-hip"
- sglang[runtime_common] ; extra == "srt-xpu"
- sglang[runtime_common] ; extra == "srt-hpu"
- openai>=1.0 ; extra == "openai"
- tiktoken ; extra == "openai"
- anthropic>=0.20.0 ; extra == "anthropic"
- litellm>=1.0.0 ; extra == "litellm"
- jsonlines ; extra == "test"
- matplotlib ; extra == "test"
- pandas ; extra == "test"
- sentence_transformers ; extra == "test"
- accelerate ; extra == "test"
- peft ; extra == "test"
- sglang[srt] ; extra == "all"
- sglang[openai] ; extra == "all"
- sglang[anthropic] ; extra == "all"
- sglang[litellm] ; extra == "all"
- sglang[srt_hip] ; extra == "all-hip"
- sglang[openai] ; extra == "all-hip"
- sglang[anthropic] ; extra == "all-hip"
- sglang[litellm] ; extra == "all-hip"
- sglang[srt_xpu] ; extra == "all-xpu"
- sglang[openai] ; extra == "all-xpu"
- sglang[anthropic] ; extra == "all-xpu"
- sglang[litellm] ; extra == "all-xpu"
- sglang[srt_hpu] ; extra == "all-hpu"
- sglang[openai] ; extra == "all-hpu"
- sglang[anthropic] ; extra == "all-hpu"
- sglang[litellm] ; extra == "all-hpu"
- sglang[all] ; extra == "dev"
- sglang[test] ; extra == "dev"
- sglang[all_hip] ; extra == "dev-hip"
- sglang[test] ; extra == "dev-hip"
- sglang[all_xpu] ; extra == "dev-xpu"
- sglang[test] ; extra == "dev-xpu"
- sglang[all_hpu] ; extra == "dev-hpu"
- sglang[test] ; extra == "dev-hpu"
| Blog | Documentation | Join Slack | Join Bi-Weekly Development Meeting | Slides |
News
- [2024/12] 🔥 SGLang v0.4: Zero-Overhead Batch Scheduler, Cache-Aware Load Balancer, Faster Structured Outputs (blog).
- [2024/10] 🔥 The First SGLang Online Meetup (slides).
- [2024/09] SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision (blog).
- [2024/07] Faster Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM) (blog).
More
- [2024/02] SGLang enables 3x faster JSON decoding with compressed finite state machine (blog).
- [2024/04] SGLang is used by the official LLaVA-NeXT (video) release (blog).
- [2024/01] SGLang provides up to 5x faster inference with RadixAttention (blog).
- [2024/01] SGLang powers the serving of the official LLaVA v1.6 release demo (usage).
About
SGLang is a fast serving framework for large language models and vision language models. It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language. The core features include:
- Fast Backend Runtime: Provides efficient serving with RadixAttention for prefix caching, jump-forward constrained decoding, zero-overhead CPU scheduler, continuous batching, token attention (paged attention), tensor parallelism, FlashInfer kernels, chunked prefill, and quantization (FP8/INT4/AWQ/GPTQ).
- Flexible Frontend Language: Offers an intuitive interface for programming LLM applications, including chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions (see the sketch after this list).
- Extensive Model Support: Supports a wide range of generative models (Llama, Gemma, Mistral, QWen, DeepSeek, LLaVA, etc.), embedding models (e5-mistral, gte, mcdse) and reward models (Skywork), with easy extensibility for integrating new models.
- Active Community: SGLang is open-source and backed by an active community with industry adoption.
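To make the frontend language concrete, here is a minimal sketch of a chained multi-turn program. It assumes an SRT server is already running locally on port 30000 (the address, prompts, and variable names are illustrative), and the regex argument shows one of the constrained-decoding hooks:

import sglang as sgl

@sgl.function
def multi_turn(s, question1, question2):
    s += sgl.system("You are a helpful assistant.")
    s += sgl.user(question1)
    s += sgl.assistant(sgl.gen("answer_1", max_tokens=128))
    s += sgl.user(question2)
    # Constrained decoding: force the second answer to be a single word.
    s += sgl.assistant(sgl.gen("answer_2", max_tokens=8, regex=r"[A-Za-z]+"))

# Point the frontend at a running SRT server (address is an assumption).
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

state = multi_turn.run(question1="What is prefix caching?", question2="One-word verdict?")
print(state["answer_2"])

Because both generation calls live in one program, the shared prefix (system prompt plus first turn) is cached by RadixAttention rather than recomputed.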
Getting Started
- Install SGLang
- Send requests (see the example after this list)
- Backend: SGLang Runtime (SRT)
- Frontend: Structured Generation Language (SGLang)
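As a concrete starting point, a minimal sketch of the first two steps: launch the backend, then query it through its OpenAI-compatible API. The model path and port below are assumptions; any supported model works:

python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000

Then, from Python:

import openai

# SRT exposes an OpenAI-compatible endpoint under /v1; no real key is needed locally.
client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain RadixAttention in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)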
Benchmark and Performance
Learn more in our release blogs: v0.2 blog, v0.3 blog, v0.4 blog
Roadmap
Adoption and Sponsorship
The project is supported by (alphabetically): AMD, Baseten, Etched, Hyperbolic, Jam & Tea Studios, LinkedIn, Meituan, NVIDIA, RunPod, Stanford, UC Berkeley, xAI and 01.AI.
Acknowledgment and Citation
We learned from the design and reused code from the following projects: Guidance, vLLM, LightLLM, FlashInfer, Outlines, and LMQL. Please cite our paper, SGLang: Efficient Execution of Structured Language Model Programs, if you find the project useful.