gptqmodel 4.2.5
Production-ready LLM model compression/quantization toolkit with hardware-accelerated inference support for both CPU and GPU via HF, vLLM, and SGLang.
pip install gptqmodel
Requires Python >=3.9.0
Dependencies
- pytest >=8.2.2; extra == "test"
- parameterized; extra == "test"
- ruff ==0.13.0; extra == "quality"
- isort ==6.0.1; extra == "quality"
- vllm >=0.8.5; extra == "vllm"
- flashinfer-python >=0.2.1; extra == "vllm"
- sglang[srt] >=0.4.6; extra == "sglang"
- flashinfer-python >=0.2.1; extra == "sglang"
- bitblas ==0.0.1-dev13; extra == "bitblas"
- optimum >=1.21.2; extra == "hf"
- intel_extension_for_pytorch >=2.7.0; extra == "ipex"
- auto_round >=0.3; extra == "auto-round"
- clearml; extra == "logger"
- random_word; extra == "logger"
- plotly; extra == "logger"
- lm_eval >=0.4.7; extra == "eval"
- evalplus >=0.3.1; extra == "eval"
- triton >=3.0.0; extra == "triton"
- uvicorn; extra == "openai"
- fastapi; extra == "openai"
- pydantic; extra == "openai"
- mlx_lm >=0.24.0; extra == "mlx"
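The extras listed above are optional install targets. Assuming standard pip extras syntax (the extra names are taken directly from the list above), pulling in, for example, the vLLM backend or the evaluation tooling looks like:

pip install gptqmodel[vllm]
pip install gptqmodel[eval]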
GPTQModel
LLM model quantization (compression) toolkit with hardware acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU, and Intel/AMD/Apple CPU via HF, vLLM, and SGLang.
Latest News
- 09/16/2025 4.2.5: `hyb_act` renamed to `act_group_aware`. Removed finicky `torch` import within `setup.py`. Packing bug fix and prebuilt PyTorch 2.8 wheels.
- 09/12/2025 4.2.0: ✨ New models support: Qwen3-Next, Apertus, Kimi K2, Klear, FastLLM, Nemotron H. New `fail_safe` boolean toggle for `.quantize()` to patch-fix non-activated MoE modules caused by highly uneven MoE model training. Fixed LavaQwen2 compat. Patch fix for GIL=0 CUDA error on multi-GPU. Fixed compat with AutoRound + new Transformers.
- 09/04/2025 4.1.0: ✨ Meituan LongCat Flash Chat, Llama 4, GPT-OSS (BF16), and GLM-4.5-Air support. New experimental `mock_quantization` config to skip complex computational code paths during quantization to accelerate model quant testing.
- 08/21/2025 4.0.0: 🎉 New Group Aware Reordering (GAR) support. New models support: Bytedance Seed-OSS, Baidu Ernie, Huawei PanGu, Gemma3, Xiaomi Mimo, Qwen 3/MoE, Falcon H1, GPT-Neo. Memory leak and multiple model compatibility fixes related to Transformers >= 4.54. Python >= 3.13t free-threading support added with near N x GPU linear scaling for quantization of MoE models and linear N x CPU-core scaling of the packing stage. Early access PyTorch 2.8 fused ops on Intel XPU for up to 50% speedup.
- 08/19/2025 4.0.0-dev `main`: Fixed quantization memory usage due to some models' incorrect application of `config.use_cache` during inference. Fixed `Transformers` >= 4.54.0 compat, which changed the layer forward return signature for some models.
- 08/18/2025 4.0.0-dev `main`: GPT-Neo model support. Memory leak fix in error capture (stacktrace) and fixed `lm_head` quantization compatibility for many models.
- 07/31/2025 4.0.0-dev `main`: New Group Aware Reordering (GAR) support and preliminary PyTorch 2.8 fused ops for Intel XPU for up to 50% speedup.
- 07/03/2025 4.0.0-dev `main`: New Baidu Ernie and Huawei PanGu model support.
Archived News
- 07/02/2025 4.0.0-dev `main`: Gemma3 4B model compat fix.
- 05/29/2025 4.0.0-dev `main`: Falcon H1 model support. Fixed Transformers `4.52+` compat with Qwen 2.5 VL models.
- 05/19/2025 4.0.0-dev `main`: Qwen 2.5 Omni model support.
- 05/05/2025 4.0.0-dev `main`: Python 3.13t free-threading support added with near N x GPU linear scaling for quantization of MoE models and linear N x CPU-core scaling of the packing stage.
- 04/29/2025 3.1.0-dev (now 4.0) `main`: Xiaomi Mimo model support. Qwen 3 and Qwen 3 MoE model support. New `quantize(..., calibration_dataset_min_length=10)` arg to filter out bad calibration data that exists in public datasets (wikitext).
- 04/13/2025 3.0.0: 🎉 New experimental `GPTQ v2` quantization option for improved model quantization accuracy, validated by `GSM8K_PLATINUM` benchmarks vs original `gptq`. New `Phi4-MultiModal` model support. New Nvidia Nemotron-Ultra model support. New `Dream` model support. New experimental `multi-gpu` quantization support. Reduced VRAM usage. Faster quantization.
- 04/02/2025 2.2.0: New `Qwen 2.5 VL` model support. New `samples` log column during quantization to track module activation in MoE models. `Loss` log column is now color-coded to highlight modules that are friendly/resistant to quantization. Progress (per-step) stats during quantization are now streamed to the log file. Auto `bfloat16` dtype loading for models based on model config. Fixed kernel compile for PyTorch/ROCm. Slightly faster quantization and auto-resolution of some low-level OOM issues for smaller-VRAM GPUs.
- 03/12/2025 2.1.0: ✨ New `QQQ` quantization method and inference support! New Google `Gemma 3` zero-day model support. New Alibaba `Ovis 2` VL model support. New AMD `Instella` zero-day model support. New `GSM8K Platinum` and `MMLU-Pro` benchmarking support. PEFT LoRA training with GPTQModel is now 30%+ faster on all GPU and IPEX devices. Auto-detect MoE modules not activated during quantization due to insufficient calibration data. `ROCm` `setup.py` compat fixes. `Optimum` and `Peft` compat fixes. Fixed `Peft` `bfloat16` training.
- 03/03/2025 2.0.0: 🎉 `GPTQ` quantization internals are now broken into multiple stages (processes) for feature expansion. Synced `Marlin` kernel inference quality fix from upstream. Added `MARLIN_FP16`, a lower-quality but faster backend. `ModelScope` support added. Logging and CLI progress bar output have been revamped with a sticky bottom progress bar. Fixed `generation_config.json` save and load. Fixed Transformers v4.49.0 compat. Fixed compat of models without `bos`. Fixed `group_size=-1` and `bits=3` packing regression. Fixed Qwen 2.5 MoE regressions. Added CI tests to track regressions in kernel inference quality and sweep all bits/group_sizes. Delegated logging/progress bar to the LogBar pkg. Fixed ROCm version auto-detection in `setup` install.
- 02/12/2025 1.9.0: ⚡ Offloaded `tokenizer` fixes to the Toke(n)icer pkg. Optimized `lm_head` quant time and VRAM usage. Optimized `DeepSeek v3/R1` model quant VRAM usage. Fixed `Optimum` compat regression in `v1.8.1`. 3x speed-up for the `Torch` kernel when using PyTorch >= 2.5.0 with `model.optimize()`. New `calibration_dataset_concat_size` option to enable calibration data `concat` mode to mimic the original GPTQ data packing strategy, which may improve quant speed and accuracy for datasets like `wikitext2`.
- 02/08/2025 1.8.1: ⚡ `DeepSeek v3/R1` model support. New flexible weight `packing`: quantized weights may now be packed to `[int32, int16, int8]` dtypes. `Triton` and `Torch` kernels support the full range of the new `QuantizeConfig.pack_dtype`. New `auto_gc: bool` control in `quantize()`, which can reduce quantization time for small models with no chance of OOM. New `GPTQModel.push_to_hub()` api for easy quant model upload to an HF repo. New `buffered_fwd: bool` control in `model.quantize()`. Over 50% quantization speed-up for visual (vl) models. Fixed `bits=3` packing and `group_size=-1` regression in v1.7.4.
- 01/26/2025 1.7.4: New `compile()` api for ~4-8% inference tps improvement. Faster `pack()` for post-quantization model save. `Triton` kernel validated for Intel/`XPU` when Intel Triton packages are installed. Fixed Transformers (bug) downcasting tokenizer class on save.
- 01/20/2025 1.7.3: New Telechat2 (China Telecom) and PhiMoE model support. Fixed `lm_head` weights duplicated in post-quantize save() for models with tied embeddings.
- 01/19/2025 1.7.2: Effective BPW (bits per weight) is now logged during `load()`. Reduced loading time on Intel Arc A770/B580 `XPU` by 3.3x. Reduced memory usage in MLX conversion and fixed Marlin kernel auto-select not checking CUDA compute version.
- 01/17/2025 1.7.0: 🎉 ✨ `backend.MLX` added for runtime conversion and execution of GPTQ models on Apple's `MLX` framework on Apple Silicon (M1+). Exports of `gptq` models to `mlx` are also now possible. We have added `mlx` exported models to huggingface.co/ModelCloud. ✨ `lm_head` quantization is now fully supported by GPTQModel without external pkg dependency.
- 01/07/2025 1.6.1: 🎉 New OpenAI api compatible end-point via `model.serve(host, port)`. Auto-enable flash-attention2 for inference. Fixed `sym=False` loading regression.
- 01/06/2025 1.6.0: ⚡ 25% faster quantization. 35% reduction in VRAM usage vs v1.5. 🎉 AMD ROCm (6.2+) support added and validated for 7900XT+ GPUs. Auto-tokenizer loader via the `load()` api. For most models you no longer need to manually init a tokenizer for either inference or quantization.
- 01/01/2025 1.5.1: 🎉 2025! Added `QuantizeConfig.device` to clearly define which device is used for quantization: default = `auto`. Non-quantized models are always loaded on CPU by default and each layer is moved to `QuantizeConfig.device` during quantization to minimize VRAM usage. Compatibility fixes for `attn_implementation_autoset` in latest Transformers.
- 12/23/2024 1.5.0: Multi-modal (image-to-text) optimized quantization support has been added for Qwen 2-VL and Ovis 1.6-VL. Previous image-to-text model quantizations did not use image calibration data, resulting in less-than-optimal post-quantization results. Version 1.5.0 is the first release to provide a stable path for multi-modal quantization: only text layers are quantized.
- 12/19/2024 1.4.5: Windows 11 support added/validated. Ovis VL model support with image dataset calibration. Fixed `dynamic` loading. Reduced quantization VRAM usage.
- 12/15/2024 1.4.2: MacOS `gpu` (Metal) and `cpu` (M+) support added/validated for inference and quantization. Cohere 2 model support added.
- 12/13/2024 1.4.1: Added Qwen2-VL model support. `mse` quantization control exposed in `QuantizeConfig`. Monkey patch `patch_vllm()` and `patch_hf()` apis added to allow Transformers/Optimum/PEFT and vLLM to correctly load GPTQModel quantized models while upstream PRs are pending.
- 12/10/2024 1.4.0: `EvalPlus` harness integration merged upstream. We now support both `lm-eval` and `EvalPlus`. Added pure torch `Torch` kernel. Refactored `Cuda` kernel into the `DynamicCuda` kernel. `Triton` kernel is now auto-padded for max model support. `Dynamic` quantization now supports both positive `+:` (default) and negative `-:` matching, which allows matched modules to be skipped entirely for quantization. Fixed auto `Marlin` kernel selection. Added auto-kernel fallback for unsupported kernel/module pairs. Lots of internal refactoring and cleanup in preparation for the transformers/optimum/peft upstream PR merge. Deprecated saving in the `Marlin` weight format since `Marlin` supports auto conversion of the `gptq` format to `Marlin` during runtime.
- 11/29/2024 1.3.1: Olmo2 model support. Intel XPU acceleration via IPEX. Model sharding Transformers compat fix due to api deprecation in HF. Removed triton dependency; the Triton kernel is now optionally dependent on the triton pkg.
- 11/26/2024 1.3.0: Zero-day Hymba model support. Removed `tqdm` and `rogue` dependencies.
- 11/24/2024 1.2.3: HF GLM model support. ClearML logging integration. Use `device-smi` and replace `gputil` + `psutil` depends. Fixed model unit tests.
- 11/11/2024 🎉 1.2.1: Meta MobileLLM model support added. `lm-eval[gptqmodel]` integration merged upstream. Intel/IPEX cpu inference merged, replacing QBits (deprecated). Auto-fix/patch ChatGLM-3/GLM-4 compat with latest transformers. New `.load()` and `.save()` api.
- 10/29/2024 🎉 1.1.0: IBM Granite model support. Full auto-buildless wheel install from PyPI. Reduced max CPU memory usage by >20% during quantization. 100% CI model/feature coverage.
- 10/12/2024 ✨ 1.0.9: Moved AutoRound to optional and fixed pip install regression in v1.0.8.
- 10/11/2024 ✨ 1.0.8: Added wheels for Python 3.12 and CUDA 11.8.
- 10/08/2024 ✨ 1.0.7: Fixed the marlin (faster) kernel not being auto-selected for some models.
- 09/26/2024 ✨ 1.0.6: Fixed the quantized Llama 3.2 Vision model loader.
- 09/26/2024 ✨ 1.0.5: Partial Llama 3.2 Vision model support (mllama): only text-layer quantization is supported for now.
- 09/26/2024 ✨ 1.0.4: Integrated Liger Kernel support for ~1/2 memory reduction on some models during quantization. Added control toggle to disable parallel packing.
- 09/18/2024 ✨ 1.0.3: Added Microsoft GRIN-MoE and MiniCPM3 support.
- 08/16/2024 ✨ 1.0.2: Support for Intel/AutoRound v0.3, pre-built whl packages, and PyPI release.
- 08/14/2024 ✨ 1.0.0: 40% faster `packing`, fixed Python 3.9 compat, added `lm_eval` api.
- 08/10/2024 🎉 0.9.11: Added LG EXAONE 3.0 model support. New `dynamic` per-layer/module flexible quantization where each layer/module may have different bits/params. Added proper sharding support to `backend.BITBLAS`. Auto-heal quantization errors due to small damp values.
- 07/31/2024 🎉 0.9.10: Ported the vllm/nm `gptq_marlin` inference kernel with expanded bits (8-bit), group_size (64, 32), and desc_act support for all GPTQ models with `FORMAT.GPTQ`. Auto-calculate auto-round nsamples/seqlen parameters based on the calibration dataset. Fixed save_quantized() called on pre-quantized models with non-supported backends. HF transformers dependency updated to ensure Llama 3.1 fixes are correctly applied to both quant and inference.
- 07/25/2024 🎉 0.9.9: Added Llama-3.1 support, Gemma2 27B quant inference support via vLLM, auto pad_token normalization, fixed auto-round quant compat for vLLM/SGLang, and more.
- 07/13/2024 🎉 0.9.8: Run quantized models directly in GPTQModel using the fast `vLLM` or `SGLang` backend! Both vLLM and SGLang are optimized for dynamic batching inference for maximum `TPS` (check usage under examples). The Marlin backend also got full end-to-end in/out features padding to enhance current/future model compatibility.
- 07/08/2024 🎉 0.9.7: InternLM 2.5 model support added.
- 07/08/2024 🎉 0.9.6: Intel/AutoRound QUANT_METHOD support added for potentially higher quality quantization, with `lm_head` module quantization support for even more VRAM reduction; format export to `FORMAT.GPTQ` for max inference compatibility.
- 07/05/2024 🎉 0.9.5: Cuda kernels have been fully deprecated in favor of Exllama (v1/v2)/Marlin/Triton.
- 07/03/2024 🎉 0.9.4: HF Transformers integration added and bug-fixed Gemma 2 support.
- 07/02/2024 🎉 0.9.3: Added Gemma 2 support, faster PPL calculations on GPU, and more code/arg refactoring.
- 06/30/2024 🎉 0.9.2: Added auto-padding of model in/out-features for exllama and exllama v2. Fixed quantization of OPT and DeepSeek V2-Lite models. Fixed inference for DeepSeek V2-Lite.
- 06/29/2024 🎉 0.9.1: 3 new models (DeepSeek-V2, DeepSeek-V2-Lite, DBRX Converted), new BITBLAS format/kernel, proper batching of the calibration dataset resulting in >50% quantization speedup, a security hash check of loaded model weights, tons of refactoring/usability improvements, bug fixes and much more.
- 06/20/2024 ✨ 0.9.0: Thanks for all the work from the ModelCloud team and the open-source ML community for their contributions!
What is GPTQModel?
GPTQModel is a production-ready LLM model compression/quantization toolkit with hardware-accelerated inference support for both CPU and GPU via HF Transformers, vLLM, and SGLang.
Public and ModelCloud's internal tests have shown that GPTQ is on par with or exceeds other 4-bit quantization methods in both quality recovery and production-level inference speed (token latency and RPS). GPTQ offers the blend of quality and inference speed needed in a real-world production deployment.
GPTQModel supports not only GPTQ but also QQQ, GPTQ v2, and EoRA, with more quantization methods and enhancements planned.
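A minimal end-to-end sketch of this workflow is shown below. The model id, output path, and calibration strings are placeholders, and the exact `GPTQModel.load()`/`quantize()`/`save()` signatures and `QuantizeConfig` fields are assumptions inferred from the api names in the release notes above, not a verbatim copy of the official example:

```python
from gptqmodel import GPTQModel, QuantizeConfig

# Placeholder model id and output path -- substitute your own.
model_id = "meta-llama/Llama-3.2-1B-Instruct"
quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit"

# A real run should use several hundred calibration samples (e.g. c4 or wikitext2);
# a short list of strings is shown here only to keep the sketch self-contained.
calibration_dataset = [
    "GPTQ quantizes weights layer by layer using calibration activations.",
    "The capital of France is Paris.",
]

# 4-bit weights with group_size=128 is the common default configuration.
quant_config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.load(model_id, quant_config)  # load the fp16/bf16 source model
model.quantize(calibration_dataset)             # run GPTQ over the calibration data
model.save(quant_path)                          # write quantized weights + quant config
```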
Quantization Support
GPTQModel has a modular design supporting multiple quantization methods and feature extensions.
| Quantization Feature | GPTQModel | Transformers | vLLM | SGLang | Lora Training |
|---|---|---|---|---|---|
| GPTQ | ✅ | ✅ | ✅ | ✅ | ✅ |
| EoRA | ✅ | ✅ | ✅ | ✅ | x |
| GPTQ v2 | ✅ | ✅ | ✅ | ✅ | |
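For inference, a quantized checkpoint can be loaded back with the same `GPTQModel.load()` api and driven through regular HF-style generation. This is a hedged sketch: the path is a placeholder, and the assumption that the loaded model exposes `device` and delegates `generate()` to the underlying HF model is inferred from the notes above rather than guaranteed:

```python
from transformers import AutoTokenizer
from gptqmodel import GPTQModel

quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit"  # placeholder: any GPTQ checkpoint

tokenizer = AutoTokenizer.from_pretrained(quant_path)
model = GPTQModel.load(quant_path)  # kernel/backend is auto-selected per the notes above

prompt = "The capital of France is"
# `model.device` is assumed to point at wherever the weights were placed (cpu/cuda/xpu/mps).
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=16)  # assumed HF-style generate delegation
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```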