Streamlining LLM Inference: AutoRound Joins LLM Compressor
Serving Large Language Models (LLMs) efficiently without compromise remains a challenge: there is always a trade-off between speed and accuracy. Intel aims to close that gap by embedding its quantization algorithm, AutoRound, directly into LLM Compressor. This collaboration gives developers a seamless workflow to quantize models with minimal loss of accuracy and serve them directly with vLLM.
What Is AutoRound?
AutoRound is an advanced Post-Training Quantization (PTQ) algorithm tailor-made for generative AI models, including LLMs and VLMs. Whereas standard quantization techniques tend to degrade model quality, AutoRound introduces trainable parameters to optimize the rounding of weights.
It works by processing the decoder layers sequentially, adjusting rounding offsets and clipping ranges with signed gradient descent. The result is a quantized model that maintains high accuracy even at low bit-widths. Although developed by Intel, AutoRound is hardware agnostic and runs on Intel Xeon CPUs, Gaudi accelerators, Arc GPUs, and all CUDA-based devices.
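To make the mechanism concrete, here is a minimal, illustrative sketch (not the library's actual implementation; the helper names and hyperparameters are hypothetical): a trainable rounding offset and a clipping factor are tuned with signed gradient descent so that the quantized layer reproduces the full-precision output on calibration data.
import torch

def ste_round(t):
    # Straight-through estimator: round in the forward pass, identity gradient in the backward pass
    return (torch.round(t) - t).detach() + t

def quantize_weight(w, v, alpha, bits=4):
    # Per-output-channel symmetric scale, stretched or shrunk by the learnable clip factor alpha
    qmax = 2 ** (bits - 1) - 1
    scale = (w.abs().amax(dim=1, keepdim=True) * alpha) / qmax
    # Core idea: add the trainable rounding offset v before rounding
    q = torch.clamp(ste_round(w / scale + v), -qmax - 1, qmax)
    return q * scale

def signed_gd_step(params, lr=5e-3):
    # Signed gradient descent: take a fixed-size step in the direction of each gradient's sign
    for p in params:
        if p.grad is not None:
            p.data -= lr * torch.sign(p.grad)
            p.grad = None

# Toy calibration loop: tune v and alpha so the quantized layer matches the dense output
w, x = torch.randn(64, 64), torch.randn(128, 64)
v = torch.zeros_like(w, requires_grad=True)      # rounding offsets
alpha = torch.ones(64, 1, requires_grad=True)    # clipping factors
for _ in range(200):
    loss = torch.nn.functional.mse_loss(x @ quantize_weight(w, v, alpha).T, x @ w.T)
    loss.backward()
    signed_gd_step([v, alpha])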
Key Benefits of Integration
Bringing AutoRound into the LLM Compressor ecosystem delivers several distinct advantages for production pipelines:
- High-Fidelity Compression: Better accuracy than classical approaches in low-bit scenarios such as W4A16.
- Rapid Tuning: Optimization converges in hundreds of steps rather than thousands, saving substantial time and compute.
- No Inference Overhead: The quantized model adds no extra latency at inference time.
- Ready to Deploy: Quantized models work directly with compressed-tensors and load straight into vLLM for serving.
How It Works: A Unified Workflow
The integration is built around the AutoRoundModifier in LLM Compressor. With it, users can take any dense model, calibrate it on a small dataset, and export a compressed checkpoint that is ready for production serving.
Quick Start Example
The example below shows how to apply W4A16 quantization to a model such as Qwen/Qwen3-8B using the new workflow.
1. Setup and Calibration
First, load your model and set up a small calibration dataset.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round.calib_dataset import get_dataset
MODEL_ID = "Qwen/Qwen3-8B"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# Prepare 128 samples for calibration
ds = get_dataset(tokenizer=tokenizer, seqlen=2048, nsamples=128)
2. Apply AutoRound Quantization
Define the recipe using the AutoRoundModifier and run the one-shot compression.
from llmcompressor import oneshot
from llmcompressor.modifiers.autoround import AutoRoundModifier
# Define the quantization recipe
recipe = AutoRoundModifier(
targets="Linear",
scheme="W4A16",
ignore=["lm_head"],
iters=200,
)
# Run compression
oneshot(
model=model,
dataset=ds,
recipe=recipe,
max_seq_length=2048,
num_calibration_samples=128,
)
# Save the compressed model and tokenizer
SAVE_DIR = "Qwen3-8B-W4A16-AutoRound"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
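As an optional sanity check (not part of the official workflow; it assumes only that transformers is installed), you can inspect the exported checkpoint's config to confirm it carries a compressed-tensors quantization config:
from transformers import AutoConfig

# The exported config.json should include a quantization_config block
config = AutoConfig.from_pretrained(SAVE_DIR)
print(config.quantization_config)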
3. Serve with vLLM
Once saved, the model can be served efficiently with vLLM. Note that for Intel XPU deployment, the --enforce-eager flag is required.
vllm serve Qwen3-8B-W4A16-AutoRound --dtype=bfloat16 --gpu-memory-utilization 0.8
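Once the server is up, it exposes vLLM's OpenAI-compatible API. The snippet below is a minimal client-side sketch; it assumes the openai Python package is installed and vLLM's default endpoint at http://localhost:8000/v1, and the prompt is purely illustrative.
from openai import OpenAI

# Point the OpenAI client at the local vLLM server (no real API key is needed)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen3-8B-W4A16-AutoRound",
    messages=[{"role": "user", "content": "Explain weight-only quantization in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)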
Future Roadmap
This is only the beginning; Intel plans to significantly extend AutoRound's capabilities in LLM Compressor:
- New Data Types: Addition of FP8, MXFP8, MXFP4, and NVFP4 formats.
- Hardware Scaling: Native support for newer generation hardware such as Intel Data Center GPU codenamed "Crescent Island".
- Advanced Architectures: Broader support for Mixture-of-Experts (MoE) models.
- Automated Mixed-Bit Search: Automated per-layer precision search to strike the right balance between accuracy and runtime efficiency.
Source: Intel
