Intel AutoRound Joins LLM Compressor: Streamlining Inference with Advanced Quantization

Intel AutoRound now integrates with LLM Compressor to optimize LLM inference. Learn how to quantize models with high accuracy and serve them on vLLM.

Streamlining LLM Inference: AutoRound Joins LLM Compressor

Serving Large Language Models (LLMs) efficiently without compromise is always a challenge: a trade-off between speed and accuracy. Intel aims to close that gap by embedding its quantization algorithm, AutoRound, directly into LLM Compressor. The integration gives developers a seamless workflow to quantize models with minimal accuracy loss and serve them directly with vLLM.

What Is AutoRound?

AutoRound is an advanced Post-Training Quantization (PTQ) algorithm tailored to generative AI models, including LLMs and VLMs. Where standard quantization techniques are known to degrade model quality, AutoRound introduces trainable parameters to optimize the rounding of weights.

It works through the decoder layers sequentially, adjusting rounding offsets and clipping ranges with signed gradient descent. The result is a quantized model that retains high accuracy even at low bit-widths. Although developed by Intel, it is hardware agnostic and runs on Intel Xeon processors, Gaudi accelerators, Arc GPUs, and CUDA-based devices.
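
To make the mechanism concrete, the following is a minimal conceptual sketch, not the actual AutoRound implementation: a learnable per-weight rounding offset is trained with signed gradient descent so that the quantized layer reproduces the full-precision output. The toy dimensions, the MSE loss, and names such as fake_quant and offset are illustrative assumptions.

import torch

def round_ste(x):
    # Straight-through estimator: round in the forward pass, identity gradient in backward
    return (x.round() - x).detach() + x

def fake_quant(w, scale, offset, num_bits=4):
    # Quantize with a learnable rounding offset, then dequantize back to float
    qmax = 2 ** (num_bits - 1) - 1
    q = torch.clamp(round_ste(w / scale + offset), -qmax - 1, qmax)
    return q * scale

w = torch.randn(256, 256)                  # one layer's full-precision weights (toy size)
scale = w.abs().max() / 7                  # naive per-tensor scale for 4-bit
offset = torch.zeros_like(w, requires_grad=True)
x = torch.randn(64, 256)                   # calibration activations
ref = x @ w.t()                            # full-precision layer output to match

lr = 0.005
for _ in range(200):                       # plays the role of "iters" in the recipe below
    loss = torch.nn.functional.mse_loss(x @ fake_quant(w, scale, offset).t(), ref)
    loss.backward()
    with torch.no_grad():
        offset -= lr * offset.grad.sign()  # signed gradient descent: step by gradient sign only
        offset.clamp_(-0.5, 0.5)           # keep the rounding perturbation bounded
        offset.grad = None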

Key Benefits of Integration

Bringing AutoRound into the LLM Compressor ecosystem offers several distinct advantages for production pipelines:

  • High-Fidelity Compression: Better accuracy in low-bit scenarios (for example, W4A16) than classical approaches.
  • Rapid Tuning: Optimization converges in hundreds of steps rather than thousands, saving significant time and compute.
  • No Overhead: The resulting model adds no extra latency at inference time.
  • Ready to Deploy: Quantized models work directly with compressed-tensors and load into vLLM for serving.

How It Works: A Unified Workflow

The integration is exposed through the AutoRoundModifier in LLM Compressor. Users can take any dense model, calibrate it with a small dataset, and export a compressed checkpoint that is ready for production serving.

Quick Start Example

The example below shows how to apply W4A16 quantization to a model such as Qwen/Qwen3-8B using the new workflow.

1. Setup and Calibration

First load your model and set up a small calibration dataset.

from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round.calib_dataset import get_dataset

MODEL_ID = "Qwen/Qwen3-8B"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Prepare 128 samples for calibration
ds = get_dataset(tokenizer=tokenizer, seqlen=2048, nsamples=128)

2. Apply AutoRound Quantization

Define the recipe using the AutoRoundModifier and run the one-shot compression.

from llmcompressor import oneshot
from llmcompressor.modifiers.autoround import AutoRoundModifier

# Define the quantization recipe
recipe = AutoRoundModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    iters=200,
)

# Run compression
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=128,
)

# Save the compressed model
SAVE_DIR = "Qwen3-8B-W4A16-AutoRound"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
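
Before moving on to serving, a quick sanity check can be run with the quantized model still in memory. This is a standard transformers generation call, not part of the published recipe; the prompt is just an illustration.

# Optional sanity check: generate a short completion with the quantized model
inputs = tokenizer("Explain weight-only quantization in one sentence.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))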

3. Serve with vLLM

Once saved, the model can be served efficiently. Note that for Intel XPU deployment, the --enforce-eager flag is required.

vllm serve Qwen3-8B-W4A16-AutoRound --dtype=bfloat16 --gpu-memory-utilization 0.8
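
The server exposes an OpenAI-compatible API (on port 8000 by default), so the quantized model can be queried with the standard openai client. The prompt below is only an illustration.

# Query the running vLLM server through its OpenAI-compatible endpoint
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.completions.create(
    model="Qwen3-8B-W4A16-AutoRound",
    prompt="Summarize the benefits of 4-bit weight quantization.",
    max_tokens=64,
)
print(response.choices[0].text)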

Future Roadmap

This is only the beginning; Intel plans to further extend AutoRound's capabilities in LLM Compressor:

  • New Data Types: Support for FP8, MXFP8, MXFP4, and NVFP4 formats.
  • Hardware Scaling: Native support for newer-generation hardware such as the Intel Data Center GPU codenamed "Crescent Island".
  • Advanced Architectures: Broader support for Mixture-of-Experts (MoE) models.
  • Automated Mixed-Bit Search: Automated per-layer precision search to balance accuracy and runtime efficiency.

Source: Intel
