Skip to content

SmoothQuant

SmoothQuant is a post-training quantization method for large language models.

Overview

SmoothQuant provides:

  • Activation smoothing: Migrates quantization difficulty from activations to weights
  • INT8 inference: Enables efficient INT8 inference without significant accuracy loss
  • Easy integration: Works with existing models

Method

SmoothQuant smooths the activation outliers by migrating the quantization difficulty from activations to weights with a mathematically equivalent transformation.

Y = (Xdiag(s)^-1) · (diag(s)W) = X̂Ŵ

where s is a smoothing factor that balances the quantization difficulty.

Usage

from mindnlp.quant.smooth_quant import SmoothQuantConfig

# Configure SmoothQuant
config = SmoothQuantConfig(
    smooth_alpha=0.5,  # Smoothing factor
    calibration_samples=512
)

# Apply to model
# ... (implementation depends on model type)

Parameters

  • smooth_alpha: Controls how much difficulty is migrated from activations to weights (0.0-1.0)
  • calibration_samples: Number of samples for calibration

References

Notes

  • Requires calibration data for optimal results
  • Works best with transformer models
  • Achieves INT8 quantization with minimal accuracy degradation