Model Optimizer Note 1

Bug 1: During calibration, the weight quantizer never gets its statistics

For weight-only quantization, calibrators DO run during quantize(), but they DON'T run the weight quantizer's forward pass through the model. During calibration the linear layer uses the raw weight parameter, precisely to avoid quantizing the weight: what we want to capture during calibration is the real activation distribution, not a distribution produced under already-quantized weights.

def residue_max_calibrate(
    model: nn.Module,
    forward_loop: ForwardLoop | None = None,
    distributed_sync: bool = True,
    block_size: int | None = None,
    residue_ratio: float | None = None,
    track_amax: bool | None = None,
    **kwargs,
):

    # TODO: this replacement should not be done here. and the apply algo params
    create_and_replace_residue_linear_on_the_fly(model)

    # Ensure algorithm parameters reach calibrators before stats collection
    _apply_algo_params_to_calibrators(
        model,
        residue_ratio=residue_ratio,
        block_size=block_size,
        track_amax=track_amax,
    )

    enable_stats_collection(model)

    if forward_loop is None:
        weight_only_quantize(model)
    else:
        forward_loop(model)

        # TODO: Investigate why enable_stats_collection() doesn't properly enable weight calibrators
        # Root cause analysis needed:
        # 1. Check if ModelOpt's TensorQuantizer.forward() calls calibrator.collect() for weight quantizers
        # 2. Verify if weight quantizers need special handling vs activation quantizers
        # 3. Determine if this is expected ModelOpt behavior or a bug in our integration
        # 4. Consider if we need to override TensorQuantizer.forward() in our custom quantizers
        # For now, manually collect weight statistics as a workaround

        # Manually collect weight statistics after forward loop
        # Weight calibrators don't get triggered during forward pass, only activation calibrators do
        for name, module in model.named_modules():
            if isinstance(module, ResidueQuantLinear):
                if hasattr(module, 'weight') and hasattr(module, 'weight_quantizer'):
                    wq = module.weight_quantizer
                    if hasattr(wq, '_calibrator') and wq._calibrator is not None:
                        # Manually trigger weight calibrator to collect stats from weight tensor
                        wq._calibrator.collect(module.weight.data)
                        if debug and name == "model.layers.0.self_attn.q_proj":
                            print(f"[RESIDUE_DEBUG][ALGO]   Collected weight stats for {name}")

    finish_stats_collection(model)

Here we can see a critical thing: if forward_loop is not None, we skip weight_only_quantize entirely. Let's see what's inside weight_only_quantize.

def weight_only_quantize(model: nn.Module):
    """Just quantize the weights of the model."""
    seen_modules = set()
    for name, module in model.named_modules():
        if module in seen_modules:
            continue
        for weight_name in weight_attr_names(module):
            with enable_weight_access_and_writeback(module, model):
                weight_quantizer = getattr(
                    module, quantizer_attr_names(weight_name).weight_quantizer
                )
                # Critical: this triggers TensorQuantizer.__call__() -> forward(), which in turn triggers calibrator.collect()
                weight_quantizer(getattr(module, weight_name))
        seen_modules.add(module)

In conclusion: if weight_only_quantize runs, it triggers a weight_quantizer forward pass, which triggers calibrator.collect(), and therefore the weight statistics get stored in the quantizer.

How to fix?

If we look at other calibration functions, for example mse_calibrate, it is perfectly normal for one calibration routine to call another one internally. So we can split our calibration into multiple stages: collect the activation statistics in the first stage, then collect the weight statistics in a second stage, as sketched below.
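A minimal sketch of that two-stage idea, reusing the helpers that already appear in the snippets above (treat this as a sketch of the idea, not the final implementation):

def two_stage_calibrate(model: nn.Module, forward_loop: ForwardLoop | None = None):
    enable_stats_collection(model)
    # Stage 1: activation statistics from the real data distribution
    if forward_loop is not None:
        forward_loop(model)
    # Stage 2: weight statistics via direct weight_quantizer calls,
    # independent of whether a forward_loop was provided
    weight_only_quantize(model)
    finish_stats_collection(model)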

How Does AWQ Handle Its Logic in ModelOpt?

Reading code written by previous contributors is an important way to learn how the library is organized, and AWQ is a pretty good example. The core logic of these algorithms lives in modelopt/torch/quantization/model_calib.py. If you look for the calibration methods themselves, you will find only four core calibrator patterns under modelopt/torch/quantization/calib, and they deserve further exploration.

The core of today's note is to study how AWQ (or any other specialized calibration) is implemented. Looking into the awq_lite function shows the core idea of how to build a new calibration on top of max_calibrate.

The first important thing AWQ does is wrap the user's forward_loop with extra statistics collection, and awq_lite calls this wrapped forward_loop itself. It then calls max_calibrate(model, lambda model: None) to reuse the existing logic, such as amax sync and distributed amax sync for the input_quantizer. The lambda model: None is needed because the real forward loop has already been run through the wrapper.
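A hypothetical sketch of that wrapping pattern (this is not the actual awq_lite code; the hook and cache details are only indicated in comments):

def awq_style_calibrate(model: nn.Module, forward_loop: ForwardLoop):
    def wrapped_forward_loop(model):
        # Caching/search phases would live here: run the user's data while
        # hooks on each linear record per-channel magnitudes and search alpha.
        forward_loop(model)

    # Run the wrapped loop once to collect the AWQ-specific statistics ...
    wrapped_forward_loop(model)
    # ... then reuse the plain max-calibration plumbing (amax sync, distributed
    # amax sync for the input_quantizer) with a no-op forward loop.
    max_calibrate(model, lambda model: None)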

awq_lite uses a two-phase forward loop. The first phase is cache mode, which collects the per-channel activation/weight magnitudes and stores them in act_scale and weight_scale. The second phase is search mode, which finds the most suitable alpha via grid search based on the previously computed act_scale and weight_scale.

In the search phase, the core is a grid search driven by an MSE loss. There are only a handful of alpha candidates, e.g. a list from 0 to 1 with step 0.1, and one alpha is shared by a whole block. So the search is simply: compare the loss induced by each alpha and pick the best one, as in the sketch below.
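A minimal sketch of that grid search (the scaling formula s = act_scale**alpha / weight_scale**(1 - alpha) follows the AWQ paper; the names and the fake_quantize helper here are illustrative, not the awq_lite implementation):

import torch

def search_alpha(weight, x, act_scale, weight_scale, fake_quantize):
    """Pick the alpha whose scaled, fake-quantized weight best matches the fp output."""
    ref = x @ weight.t()                        # reference full-precision output
    best_alpha, best_loss = 0.0, float("inf")
    for alpha in [i / 10 for i in range(11)]:   # 0.0, 0.1, ..., 1.0
        scale = act_scale.pow(alpha) / weight_scale.pow(1 - alpha)  # per input channel
        w_q = fake_quantize(weight * scale)     # fold the scale into the weight, then quantize
        out = (x / scale) @ w_q.t()             # divide the activation by the same scale
        loss = (out - ref).pow(2).mean().item()
        if loss < best_loss:
            best_alpha, best_loss = alpha, loss
    return best_alpha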

Which logic should be implemented in model_calib.py and which in calib/? Logic implemented in a calibrator class:

  • is attached to a TensorQuantizer
  • is called via quantizer._calibrator.collect(x)
  • has per-quantizer scope (input quantizer, output quantizer, weight quantizer)
  • is invoked automatically while enable_stats_collection() is active
  • cannot modify the weights

Logic implemented in model_calib.py:

  • is attached to the entire model
  • is driven by a customized forward loop
  • has model-wide or per-layer scope
  • is invoked explicitly around the forward pass
  • can output the fully configured model

Summary: the primary purpose of a calibrator class is to produce an amax value (see the sketch after this list). It will:

  • collect statistics via collect() during forward passes (it is attached to a TensorQuantizer)
  • compute amax via compute_amax()
  • load the amax into the attached quantizer via load_calib_amax()
  • stay muted if the quantizer is dynamic (e.g. a dynamic input quantizer)
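A minimal sketch of that calibrator contract (a simple max-style calibrator; the real reference is MaxCalibrator under modelopt/torch/quantization/calib, and this is only an illustration):

import torch

class SimpleMaxCalibrator:
    """Tracks the running abs-max of every tensor passed to collect()."""

    def __init__(self, axis: int | None = None):
        self._axis = axis      # None => per-tensor amax, int => keep that axis
        self._amax = None

    def collect(self, x: torch.Tensor):
        # Called by the attached TensorQuantizer while stats collection is enabled.
        if self._axis is None:
            local_amax = x.abs().amax()
        else:
            reduce_dims = [d for d in range(x.dim()) if d != self._axis]
            local_amax = x.abs().amax(dim=reduce_dims)
        self._amax = local_amax if self._amax is None else torch.maximum(self._amax, local_amax)

    def compute_amax(self) -> torch.Tensor | None:
        # The returned value is what load_calib_amax() pushes into the quantizer.
        return self._amax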

Trick: consistent axis handling

# Use ModelOpt's utilities for consistent axis handling
reduce_axis = quant_utils.convert_quantization_axis_to_reduce_axis(x, self._axis)
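And a sketch of how the resulting reduce_axis might then be consumed (plain torch here, assuming None means "reduce over everything", i.e. per-tensor):

if reduce_axis is None:
    local_amax = x.abs().amax()                            # per-tensor amax
else:
    local_amax = x.abs().amax(dim=reduce_axis, keepdim=True)  # per-channel amax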

Mechanism: if we use a forward_loop during calibration

If we pass a forward_loop (instead of None) to mtq.quantize, then looking at max_calibrate we can see that:

  • the weight quantizer will not collect statistics during the forward pass,
  • because the weight quantizer never shows up in the forward pass,
  • and therefore weight_quantizer.amax will be None.

Summary: we have to set the weight quantizer's _if_calib=True for its calibrator to collect statistics. Without it, the weight quantizer ends up with no amax.
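A minimal sketch of that workaround, mirroring the manual collection loop at the top of this note (the _if_calib and _calibrator attributes are TensorQuantizer internals, so treat this as an assumption about the current implementation):

for name, module in model.named_modules():
    wq = getattr(module, "weight_quantizer", None)
    if wq is not None and getattr(wq, "_calibrator", None) is not None and hasattr(module, "weight"):
        wq._if_calib = True          # allow the quantizer forward to collect stats
        wq(module.weight)            # direct call on the weight gathers its statistics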

What if we pass forward_loop as None?

If we check max_calibrate in model_calib.py, we will find that weight_only_quantize() is called in that case:

  • the weight quantizer will be calibrated, through a direct call on the weight rather than a forward pass.

The smoothing in AWQ?

In AWQ, we end up with a scale for each channel, and this scale smooths the activation before quantization. Because the scale has been folded in, AWQ resets the activation amax so that the quantization ranges match the scaled activation.
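In plain terms, the identity behind the smoothing looks like this (illustrative comments only, not ModelOpt code):

# s is a per-input-channel scale:
#   y = x @ W.T  ==  (x / s) @ (W * s).T
# After folding s into W, the layer effectively sees x / s instead of x,
# so the activation amax collected on x no longer matches and must be reset.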

Different Quant Linears in ModelOpt

There are two relevant phases: mtq.quantize() and mtq.compress().

QuantLinear (fake quantization)

QuantLinear is used for calibration/training, i.e. during mtq.quantize(), the calibration phase. It simulates the precision loss but keeps the weights in high precision, and it stays trainable.

RealQuantLinear

This one actually compresses the weights to lower precision. It is used after mtq.compress(), in the deployment/inference phase. The weight is stored as a QTensorWrapper, and optimized GEMM kernels are used. The weights are not trainable because they are compressed.

SVDQuantLinear (specialized)

This is advanced quantization using SVD-based decomposition plus LoRA, specifically for the SVDQuant algorithm. It maintains LoRA matrices to handle outliers, which makes aggressive quantization more accurate.

Replacement of the Linear

The replacement of the linear happens in two phases: mtq.quantize() and mtq.compress(). mtq.quantize() replaces nn.Linear with QuantLinear (or SVDQuantLinear when using SVDQuant). mtq.compress(), implemented in /quantization/compress.py, changes QuantLinear into RealQuantLinear and sets fake_quant = False for the compressed layers.

# In compress.py
def compress_convert(model, config: CompressConfig, ...):
    # Register RealQuantLinear as replacement
    for _, module in model.named_modules():
        if is_quantized_linear(module):
            RealQuantModuleRegistry.register({type(module): module.__class__.__name__})(
                RealQuantLinear
            )
    
    # Replace modules
    _replace_quant_module(model, registry=RealQuantModuleRegistry)
    
    # Set fake_quant based on compression config
    set_quantizer_attribute(
        model, "*weight_quantizer*", {"fake_quant": not compress_cfg["default"]}
    )

The fake_quant parameter controls which quantization behavior is used. If fake_quant is True, it will:

  • simulate quantization (QDQ)
  • keep the weights in high precision
  • be used during calibration and training

If fake_quant is False, it will:

  • activate real quantization
  • compress the weights
  • be used for inference and deployment

Note: compress and export use different code paths. However, export can handle weights that are still uncompressed after mtq.quantize() (which does not perform real compression). So if you run mtq.quantize() and then export, the export step takes care of the compression logic.
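The two flows implied by this note, roughly (mtq is modelopt.torch.quantization; the config name and the export step are placeholders, use whichever config and exporter actually apply):

import modelopt.torch.quantization as mtq

# Flow A: quantize (fake quant + calibration), then compress to RealQuantLinear
model = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop)
model = mtq.compress(model)

# Flow B: quantize only, then export; the exporter handles the weight compression
model = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop)
# ... export step here, which compresses the still-uncompressed weights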

How to register a Customized Linear?

We should use mtq.register() to register our own linear together with its quantized version. But this only maps a specific custom linear class to the specified quant linear. If we want to replace _QuantLinear itself, we should use the decorator @QuantModuleRegistry.register({nn.Linear: "nn.Linear"}) to replace the default _QuantLinear. For example:

# Step 1: Define your custom linear module
class MyCustomLinear(nn.Linear):
    """Your custom linear implementation"""
    def __init__(self, in_features, out_features, bias=True, custom_param=1.0):
        super().__init__(in_features, out_features, bias)
        self.custom_param = custom_param

# Step 2: Define the quantized version
class QuantMyCustomLinear(_QuantLinear, MyCustomLinear):
    """Quantized version of MyCustomLinear"""
    
    def _setup(self):
        """REQUIRED: Initialize quantizers"""
        self.input_quantizer = TensorQuantizer(_QuantLinear.default_quant_desc_input)
        self.weight_quantizer = TensorQuantizer(_QuantLinear.default_quant_desc_weight)
        self.output_quantizer = TensorQuantizer(_QuantLinear.default_quant_desc_output)
        self.output_quantizer.disable()
    
    def forward(self, x):  # Most of the time, we can just use the forward inherited from the base class
        # Quantize inputs and weights
        x = self.input_quantizer(x)
        weight = self.weight_quantizer(self.weight)
        # Use quantized tensors
        output = nn.functional.linear(x, weight, self.bias)
        return self.output_quantizer(output)

# Step 3: Register the mapping
mtq.register(original_cls=MyCustomLinear, quantized_cls=QuantMyCustomLinear)

# Step 4: Use it
model = YourModel()  # Contains MyCustomLinear instances
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
# All MyCustomLinear → QuantMyCustomLinear automatically!

The default _QuantLinear lives in quant_linear.py, and it does the following:

@QuantModuleRegistry.register({nn.Linear: "nn.Linear"})
class _QuantLinear(QuantLinearConvBase):
    """Quantized base class for nn.Linear type classes."""

    default_quant_desc_weight = tensor_quant.QUANT_DESC_8BIT_LINEAR_WEIGHT_PER_ROW

    @staticmethod
    def quantized_linear_fn(package, func_name, self, input, weight, *args, **kwargs):
        """Quantized version of a generic linear functional."""
        output = getattr(package, func_name)(
            self.input_quantizer(input),
            self.weight_quantizer(weight),
            *args,
            **kwargs,
        )
        return self.output_quantizer(output)

There are three base classes for different kinds of layers. QuantInputBase: input quantization only. It quantizes the input, not the weight.

class QuantInputBase(QuantModule):
    """Base class for modules where the input is quantized."""
    
    def forward(self, input, *args, **kwargs):
        """Quantize the input before calling the original forward method."""
        input = self.input_quantizer(input)  # ← Quantize input
        output = super().forward(input, *args, **kwargs)  # ← Call original nn.Module forward
        if isinstance(output, tuple):
            return (self.output_quantizer(output[0]), *output[1:])
        return self.output_quantizer(output)

Used for: BatchNorm, InstanceNorm, Pooling Layers

QuantLinearConvBase: input + weight quantization. It quantizes both the input and the weight automatically.

class QuantLinearConvBase(QuantInputBase):
    """Base class for quantized linear modules.
    
    Quantized linear modules are modules where both the input and the weight are quantized.
    """
    
    def forward(self, input, *args, **kwargs):
        """Quantize the input and the weight before calling the original forward method."""
        with self.quantize_weight():  # ← Enables weight quantization
            return super().forward(input, *args, **kwargs)
            # This calls QuantInputBase.forward() which quantizes input
            # Then calls nn.Linear.forward() which accesses self.weight

_LegacyQuantLinearConvBaseMixin: legacy support. It provides __init__ for backward compatibility.

So basically, for a custom quantized linear we can usually just rely on the forward method inherited from QuantLinearConvBase, as in the sketch below.
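For instance (reusing MyCustomLinear and the quantizer setup from the earlier example; the class name here is only an illustration of "no forward override needed"):

class QuantMyCustomLinearLean(_QuantLinear, MyCustomLinear):
    """Relies on QuantLinearConvBase.forward for input + weight quantization."""

    def _setup(self):
        self.input_quantizer = TensorQuantizer(_QuantLinear.default_quant_desc_input)
        self.weight_quantizer = TensorQuantizer(_QuantLinear.default_quant_desc_weight)
        self.output_quantizer = TensorQuantizer(_QuantLinear.default_quant_desc_output)
        self.output_quantizer.disable()
    # No forward override: the inherited forward quantizes the input and the weight
    # before calling MyCustomLinear / nn.Linear forward.

mtq.register(original_cls=MyCustomLinear, quantized_cls=QuantMyCustomLinearLean)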

Note: this applies only to QuantLinear; RealQuantLinear is supported through a different mechanism.

Why is the SVDQuant registration so different, and how is it used?

SVDQuant is framework-agnostic, and it adds LoRA to the linear. SVDQuant is handled in separate stages:

  • Step 1: base quantization. It uses the default QuantModuleRegistry to change the normal nn.Linear into QuantLinear.
  • Step 2: an algorithm-specific enhancement. It uses create_and_replace_svdquant_linear_on_the_fly to change QuantLinear into SVDQuantLinear and add the LoRA adapters. This is quite different: it is not so much a replacement as an addition. The registration code:
SVDQuantModuleRegistry = _DMRegistryCls("SVDQuant")  # The prefix for the generated class names is defined here
def create_and_replace_svdquant_linear_on_the_fly(model):
    # ... registration loop ...
    
    print("Replacing instances of QuantLinear with SVDQuantLinear.")
    _replace_quant_module(
        model, 
        version=ModeloptStateManager(model).state_version, 
        registry=SVDQuantModuleRegistry  # ← Use SVDQuant registry!
    )

This generates new classes based on the classes already registered. For example, there are different QuantLinear variants such as _QuantLinear and QuantMegatronLinear; this registration generates the new classes SVDQuantQuantLinear and SVDQuantQuantMegatronLinear. In the end, for the SVDQuant case, we get:

class SVDQuantQuantMegatronLinear(SVDQuantLinear, DynamicModule, QuantMegatronLinear):
    pass

# MRO:
# SVDQuantQuantMegatronLinear
#   → SVDQuantLinear         (SVD features: LoRA residual)
#     → QuantLinearConvBase   (weight quantization)
#       → QuantInputBase      (input quantization)  
#         → QuantModule        (base quant module)
#   → DynamicModule           (dynamic module framework)
#   → QuantMegatronLinear     (Megatron-specific logic!) ← PRESERVED!
#     → _ParallelLinear       (parallel linear base)
#       → QuantModule
#     → MegatronColumnParallelLinear  (original Megatron)

And if a method that only exists on the Megatron side, such as modelopt_post_restore, is called, Python walks the MRO until it finds the method on QuantMegatronLinear.
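A quick way to see that lookup order for yourself (class name taken from the MRO comment above; purely illustrative):

# Prints the resolution order Python walks when looking up a method such as
# modelopt_post_restore on the generated class.
print([cls.__name__ for cls in SVDQuantQuantMegatronLinear.__mro__])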

How to load an SVDQuant model?

In conversion.py, we have restore functions for the different model modes, and here is the one specific to the SVDQuant model.

def restore_svdquant_model(model: nn.Module, config: QuantizeConfig, metadata: MetadataDict):
    """Restore the svdquant states from the given state dict."""
    create_and_replace_svdquant_linear_on_the_fly(model)
    restore_quantizer_state(model, config, metadata)
    return model
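For reference, this restore path is normally reached through the generic ModelOpt restore entry point rather than called directly (mto is modelopt.torch.opt; the checkpoint path is a placeholder):

import modelopt.torch.opt as mto

# Re-creates the SVDQuantLinear modules and reloads the quantizer state
# saved from a previously quantized model.
model = mto.restore(model, "svdquant_checkpoint.pth")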