QServe Note

🎤 Presentation Notes

Front Page

Hi, professor. My name is RONG Shuo. Today I want to talk about the paper QServe, a system designed for efficient LLM serving using quantization.

Slide: Structure of LLMs

  • Each layer in an LLM is composed of an attention block, a feed-forward network, and normalization steps.
  • Attention is computed as $\mathrm{softmax}\!\left(QK^\top / \sqrt{d_k}\right)V$.

Slide: Attention Mechanism

  • Q, K, and V are just linear projections of the input activations (see the sketch below).
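
A minimal NumPy sketch of one attention head, tying the two slides above together; the shapes and names like W_q are illustrative, not taken from the paper:

```python
# One attention head: linear projections of the activations, then softmax(QK^T / sqrt(d)) V.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)       # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    """X: (seq_len, d_model); W_*: (d_model, d_head)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v           # Q, K, V are linear projections of X
    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)            # scaled dot-product scores
    return softmax(scores, axis=-1) @ V           # softmax(QK^T / sqrt(d)) V

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 64))                  # 8 tokens, d_model = 64
W_q, W_k, W_v = (rng.standard_normal((64, 16)) * 0.1 for _ in range(3))
print(attention_head(X, W_q, W_k, W_v).shape)     # (8, 16)
```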

Slide: Feed-Forward Network in LLaMA

  • Standard FFNs use two linear layers with a ReLU non-linearity.
  • LLaMA improves this with SwiGLU gating.
  • Instead of ReLU, LLaMA uses a gated structure: one branch goes through a linear layer followed by SiLU, another through a plain linear layer, and the two are multiplied element-wise before the down-projection (see the sketch below).
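
A minimal sketch of the SwiGLU FFN next to a plain ReLU FFN; the weight names (W_gate, W_up, W_down) and sizes follow common convention and are illustrative:

```python
# Standard two-layer ReLU FFN vs. LLaMA-style SwiGLU FFN (NumPy).
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))            # SiLU(x) = x * sigmoid(x)

def ffn_relu(x, W1, W2):
    return np.maximum(x @ W1, 0.0) @ W2      # two linear layers with ReLU in between

def ffn_swiglu(x, W_gate, W_up, W_down):
    gate = silu(x @ W_gate)                  # gating branch: linear + SiLU
    up = x @ W_up                            # plain linear branch
    return (gate * up) @ W_down              # element-wise product, then down-projection

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))                          # 4 tokens, hidden size 64
W_gate, W_up = (rng.standard_normal((64, 172)) * 0.1 for _ in range(2))
W_down = rng.standard_normal((172, 64)) * 0.1
print(ffn_swiglu(x, W_gate, W_up, W_down).shape)          # (4, 64)
```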

Slide: Types of Quantization

  • Quantization is the process of mapping values from a high-precision format to a lower-precision one.
  • We can quantize weights like $w_q$, $w_k$, $w_v$, and $w_o$, or the FFN parameters.
  • Activation quantization compresses intermediate results such as attention outputs. To use the INT8 tensor cores, both the activations and the weights must be in INT8 format (see the sketch after this list).
  • KV cache quantization is a special case — compressing stored key/value tensors to reduce memory during long-context inference.
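
A minimal sketch of why both operands must be low-precision to use the integer pipeline: weights and activations are quantized to INT8, the GEMM runs on integers with an INT32 accumulator, and the scales are applied afterwards. Per-tensor symmetric quantization is used here only for simplicity, not as the paper's exact scheme:

```python
# Quantize W and X to INT8, run an integer GEMM, rescale the INT32 result back to float.
import numpy as np

def quant_int8(t):
    scale = np.abs(t).max() / 127.0
    q = np.clip(np.round(t / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 64)).astype(np.float32)    # activations
W = rng.standard_normal((64, 32)).astype(np.float32)    # weights

Xq, sx = quant_int8(X)
Wq, sw = quant_int8(W)
acc = Xq.astype(np.int32) @ Wq.astype(np.int32)         # integer GEMM, INT32 accumulator
Y = acc.astype(np.float32) * (sx * sw)                  # dequantize the result

print("max abs error vs FP32 GEMM:", np.abs(Y - X @ W).max())
```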

Slide: Quantization Methods

  • Quantization can be symmetric or asymmetric depending on how we define zero-points.
  • It can be done after training (PTQ) or integrated into training (QAT). QAT simulates quantization during training, so it usually reaches higher accuracy, but it costs substantial time and compute for retraining.
  • Data types: integer formats (INT8, INT4) or floating-point (FP8, FP4).
  • Granularity matters: per-tensor is coarsest, per-channel is finer, and per-group (a fixed number of elements sharing one scale) is a compromise between accuracy and the overhead of storing extra scales (see the sketch below).
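
A small sketch comparing the three granularities with the same symmetric INT4 quantizer; the group size of 32 and the injected outlier channel are illustrative:

```python
# Symmetric quantization at per-tensor, per-channel, and per-group granularity.
import numpy as np

def sym_quant(t, n_bits=4, axis=None):
    qmax = 2 ** (n_bits - 1) - 1                       # 7 for INT4
    scale = np.abs(t).max(axis=axis, keepdims=axis is not None) / qmax
    q = np.clip(np.round(t / scale), -qmax, qmax)
    return q * scale                                   # return the dequantized ("fake-quantized") tensor

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 128))                    # (out_channels, in_features)
W[3] *= 20.0                                           # one output channel with much larger magnitude

err = lambda W_hat: np.abs(W - W_hat).mean()
per_tensor  = sym_quant(W)                                                 # one scale for everything
per_channel = sym_quant(W, axis=1)                                         # one scale per output channel
per_group   = sym_quant(W.reshape(256, -1, 32), axis=2).reshape(W.shape)   # 32 weights per scale

print("per-tensor :", err(per_tensor))
print("per-channel:", err(per_channel))
print("per-group  :", err(per_group))
```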

Slide: QServe Overview

  • QServe is designed to accelerate LLM inference through quantization.
  • The focus is on reducing the latency caused by attention and GEMM operations, which dominate runtime.

Slide: Performance Analysis

  • On an A100, the FP16 tensor cores peak at 312 TFLOPS, and an FP16 GEMM reaches that peak once the batch size grows to about 156; below that it is memory-bound.
  • W4A16 still runs its GEMMs on the FP16 tensor cores, so its peak performance matches FP16. But the stored weights are 4x smaller, so in the memory-bound small-batch regime throughput is about 4x higher, and it saturates at a batch size of roughly 156/4 ≈ 39.
  • The same reasoning applies to W8A8: 2x the peak performance (INT8 tensor cores) and 2x the small-batch performance (weights are 2x smaller than FP16).
  • From this, the authors conclude that W4A8 can be better: it keeps the 2x INT8 peak while also getting the 4x small-batch benefit of 4-bit weights (see the roofline sketch below).
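
A rough roofline sketch of this argument, assuming the GEMM's memory traffic is dominated by the weights and A100-like figures (312 TFLOPS FP16, 624 TOPS INT8, about 2 TB/s bandwidth); these numbers are assumptions for illustration, not the paper's measurements:

```python
# Attainable throughput = min(compute peak, bandwidth * arithmetic intensity).
PEAK = {"W16A16": 312e12, "W8A8": 624e12, "W4A16": 312e12, "W4A8": 624e12}   # ops/s
BYTES_PER_WEIGHT = {"W16A16": 2.0, "W8A8": 1.0, "W4A16": 0.5, "W4A8": 0.5}
BANDWIDTH = 2e12                                                             # bytes/s

def attainable_tops(scheme, batch):
    # Each weight element is reused for ~2*batch operations (one multiply-add per token),
    # while moving BYTES_PER_WEIGHT bytes, so intensity = 2*batch / bytes_per_weight.
    intensity = 2 * batch / BYTES_PER_WEIGHT[scheme]
    return min(PEAK[scheme], BANDWIDTH * intensity)

for scheme in PEAK:
    # batch size at which the scheme becomes compute-bound (the "knee" of the roofline)
    knee = PEAK[scheme] * BYTES_PER_WEIGHT[scheme] / (2 * BANDWIDTH)
    print(f"{scheme}: small-batch(8) throughput {attainable_tops(scheme, 8) / 1e12:6.1f} TOPS, "
          f"saturates around batch {knee:5.1f}")
```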

Slide: W4A16 vs W8A8 vs W4A4

  • W4A16 saves bandwidth but requires dequantization, losing compute efficiency.
  • W8A8 is simpler and doubles effective bandwidth compared to FP16.
  • W4A4 struggles because GPUs don’t natively support INT4 tensor cores — extra dequantization on CUDA cores slows everything down.

Slide: QoQ Method

  • QoQ stands for quattuor-octo-quattuor (4-8-4 in Latin): 4-bit weights, 8-bit activations, 4-bit KV cache. Its core is a progressive, two-level weight quantization.
  • First, the weights are symmetrically quantized to INT8.
  • Then a second quantization step brings them down to INT4.
  • Why two levels? Because the GEMM runs on INT8 tensor cores, the INT4 weights only need a cheap INT4-to-INT8 dequantization, so INT4 can "piggyback" on INT8 without losing hardware support (see the sketch below).
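
A minimal sketch of a two-level ("progressive") weight quantization in this spirit; the group size, scale handling, and clipping details are illustrative, not the paper's exact recipe:

```python
# Level 1: per-output-channel symmetric INT8. Level 2: per-group symmetric INT4 on top of
# the INT8 values, so the INT4 weights can be dequantized back toward INT8 cheaply at runtime.
import numpy as np

GROUP = 32

def two_level_quant(W):
    # level 1: per-output-channel symmetric INT8 with a floating-point scale
    s1 = np.abs(W).max(axis=1, keepdims=True) / 127.0
    w8 = np.clip(np.round(W / s1), -127, 127)
    # level 2: per-group symmetric INT4 over the INT8 values, with an integer group scale
    g = w8.reshape(W.shape[0], -1, GROUP)
    s2 = np.maximum(np.ceil(np.abs(g).max(axis=2, keepdims=True) / 7.0), 1.0)
    w4 = np.clip(np.round(g / s2), -7, 7)
    return w4, s2, s1

def two_level_dequant(w4, s2, s1, shape):
    w8_hat = w4 * s2                          # cheap integer INT4 -> (approximately) INT8 step
    return w8_hat.reshape(shape) * s1         # then the per-channel scale back to float

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128)).astype(np.float32)
w4, s2, s1 = two_level_quant(W)
print("mean abs error:", np.abs(W - two_level_dequant(w4, s2, s1, W.shape)).mean())
```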

Slide: Smooth Attention

  • Problem: the Key matrices have outlier channels that break low-bit quantization.
  • Solution: scale the Keys down channel-wise with a diagonal matrix $\Lambda$, and compensate on the Query side so that $QK^\top$ is mathematically unchanged.
  • This effectively migrates the outliers into the Q/K projection weights, where quantization is easier.
  • The result: fewer extreme values in the activations and more stable quantization (see the sketch below).
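
A minimal sketch of the idea: divide outlier Key channels by a per-channel scale and multiply the Queries by the same scale, so the attention scores are unchanged while the Keys become easy to quantize. The exponent α = 0.5 and the toy data are illustrative assumptions:

```python
# Per-channel smoothing of Keys, compensated on the Query side.
import numpy as np

rng = np.random.default_rng(0)
Q = rng.standard_normal((16, 64))
K = rng.standard_normal((16, 64))
K[:, 5] *= 30.0                               # an outlier channel in the Keys

alpha = 0.5
lam = np.abs(K).max(axis=0) ** alpha          # per-channel scale derived from Key magnitudes
Q_s, K_s = Q * lam, K / lam                   # in practice the scales are fused into the projection weights

print("scores unchanged:", np.allclose(Q @ K.T, Q_s @ K_s.T))
print("Key max-abs before/after:", np.abs(K).max(), np.abs(K_s).max())
```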

Slide: Other General Optimization

  • Optimizations also happen at the block level.
  • They rotate the block's input features before quantization to spread the variance (and the outliers) across channels.
  • The rotation has to be orthogonal: it preserves norms, so it is compatible with RMSNorm and can be fused into the surrounding weights. On the block outputs they apply channel-wise scaling (smoothing) instead of rotation (see the sketch below).
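
A minimal sketch of why rotation helps: an orthogonal rotation spreads an outlier channel over all channels, shrinking the dynamic range the quantizer has to cover. Methods of this kind typically use structured rotations such as Hadamard matrices; the random orthogonal matrix here is only for illustration:

```python
# Rotate activations with an orthogonal matrix before quantization.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((512, 64))
X[:, 7] *= 40.0                                        # one outlier channel

rot, _ = np.linalg.qr(rng.standard_normal((64, 64)))   # random orthogonal matrix (rot @ rot.T = I)
X_rot = X @ rot                                        # rotate features; norms are preserved

def int8_err(t):
    scale = np.abs(t).max() / 127.0                    # per-tensor symmetric INT8
    return np.abs(np.round(t / scale) * scale - t).mean()

print("max/mean ratio before:", np.abs(X).max() / np.abs(X).mean())
print("max/mean ratio after :", np.abs(X_rot).max() / np.abs(X_rot).mean())
print("INT8 error before/after:", int8_err(X), int8_err(X_rot))
```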

Slide: Channel Reordering and Clipping

  • Another trick: reorder weight channels based on activation salience.
  • Grouping channels of similar magnitude makes each quantization group more balanced, so a single large channel no longer dominates the group's scale.
  • Finally, they tune the clipping thresholds by minimizing the reconstruction error between the original and the quantized outputs (see the sketch below).
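
A minimal sketch of both tricks; the salience metric (mean absolute activation per input channel), the group size, and the clip-ratio grid are illustrative choices, not the paper's:

```python
# 1) Reorder input channels by activation salience, 2) search a clipping ratio that
#    minimizes the output reconstruction error of the group-quantized weights.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 128))            # calibration activations
W = rng.standard_normal((128, 64))             # weights: (in_features, out_features)
GROUP = 32

def group_quant(W, clip_ratio=1.0, n_bits=4):
    qmax = 2 ** (n_bits - 1) - 1
    g = W.reshape(-1, GROUP, W.shape[1])                           # groups of input channels
    scale = np.abs(g).max(axis=1, keepdims=True) * clip_ratio / qmax
    return (np.clip(np.round(g / scale), -qmax, qmax) * scale).reshape(W.shape)

# reorder input channels by activation salience; apply the same permutation to X and W
salience = np.abs(X).mean(axis=0)
order = np.argsort(salience)
X_r, W_r = X[:, order], W[order, :]

# pick the clipping ratio that minimizes output reconstruction error on the calibration data
ref = X_r @ W_r
best = min(np.linspace(0.5, 1.0, 11),
           key=lambda r: np.mean((X_r @ group_quant(W_r, r) - ref) ** 2))
print("best clip ratio:", float(best))
print("output MSE:", np.mean((X_r @ group_quant(W_r, best) - ref) ** 2))
```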

Slide: QServe System

  • In practice, QServe uses W4A8.
  • The computation happens on the INT8 tensor cores.
  • The result is accumulated in INT32, then dequantized to FP16.
  • Since the activations must be requantized to INT8 for the next GEMM, the quantization is fused with layer normalization and the FFN to avoid extra overhead (see the sketch below).
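
A minimal NumPy sketch of that dataflow; the per-group INT4-to-INT8 step from the QoQ slide is omitted and the scales are simplistic, so this only shows the shape of the pipeline, not the kernel:

```python
# INT8 activations x (dequantized) low-bit weights -> INT32 accumulation -> FP16 dequant ->
# requantization of the output so the next GEMM again gets INT8 inputs.
import numpy as np

def quant_sym(t, n_bits, axis=None):
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(t).max(axis=axis, keepdims=axis is not None) / qmax
    return np.clip(np.round(t / scale), -qmax, qmax).astype(np.int32), scale

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 64)).astype(np.float32)
W = rng.standard_normal((64, 32)).astype(np.float32)

w4, sw = quant_sym(W, 4, axis=0)            # offline: weights to INT4 (per-output-channel scale here)
x8, sx = quant_sym(X, 8)                    # runtime: activations to INT8 (per-tensor scale here)

w8 = w4                                     # INT4 -> INT8 operands (register-level step; group scale omitted)
acc_i32 = x8 @ w8                           # integer GEMM, accumulated in INT32
y_fp16 = (acc_i32.astype(np.float32) * (sx * sw)).astype(np.float16)   # dequantization epilogue

# requantization for the next GEMM, which in the real system is fused into LayerNorm / FFN kernels
y8, sy = quant_sym(y_fp16.astype(np.float32), 8)
print("requantized INT8 range:", y8.min(), y8.max())
print("max abs error vs FP32:", np.abs(y_fp16.astype(np.float32) - X @ W).max())
```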

Slide: Computation Aware Weight Reorder

  • A critical issue is why ldmatrix cannot be used directly for W4A8.
  • Normally, ldmatrix loads data cooperatively into registers, already reordered into the layout the Tensor Core expects.
  • But with 4-bit weights packed two per byte, the alignment breaks: threads end up holding the wrong fragments.
  • Fixing that at runtime means extra pointer arithmetic and shuffling on the CUDA cores, which is much slower; QServe instead reorders the weights offline so they already match the compute layout.
  • In short: W8A8 works perfectly with ldmatrix, but W4A8 doesn't, because the storage granularity doesn't match the compute granularity (see the packing illustration below).
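
A pure-Python illustration of the packing problem (not the CUDA kernel): two 4-bit weights share one byte, so each load pays shifts, masks, and sign extension before the values line up with the 8-bit layout the tensor core path expects:

```python
# Pack sixteen signed 4-bit weights into eight bytes, then pay the unpacking cost.
import numpy as np

rng = np.random.default_rng(0)
w4 = rng.integers(-8, 8, size=16)                        # sixteen INT4 weights

# pack two signed 4-bit values per byte (low nibble = even index, high nibble = odd index)
packed = ((w4[0::2] & 0xF) | ((w4[1::2] & 0xF) << 4)).astype(np.uint8)
print("stored bytes:", len(packed), "for", len(w4), "weights")

def unpack(byte):
    # shifts, masks, and 4-bit two's-complement sign extension per weight
    lo, hi = byte & 0xF, (byte >> 4) & 0xF
    sign = lambda v: v - 16 if v >= 8 else v
    return sign(int(lo)), sign(int(hi))

unpacked = [v for b in packed for v in unpack(b)]
print("round-trip ok:", np.array_equal(np.array(unpacked), w4))
```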

Slide: Hardware Differences (QServe vs Blackwell)

  • On A100/H100, W4A8 requires software tricks: weight reorder, fused dequant, careful register use.
  • On Blackwell, FP4/FP6/FP8 are first-class citizens — hardware handles quantization inside Tensor Cores.
  • That means no more misalignment issues, no manual dequant, and near-peak efficiency.
  • The result is simpler kernels with higher performance.

![[attachments/Pasted image 20250930094953.png]]

![[attachments/Pasted image 20250930095119.png]]

Citation: SpinQuant

![[attachments/Pasted image 20250930101636.png]] ![[attachments/Pasted image 20250930101627.png]]

![[attachments/Pasted image 20250930101702.png]]