GPTQ Note
Abstract
GPTQ Method Summary
Questions
- The paper says that “As a consequence, the set of unquantized weights F and similarly H_F^{-1} is always the same for all rows.” I don’t quite understand this.
- What is the specific procedure of this quantization? That is, what exactly are the inverse layer Hessian and the weight matrix / block here? I think I can see that the inverse layer Hessian seems to be traversed in a row-major way, while the weight matrix is traversed in a column-major way?
- In the background part, it mentions that they assume the quantization grid for W is fixed before the process, and that individual weights can move freely (as in the works cited there). I don’t quite understand this.
- I still need to read more about the Hessian. The first question is definitely why the Hessian looks like this. The second is why full-precision weights are involved.
- I find these formulas hard to read. What do the $qq$ and $q$ subscripts mean here?
- Cholesky decomposition is said to be more numerically stable than direct matrix inversion. But why?
- Why do small eigenvalues often correspond to noise rather than signal, and why does it say that directions with little statistical support get massive influence?
Summary
- The paper found that, for reasonably large language models, the order in which weights are quantized one by one does not matter much. The insight drawn from this is that any fixed order may perform well, especially on large models.
Dampening Part
In GPTQ quantization, when a weight is quantized, the remaining weights must be updated to compensate for the quantization error.
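As a reference for the update itself, here is a minimal NumPy sketch of that compensation step as I understand it from the paper's algorithm; the `quantize` stand-in and the one-column-at-a-time indexing are simplifications of my own, not the real implementation.

```python
import numpy as np

def quantize(w, scale=0.1):
    """Hypothetical stand-in for rounding weights onto a fixed quantization grid."""
    return np.round(w / scale) * scale

def quantize_column(W, Hinv, q, scale=0.1):
    """Quantize column q of W and spread the error over the not-yet-quantized columns.

    W    : (rows, cols) weight matrix, updated in place
    Hinv : (cols, cols) inverse Hessian of the layer inputs
    q    : index of the column being quantized
    """
    w = W[:, q].copy()
    w_q = quantize(w, scale)
    # Per-row quantization error, scaled by the diagonal entry [H^{-1}]_{qq}
    err = (w - w_q) / Hinv[q, q]
    W[:, q] = w_q
    # Compensation: update the remaining columns using row q of H^{-1}
    W[:, q + 1:] -= np.outer(err, Hinv[q, q + 1:])
    return W
```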
From a matrix point of view, matrix conditioning theory helps explain what can go wrong in this step.
Matrix conditioning theory: for any invertible matrix A: $$ \kappa(A) = \|A\|\,\|A^{-1}\| $$ This is the definition of the condition number of an invertible matrix A.
And since, for the spectral norm, we know: $$ \begin{align} \|A\| &= \sigma_{\max}(A) \\ \|A^{-1}\| &= \sigma_{\max}(A^{-1}) = 1/\sigma_{\min}(A) \\ \kappa(A) &= \|A\|\,\|A^{-1}\| = \sigma_{\max}(A) / \sigma_{\min}(A) \end{align} $$
Alternative form: $$ \kappa(A) = \lambda_{\max}(A) / \lambda_{\min}(A) = \sigma_{\max}(A) / \sigma_{\min}(A) $$ This holds because, for a symmetric positive definite matrix A with eigenpair $Av = \lambda v$: $$ \begin{align} A^T A &= A^2 \\ A^T A v &= A^T(\lambda v) = \lambda A v = \lambda^2 v \\ \lambda_i(A^2) &= \lambda_i(A)^2 \\ \sigma_i(A) &= \sqrt{\lambda_i(A^T A)} = \sqrt{\lambda_i(A)^2} = \lambda_i(A) \end{align} $$
For a system $Ax = b$, if the right-hand side is perturbed by $\delta b$, the solution is perturbed by $\delta x$, and $$ \frac{\|\delta x\|}{\|x\|} \leq \kappa(A)\,\frac{\|\delta b\|}{\|b\|} $$ This means the condition number bounds the relative error amplification.
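A quick NumPy check of these relations on an arbitrary SPD matrix (my own example, not a real layer Hessian):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
A = A @ A.T + 1e-3 * np.eye(50)           # symmetric positive definite

sigma = np.linalg.svd(A, compute_uv=False)
lam = np.linalg.eigvalsh(A)
print(np.isclose(sigma.max() / sigma.min(), lam.max() / lam.min()))  # sigma_i == lambda_i for SPD
print(np.isclose(np.linalg.cond(A), sigma.max() / sigma.min()))      # kappa = sigma_max / sigma_min

# Perturbation bound: ||dx|| / ||x|| <= kappa(A) * ||db|| / ||b||
b = rng.standard_normal(50)
x = np.linalg.solve(A, b)
db = 1e-8 * rng.standard_normal(50)
dx = np.linalg.solve(A, b + db) - x
lhs = np.linalg.norm(dx) / np.linalg.norm(x)
rhs = np.linalg.cond(A) * np.linalg.norm(db) / np.linalg.norm(b)
print(lhs <= rhs)                                                    # bound holds
```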
Ill-conditioning problem
This brings us to the definition of an ill-conditioning problem. A matrix A is ill-conditioned when it has:
- Large condition number $\kappa(A)$
- Small eigenvalues, even some $\lambda_i = 0$
- Near-singularity: $\det(A) \approx 0$
The condition number part needs no further explanation: a large condition number means a large bound on the relative error amplification of the system Ax = b.
So why do small eigenvalues cause huge updates? I still need a better explanation, but presumably it is because everything goes through $H^{-1}$, whose eigenvalues are $1/\lambda_i$: a near-zero $\lambda_i$ makes some entries of $H^{-1}$ enormous, so any error along that direction gets massively amplified. The small experiment below illustrates this.
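A tiny experiment with a made-up 2x2 Hessian (not from a real layer): shrinking one eigenvalue toward zero blows up the entries of $H^{-1}$.

```python
import numpy as np

# Orthogonal eigenbasis; H has eigenvalues 2.0 and eps
Q = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)

for eps in [1.0, 1e-2, 1e-6]:
    H = Q @ np.diag([2.0, eps]) @ Q.T
    Hinv = np.linalg.inv(H)
    # As eps -> 0, entries of H^{-1} grow like 1/eps, so anything computed
    # through H^{-1} (including the compensation updates) amplifies error
    # along that direction.
    print(f"eps={eps:g}  kappa(H)={np.linalg.cond(H):.1e}  max|H^-1|={np.abs(Hinv).max():.1e}")
```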
Dampening
$H_{damped} = H + \lambda I$
Here $\lambda$ is 1% of the average diagonal value of H.
By applying dampening, we shift the eigenvalues from $\lambda_1, \lambda_2, \cdots, \lambda_n$ to $\lambda_1 + \lambda, \lambda_2 + \lambda, \cdots, \lambda_n + \lambda$. Therefore: $$ \begin{align} \kappa(H) &= \lambda_{\max} / \lambda_{\min} \approx \lambda_{\max} / 0 = \infty \\ \kappa(H + \lambda I) &= (\lambda_{\max} + \lambda) / (\lambda_{\min} + \lambda) \end{align} $$ This prevents the problems caused by tiny eigenvalues.
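A quick check of this effect with stand-in calibration data (the 64/32 shapes are arbitrary): when there are fewer calibration samples than input features, H is rank-deficient, and dampening brings the condition number back into a sane range.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 32))      # 64 input features, only 32 calibration samples
H = 2.0 * X @ X.T                      # rank-deficient Hessian: some eigenvalues ~0

lam = 0.01 * np.mean(np.diag(H))       # dampening: 1% of the average diagonal value
H_damped = H + lam * np.eye(H.shape[0])

print(f"kappa(H)           = {np.linalg.cond(H):.3e}")   # astronomically large
print(f"kappa(H + lam * I) = {np.linalg.cond(H_damped):.3e}")
```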
GPTQ uses Cholesky for numerical stability.
$H = LL^T$, where L is lower triangular.
The advantage of the Cholesky Decomposition is:
- More numerically stable than direct matrix inversion.
- Efficient for positive definite matrices
- Well-optimized kernels are available
When we invert a matrix A directly, the condition number of $A^{-1}$ is the same as that of A, and the errors made along the way get amplified by it. With a Cholesky decomposition we instead work with the triangular factor L, which for a symmetric positive definite A satisfies $\kappa(L) = \sqrt{\kappa(A)}$, i.e. L is better conditioned than A itself.
Direct matrix inversion using methods like Gaussian elimination involves many division operations and intermediate results, which can accumulate floating-point error.
In conclusion, solving through the Cholesky factors leads to less error than forming the inverse via Gaussian elimination.
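A small experiment of my own (not from the paper) comparing the two routes on an ill-conditioned SPD matrix:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
n = 200
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(np.logspace(-10, 0, n)) @ Q.T   # SPD, condition number ~1e10
b = rng.standard_normal(n)

x_inv = np.linalg.inv(A) @ b                    # explicit inverse, then multiply
x_cho = cho_solve(cho_factor(A), b)             # solve through the Cholesky factor

# The residual of the Cholesky solve is typically smaller
print(f"residual, explicit inverse: {np.linalg.norm(A @ x_inv - b):.3e}")
print(f"residual, Cholesky solve  : {np.linalg.norm(A @ x_cho - b):.3e}")
```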
In conclusion, the GPTQ algorithm computes $H^{-1} = \text{Cholesky}((2XX^T + \lambda I)^{-1})$: the dampened Hessian is inverted, and the Cholesky factor of that inverse is what the per-column updates actually use.
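Putting the pieces together, here is a minimal NumPy sketch of that preprocessing step with made-up calibration data; the real implementation runs on the GPU, works in blocks, and consumes the upper-triangular factor row by row.

```python
import numpy as np

def prepare_hessian_inverse(X, percdamp=0.01):
    """Dampened layer Hessian -> Cholesky factor of its inverse (sketch).

    X : (d, m) calibration inputs, one column per sample.
    """
    d = X.shape[0]
    H = 2.0 * X @ X.T                                 # layer Hessian
    H += percdamp * np.mean(np.diag(H)) * np.eye(d)   # dampening
    Hinv = np.linalg.inv(H)                           # in practice done via Cholesky solves
    return np.linalg.cholesky(Hinv).T                 # upper-triangular factor

# Example with stand-in calibration data
rng = np.random.default_rng(0)
Hinv_chol = prepare_hessian_inverse(rng.standard_normal((128, 1024)))
print(Hinv_chol.shape)  # (128, 128)
```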