GPTQ Note
Abstract
GPTQ Method Summary
Questions
- The paper says that “As a consequence, the set of unquantized weights F and similarly H_F^{-1} is always the same for all rows.” I don’t quite understand this.
- What is the specific procedure of this quantization? That is, what exactly are the inverse layer Hessian and the weight matrix / block here? I think I can see that the inverse layer Hessian seems to be traversed in a row-major way, while the weight matrix is traversed in a column-major way?
- In the background part, it mentions that they assume the quantization grid for W is fixed before the process, and that individual weights can move freely (as in the works cited there). I don’t quite understand this.
- I still need to read more about the Hessian. The first question is definitely why the Hessian looks like this. The second is why full-precision weights are involved.
- I find these formulas hard to read. What do the $qq$ and $q$ subscripts mean here?
- Cholesky decomposition is said to be more numerically stable than direct matrix inversion. But why?
- Why do small eigenvalues often correspond to noise rather than signal, and why does it say that directions with little statistical support get massive influence?
Summary
- The paper found that, for reasonably large language models, the order in which weights are quantized one by one does not matter much. The insight drawn from this is that any fixed order may perform well, especially on large models.
Dampening Part
In GPTQ quantization, when a weight is quantized, the remaining weights must be updated to compensate for the quantization error.
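As a reference for the update itself, here is a minimal NumPy sketch of that compensation step as I understand it from the paper's algorithm; the `quantize` stand-in and the one-column-at-a-time indexing are simplifications of my own, not the real implementation.

```python
import numpy as np

def quantize(w, scale=0.1):
    """Hypothetical stand-in for rounding weights onto a fixed quantization grid."""
    return np.round(w / scale) * scale

def quantize_column(W, Hinv, q, scale=0.1):
    """Quantize column q of W and spread the error over the not-yet-quantized columns.

    W    : (rows, cols) weight matrix, updated in place
    Hinv : (cols, cols) inverse Hessian of the layer inputs
    q    : index of the column being quantized
    """
    w = W[:, q].copy()
    w_q = quantize(w, scale)
    # Per-row quantization error, scaled by the diagonal entry [H^{-1}]_{qq}
    err = (w - w_q) / Hinv[q, q]
    W[:, q] = w_q
    # Compensation: update the remaining columns using row q of H^{-1}
    W[:, q + 1:] -= np.outer(err, Hinv[q, q + 1:])
    return W
```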
From a matrix point of view, matrix conditioning theory helps explain what can go wrong in this step.
Matrix conditioning theory: for any invertible matrix A: $$ \kappa(A) = \|A\|\,\|A^{-1}\| $$ This is the definition of the condition number of an invertible matrix A.
And since, for the spectral norm, we know: $$ \begin{align} \|A\| &= \sigma_{\max}(A) \\ \|A^{-1}\| &= \sigma_{\max}(A^{-1}) = 1/\sigma_{\min}(A) \\ \kappa(A) &= \|A\|\,\|A^{-1}\| = \sigma_{\max}(A) / \sigma_{\min}(A) \end{align} $$
Alternative form: $$ \kappa(A) = \lambda_{\max}(A) / \lambda_{\min}(A) = \sigma_{\max}(A) / \sigma_{\min}(A) $$ This holds because, for a symmetric positive definite matrix A with eigenpair $Av = \lambda v$: $$ \begin{align} A^T A &= A^2 \\ A^T A v &= A^T(\lambda v) = \lambda A v = \lambda^2 v \\ \lambda_i(A^2) &= \lambda_i(A)^2 \\ \sigma_i(A) &= \sqrt{\lambda_i(A^T A)} = \sqrt{\lambda_i(A)^2} = \lambda_i(A) \end{align} $$
For a system $Ax = b$, if the right-hand side is perturbed by $\delta b$, the solution is perturbed by $\delta x$, and $$ \frac{\|\delta x\|}{\|x\|} \leq \kappa(A)\,\frac{\|\delta b\|}{\|b\|} $$ This means the condition number bounds the relative error amplification.
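A quick NumPy check of these relations on an arbitrary SPD matrix (my own example, not a real layer Hessian):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
A = A @ A.T + 1e-3 * np.eye(50)           # symmetric positive definite

sigma = np.linalg.svd(A, compute_uv=False)
lam = np.linalg.eigvalsh(A)
print(np.isclose(sigma.max() / sigma.min(), lam.max() / lam.min()))  # sigma_i == lambda_i for SPD
print(np.isclose(np.linalg.cond(A), sigma.max() / sigma.min()))      # kappa = sigma_max / sigma_min

# Perturbation bound: ||dx|| / ||x|| <= kappa(A) * ||db|| / ||b||
b = rng.standard_normal(50)
x = np.linalg.solve(A, b)
db = 1e-8 * rng.standard_normal(50)
dx = np.linalg.solve(A, b + db) - x
lhs = np.linalg.norm(dx) / np.linalg.norm(x)
rhs = np.linalg.cond(A) * np.linalg.norm(db) / np.linalg.norm(b)
print(lhs <= rhs)                                                    # bound holds
```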
Ill-conditioning problem
This brings us to the definition of an ill-conditioning problem. A matrix A is ill-conditioned when it has:
- Large condition number $\kappa(A)$
- Small eigenvalues, even some $\lambda_i = 0$
- Near-singularity: $\det(A) \approx 0$
The condition number part needs no further explanation: a large condition number means a large bound on the relative error amplification of the system Ax = b.
So why do small eigenvalues cause huge updates? I still need a better explanation, but presumably it is because everything goes through $H^{-1}$, whose eigenvalues are $1/\lambda_i$: a near-zero $\lambda_i$ makes some entries of $H^{-1}$ enormous, so any error along that direction gets massively amplified. The small experiment below illustrates this.
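A tiny experiment with a made-up 2x2 Hessian (not from a real layer): shrinking one eigenvalue toward zero blows up the entries of $H^{-1}$.

```python
import numpy as np

# Orthogonal eigenbasis; H has eigenvalues 2.0 and eps
Q = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)

for eps in [1.0, 1e-2, 1e-6]:
    H = Q @ np.diag([2.0, eps]) @ Q.T
    Hinv = np.linalg.inv(H)
    # As eps -> 0, entries of H^{-1} grow like 1/eps, so anything computed
    # through H^{-1} (including the compensation updates) amplifies error
    # along that direction.
    print(f"eps={eps:g}  kappa(H)={np.linalg.cond(H):.1e}  max|H^-1|={np.abs(Hinv).max():.1e}")
```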
Dampening
$H_{damped} = H + \lambda I$
Here $\lambda$ is 1% of the average diagonal value of H.
By applying dampening, we shift the eigenvalues from $\lambda_1, \lambda_2, \cdots, \lambda_n$ to $\lambda_1 + \lambda, \lambda_2 + \lambda, \cdots, \lambda_n + \lambda$. Therefore: $$ \begin{align} \kappa(H) &= \lambda_{\max} / \lambda_{\min} \approx \lambda_{\max} / 0 = \infty \\ \kappa(H + \lambda I) &= (\lambda_{\max} + \lambda) / (\lambda_{\min} + \lambda) \end{align} $$ This prevents the problems caused by tiny eigenvalues.
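A quick check of this effect with stand-in calibration data (the 64/32 shapes are arbitrary): when there are fewer calibration samples than input features, H is rank-deficient, and dampening brings the condition number back into a sane range.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 32))      # 64 input features, only 32 calibration samples
H = 2.0 * X @ X.T                      # rank-deficient Hessian: some eigenvalues ~0

lam = 0.01 * np.mean(np.diag(H))       # dampening: 1% of the average diagonal value
H_damped = H + lam * np.eye(H.shape[0])

print(f"kappa(H)           = {np.linalg.cond(H):.3e}")   # astronomically large
print(f"kappa(H + lam * I) = {np.linalg.cond(H_damped):.3e}")
```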
GPTQ uses Cholesky for numerical stability.
$H = LL^T$, where L is lower triangular.
The advantage of the Cholesky Decomposition is:
- More numerically stable than direct matrix inversion.
- Efficient for positive definite matrices
- Well-optimized kernels are available
When we invert a matrix A directly, the condition number of $A^{-1}$ is the same as that of A, and the errors made along the way get amplified by it. With a Cholesky decomposition we instead work with the triangular factor L, which for a symmetric positive definite A satisfies $\kappa(L) = \sqrt{\kappa(A)}$, i.e. L is better conditioned than A itself.
Direct matrix inversion using methods like Gaussian elimination involves many division operations and intermediate results, which can accumulate floating-point error.
In conclusion, solving through the Cholesky factors leads to less error than forming the inverse via Gaussian elimination.
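A small experiment of my own (not from the paper) comparing the two routes on an ill-conditioned SPD matrix:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
n = 200
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(np.logspace(-10, 0, n)) @ Q.T   # SPD, condition number ~1e10
b = rng.standard_normal(n)

x_inv = np.linalg.inv(A) @ b                    # explicit inverse, then multiply
x_cho = cho_solve(cho_factor(A), b)             # solve through the Cholesky factor

# The residual of the Cholesky solve is typically smaller
print(f"residual, explicit inverse: {np.linalg.norm(A @ x_inv - b):.3e}")
print(f"residual, Cholesky solve  : {np.linalg.norm(A @ x_cho - b):.3e}")
```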
In conclusion, the GPTQ algorithm computes $H^{-1} = \text{Cholesky}((2XX^T + \lambda I)^{-1})$: the dampened Hessian is inverted, and the Cholesky factor of that inverse is what the per-column updates actually use.
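Putting the pieces together, here is a minimal NumPy sketch of that preprocessing step with made-up calibration data; the real implementation runs on the GPU, works in blocks, and consumes the upper-triangular factor row by row.

```python
import numpy as np

def prepare_hessian_inverse(X, percdamp=0.01):
    """Dampened layer Hessian -> Cholesky factor of its inverse (sketch).

    X : (d, m) calibration inputs, one column per sample.
    """
    d = X.shape[0]
    H = 2.0 * X @ X.T                                 # layer Hessian
    H += percdamp * np.mean(np.diag(H)) * np.eye(d)   # dampening
    Hinv = np.linalg.inv(H)                           # in practice done via Cholesky solves
    return np.linalg.cholesky(Hinv).T                 # upper-triangular factor

# Example with stand-in calibration data
rng = np.random.default_rng(0)
Hinv_chol = prepare_hessian_inverse(rng.standard_normal((128, 1024)))
print(Hinv_chol.shape)  # (128, 128)
```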