You can try it with this model here: https://hugston.com/models/56tps-tested-autoround-qwen35-35b... which is really well done and runs pretty fast with context up to 300k. Just 11.65 GB. Get the mmproj file as well for vision/image processing.
hmm... at Q4_K_M, stock-style quantization is retaining ~99–99.8% of BF16 accuracy, and AutoRound pushes that to roughly 99.4% up to just over 100% (if I'm reading it right); the gap is roughly 0.1–0.7 percentage points
https://github.com/intel/auto-round/blob/main/docs/gguf_alg_...
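Rough sketch of how those percentages work out; the scores below are made-up placeholders to show the arithmetic, not numbers from the AutoRound docs:

```python
# Made-up placeholder scores, just to illustrate the arithmetic
# (not numbers from the AutoRound docs).
bf16_score = 78.0       # BF16 baseline benchmark accuracy (%)
q4_stock = 77.0         # stock Q4_K_M quant (%)
q4_autoround = 77.6     # AutoRound Q4_K_M quant (%)

def retention(quant: float, baseline: float) -> float:
    """Percentage of the baseline accuracy the quantized model keeps."""
    return 100.0 * quant / baseline

print(f"stock retention:     {retention(q4_stock, bf16_score):.1f}% of BF16")      # ~98.7%
print(f"AutoRound retention: {retention(q4_autoround, bf16_score):.1f}% of BF16")  # ~99.5%
print(f"gap: {q4_autoround - q4_stock:.1f} percentage points")                     # 0.6 pp
```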
> at Q4_K_M, stock-style quantization is retaining ~99–99.8% of BF16 accuracy
I call bs on that. Not even FP8 is at 99.8% in every scenario; it's close, but not quite bit-exact, and saying you reach 99% with Q4 is a stretch. Maybe if all you test is really old benchmark questions that are in every training set out there, but go a bit OOD and you'll see your Q4 crumble. Try coding in a niche language, or long-context math (not 1+2 from the MATH benchmark, and not in the AIME sets), and you'll see a few percentage points of accuracy loss for each quant step.
Because the accuracy loss is pretty small in both cases, that's still a pretty big relative improvement; the degradation is roughly halved, so it's around twice as good, right? I'm not sure how to interpret these percentage points from a usability point of view, though.
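One way to make "around twice as good" concrete is to compare the degradation (what's lost relative to BF16) rather than the retention. A minimal sketch, using midpoints of the ranges quoted above as illustrative values:

```python
# Midpoints of the retention ranges quoted above (illustrative values only).
stock_retention = 99.4        # ~99-99.8% of BF16 accuracy
autoround_retention = 99.7    # ~99.4-100% of BF16 accuracy

stock_loss = 100.0 - stock_retention          # ~0.6 points given up vs BF16
autoround_loss = 100.0 - autoround_retention  # ~0.3 points given up vs BF16

# Both absolute losses look tiny, but the relative improvement is large:
# the degradation is roughly halved.
print(f"relative improvement: {stock_loss / autoround_loss:.1f}x")  # ~2.0x
```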
Taking 20x to 40x the time RTN took, looking at the table at the bottom? If so, that's a pretty drastic trade-off.
Claims of "we preserve 99.9999%" of accuracy are made in practically every quantization paper. The whole subfield acts like it's totally fine that they are testing on datasets that these models have fully trained on.
In any other subfield this would be considered cheating and would get your paper rejected, but the quantization community really loves to spread FUD claiming that quantization doesn't harm models.
Also, similar dynamic with dense vs sparse MoE models. There's a reason we keep getting dense model releases alongside the MoEs out of China.
Quantization is not free, it causes significant brain damage (especially on very long contexts), and there is enough academic misconduct within the subfield that it's actively screwing up the market. Don't believe me? Go ask your local financial analyst about the market's reaction to TurboQuant and then try to square that circle with this: https://openreview.net/forum?id=tO3ASKZlok (extreme and credible allegations of academic misconduct/fraud)
Most quant papers I've seen report non-trivial degradation on standard benchmarks, on the order of 1–10% relative to FP16/BF16, especially at 4 bits or lower. For example, I just opened a random paper: https://arxiv.org/pdf/2410.09426 (see Table 1).
p.s. dense vs MoE: both are being released because they offer different trade-offs. At the same level of quality, an MoE will use less compute but more memory.
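A back-of-the-envelope sketch of that trade-off; the parameter counts (and the assumption that a 70B dense model and a 140B-total/20B-active MoE sit in roughly the same quality class) are purely illustrative:

```python
# Back-of-the-envelope dense vs MoE comparison (all numbers are illustrative).
# Memory footprint scales with TOTAL parameters; per-token compute scales
# with the parameters that are actually ACTIVE for each token.

def footprint(total_params_b: float, active_params_b: float, bytes_per_param: int = 2):
    """Return (memory in GB at BF16, rough per-token GFLOPs ~ 2 FLOPs per active weight)."""
    mem_gb = total_params_b * bytes_per_param   # billions of params * bytes/param = GB
    gflops_per_token = 2 * active_params_b      # matmul cost ~ 2 FLOPs per weight per token
    return mem_gb, gflops_per_token

dense_mem, dense_flops = footprint(total_params_b=70, active_params_b=70)   # dense: all params active
moe_mem, moe_flops = footprint(total_params_b=140, active_params_b=20)      # MoE: only a few experts active

print(f"dense 70B:             {dense_mem:.0f} GB, {dense_flops:.0f} GFLOPs/token")  # 140 GB, 140
print(f"MoE 140B (20B active): {moe_mem:.0f} GB, {moe_flops:.0f} GFLOPs/token")      # 280 GB, 40

# If these two really are in the same quality class (an assumption here),
# the MoE needs ~2x the memory but does ~3.5x less compute per token.
```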
the TurboQuant debacle is entirely on Google: an incredible amount of academic misconduct and general incompetence.
A) the QLJ thing they added is useless and the code they released didn't even include it since it makes the results worse
B) the blog post about TurboQuant was AI-generated and stated that TurboQuant used a polar transformation when it doesn't, so for the first two weeks people thought TurboQuant involved a transformation to polar coordinates. The blog post was probably wrong because Google kept trying to put their useless PolarQuant paper on the map by talking about it repeatedly.
C) since they don't use QLJ they wholesale copied the quantization technique from the HIGGS paper without citing it (https://arxiv.org/pdf/2411.17525)
D) the whole RaBitQ thing (https://openreview.net/forum?id=tO3ASKZlok) and Google's incompetent and tone-deaf response (https://openreview.net/forum?id=tO3ASKZlok&noteId=X882cbyNNM): they asked the RaBitQ authors for help, then shat on them in their paper and ignored their emails when they objected.