11 points | by lastdong 3 hours ago
4 comments
I see mentions that it reduced the size of the models, but not how much memory was saved. I guess it depends on how it's used? But I would be very curious to see some benchmarking for that.
I think the headline is misleading. It's some random fork of llama.cpp; I can't find evidence that TurboQuant was actually added to llama.cpp proper.
The only legit PR I can find is this one [0], and it's still open.
There are currently a lot of rejected vibe-coded PRs: [1] (violation of the AI policy).
The OP's PR says it was generated with Claude Code, so it has a very low chance of getting merged upstream.
[0] https://github.com/ggml-org/llama.cpp/pull/21089
[1] https://github.com/ggml-org/llama.cpp/pulls?q=Turboquant+is%...
Great news! Expecting this to get implemented in all the major inference runners pretty fast. See also: https://news.ycombinator.com/item?id=47637422
CUDA support added. Also see https://news.ycombinator.com/item?id=47562135#47635952