11 points | by lastdong 3 hours ago
4 comments
I see mentions that it reduced the size of the models, but not how much memory was saved. I guess it depends on how it's used? But I would be very curious to see some benchmarking for that.
I think the headline is misleading. It's some random fork of llama.cpp; I can't find evidence that TurboQuant was actually added to llama.cpp proper.
The only legit PR I can find is this one [0], and it's still open.
There are currently a lot of rejected vibe-coded PRs: [1] (violation of the AI policy).
The OP's PR says it was generated with Claude Code, so it has a very low chance of getting merged upstream.
[0] https://github.com/ggml-org/llama.cpp/pull/21089
[1] https://github.com/ggml-org/llama.cpp/pulls?q=Turboquant+is%...
Great news! Expecting this to get implemented in all the major inference runners pretty fast. See also: https://news.ycombinator.com/item?id=47637422
CUDA support added. Also see https://news.ycombinator.com/item?id=47562135#47635952