I saw EAGLE and thought it was going to be about PCB design. Was left disappointed.
Well, there are only so many nouns, and even fewer "cool-sounding" ones. For better project differentiation, do you think we should instead be naming things "ZurgGlurg327"? I'm sure you can find a completely-unique combo for each thing, but good luck remembering the name!
Docker's auto-generated container names and Ubuntu releases use an adjective. That could at least allow some alternatives like bold/supreme/decisive/depressed eagle (or just use battery staple).
Are these speculative decoders ok to use for AI coding agents or do they only fit certain workloads?
Speculative decoding shouldn't actually change the accuracy of the response. The draft model drafts a few tokens, and the inference framework verifies that the larger model would have picked them. Any drafted token that fails that check is thrown away and replaced with the larger model's own pick.
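To make the draft-then-verify loop concrete, here is a minimal sketch of the greedy variant in Python. Everything in it is an assumption for illustration: `toy_target` and `toy_draft` are made-up stand-ins for real models, and `speculative_step`, `generate` and `plain_greedy` are hypothetical helpers, not any framework's API. A real framework verifies all drafted positions in one batched forward pass of the large model, and sampling-based setups use a rejection-sampling rule rather than an exact-match check.

```python
from typing import Callable, List

# Hypothetical model type: maps a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def speculative_step(target: Model, draft: Model, prefix: List[int], k: int = 4) -> List[int]:
    """Return the tokens emitted by one draft-then-verify round."""
    # 1. The small model drafts k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The large model checks each drafted position in order.
    #    (Done one call at a time here; a real framework batches this.)
    emitted: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            # First mismatch: keep the large model's token, drop the rest.
            emitted.append(expected)
            return emitted
        emitted.append(tok)
        ctx.append(tok)

    # 3. Every draft was accepted, so the large model adds one bonus token.
    emitted.append(target(ctx))
    return emitted


def generate(target: Model, draft: Model, prefix: List[int], n_tokens: int, k: int = 4) -> List[int]:
    """Generate n_tokens after prefix using speculative rounds."""
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prefix) + n_tokens]


def plain_greedy(target: Model, prefix: List[int], n_tokens: int) -> List[int]:
    """Baseline: the large model decodes one token at a time."""
    out = list(prefix)
    for _ in range(n_tokens):
        out.append(target(out))
    return out


def toy_target(ctx: List[int]) -> int:
    # Made-up deterministic "large model".
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_draft(ctx: List[int]) -> int:
    # Made-up "small model": wrong whenever the prefix length is a multiple of 3.
    guess = toy_target(ctx)
    return guess if len(ctx) % 3 else (guess + 1) % 50


if __name__ == "__main__":
    prompt = [1, 2, 3]
    fast = generate(toy_target, toy_draft, prompt, 20)
    slow = plain_greedy(toy_target, prompt, 20)
    print("identical output:", fast == slow)
```

The point of the sketch is that every emitted token is one the large model would have chosen at that position, so a bad drafter costs you speed, not correctness.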
However, I've found that speculative decoders don't help much if you're running a model locally on limited hardware (for instance, my 2021 M1 Max with 32GB of unified memory). For one, you have to fit both the large model and the small draft model in memory. For another, if you're running a quantized model, the activation distribution is different enough that the draft model has a hard time guessing what's coming next.
My take is that speculative decoding is most useful on _very expensive_ prosumer/hobbyist setups where you have 128GB of VRAM and are running your local models with full fidelity. It's also helpful for inference providers, who can then serve output tokens at a computational cost only slightly higher than their input token cost.
They work better for coding workloads. Essentially, the more regular the output, the more the faster model gets right, the less the big model has to do.
Writing tends to have more false positives, meaning drafted tokens that the big model ends up rejecting. I haven't tried this particular one, but that is the general trend.
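A rough way to put numbers on that trend: if you assume each drafted token is accepted independently with probability `alpha` (an idealization, real acceptance depends on context), then a round with `k` drafted tokens emits on average (1 - alpha^(k+1)) / (1 - alpha) tokens per pass of the big model. A sketch, where `expected_tokens_per_pass` is a hypothetical helper and the `alpha` values are made up to stand for "irregular" versus "regular" output, not measurements:

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per large-model pass.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha; the large model always contributes one token itself.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Limit of the geometric sum: all k drafts plus the bonus token.
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    # Illustrative acceptance rates only.
    for alpha in (0.3, 0.6, 0.9):
        print(f"alpha={alpha}: {expected_tokens_per_pass(alpha, k=4):.2f} tokens per pass")
```

This ignores the cost of running the drafter itself, so the real speedup is lower than the raw tokens-per-pass figure suggests.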
I think so. The benchmark is on a coding dataset (SPEED-Bench).