> not to be confused with the somewhat baffling llama_chat_apply_template exposed in the libllama API, which hardcodes a handful of chat formats directly in C++
As someone who is tinkering with a desktop-based inference app in FLTK[0], i wish this used the actual Jinja2 template parser llama.cpp uses (or that there was another C function that did, since AFAICT for "proper" rendering you need to be able to pass a bunch of data to the template so it knows if you, e.g., do tool calling). Currently i'm using this adhocky function, but i guess i'll either write a Jinja2 interpreter or copy/paste the one from llama.cpp's code (depending on how i feel at the time :-P).
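For reference, the hardcoded path from the quote boils down to something like this - a minimal sketch assuming a fairly recent llama.h (both functions exist there, but the exact signatures have changed between llama.cpp versions, so treat it as approximate):

    // Sketch only: render a chat with llama.cpp's built-in formatter.
    // llama_chat_apply_template() only understands the handful of hardcoded
    // formats and takes no tool definitions - which is exactly the limitation
    // i'm complaining about above.
    #include "llama.h"
    #include <string>
    #include <vector>

    std::string format_prompt(const llama_model * model,
                              const std::vector<llama_chat_message> & msgs) {
        // nullptr name -> the default chat template stored in the GGUF metadata
        // (may itself be nullptr if the GGUF has none - not handled here)
        const char * tmpl = llama_model_chat_template(model, nullptr);

        std::vector<char> buf(4096);
        int32_t n = llama_chat_apply_template(tmpl, msgs.data(), msgs.size(),
                                              /*add_ass=*/true,
                                              buf.data(), (int32_t) buf.size());
        if (n > (int32_t) buf.size()) {
            // buffer was too small; retry with the size the first call reported
            buf.resize(n);
            n = llama_chat_apply_template(tmpl, msgs.data(), msgs.size(), true,
                                          buf.data(), (int32_t) buf.size());
        }
        return n < 0 ? std::string() : std::string(buf.data(), n);
    }

Proper Jinja2 rendering would additionally take tool definitions, extra context, etc. as template variables, which is exactly what this API has no way to express.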
But yeah, GGUF's "all-in-one" approach is very convenient. And i agree that it feels odd to have the projection models as separate files - i remember when i first downloaded a vision-capable model, i just grabbed whatever GGUF looked appropriate, then llama.cpp told me it couldn't do images with that model and it took me a bit to realize that i had to download an extra file. Literally my thought once i did was "wasn't GGUF supposed to contain everything?" :-P
[0] https://i.imgur.com/GiTBE1j.png
Oh my God I freaking love your app. The 90s Linux desktop vibes hit like a hammer. FLTK FTW!
Nice, I recently pulled down TheBloke's 7B Mistral to try out. I have a 4070.
7B Mistral is quite outdated. On a 12GB 4070 you can run Qwen 3.5 9B Q4_K_M or Qwen 3.6 35B; the latter will be a lot smarter but also a lot slower due to RAM offload.
Try both in LM Studio, they really are surprisingly capable.
I have 80GB of RAM, but it's slow - either capped by the i9 CPU or my specific Asus mobo just sucks. I think it only runs at 2400MHz despite being DDR4.
Tried all the usual stuff - BIOS settings, voltages.
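(Back-of-the-envelope, assuming dual channel: 2 channels x 8 bytes x 2400 MT/s ≈ 38 GB/s of memory bandwidth, vs ~51-57 GB/s for DDR4-3200/3600, so anything offloaded to system RAM crawls - token generation is basically bound by how fast the weights can be streamed from memory.)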
I love Mistral, but that model is... not the best. Maybe try out Gemma 4 E4B; it's a similar size to Mistral 7B and should run great on your 4070 ("E4B" is slightly misleading naming).
Thanks for the tip, what do you use Gemma 4 e4b for?
some say it’s a miniaturized gemini model
it’s good at writing, coding, decently intelligent
you can try it on nvidia nim
I have a 2070 and can confirm it works amazingly fast.
I love TheBloke. I wish he still made stuff.
Yeah, the TheBloke era of local LLMs was good times. TBF Unsloth are doing a fantastic job of publishing quants of the major models quickly - they just don't have nearly the volume of "weird" models that TheBloke did.
What do you use it for? I'm still trying to use agents, I barely use copilot, only at work when I have to.
I didn't want to get personal with an LLM unless it was local, so that's why I was setting this up. So far, research is mainly what I was looking at.
I don't think any of these model runtimes have really caught up with the architectures that are going to make the most sense for the capabilities of the coding models we'll have "6-12 months from now"
Imo -- we'll still need the weights and an abstract description layer for the model computation graphs, but these little jinja2 programs, the various structured grammar mechanisms (llguidance-type complexity), and possibly the entire library of "model graph execution runtime" should all just go away in favor of ahead-of-time, model -> hardware specific, purely inline-generated e2e implementations. The documentation for a new model would just be a few example input/output interactions that _demonstrate_ the model's io encoding protocol. To the extent the jinja2-type setups actually work and generalize, it's all convention-based anyway; a set of appropriate examples plus a "logic based" convention around the ways it makes sense to interpret those examples should be able to describe a deterministic mapping of that protocol, sufficient to cover all the variations in the protocols we've seen to date. A little jinja2 here or there doesn't actually clarify things more than that would, as there's still a convention associated with interpreting the jinja that's needlessly limiting.
"hey gpt 5.6/mythos2 -- I want to run this model on iphone as fast as humanly possible" -> "no problem I see that model contains 6 signatures, there's an embedder, a per layer embedder, a decoder, and a few variants of prefill for fast prompt processing -- do you know the size distribution of the prompts? i can include support for all the prefill size variants or if you want to optimize for smallest possible distribution -- i'll assume a max prompt size and exclude the code needed for the larger prefill sizes? Connect a few devices if you don't mind or i'll attach virtual hardware when it's available and optimize the solution e2e -- i'll synthesize a performance and accuracy benchmark and then perform a search over that benchmark set for the fastest model loading and inference implementations for each hardware included in the evaluation matrix. The longer it runs the more confidence we should have in speed and robustness of the setup. I can stop automatically when it looks like things are good or continue the search for a fixed amount of time if you like..."
I've been dealing with various edge device runtimes lately and am already running a sort of "worse" version of the above process, which is imo mainly limited by the need to "take advantage" of the "annoyingly shaped model runtime abstraction layers of old". I'm almost positive the whole setup would already be workable when driven by existing frontier models if the model runtimes were just "big piles of example code" which the model and i could draw from and modify/inline into place as needed, vs the current shape of "sort of semi general purpose computation graphs but with deeply non-observable error mechanisms and sources of ambiguity" ...
>Published May 18, 2026
hmmm...
whoops, my bad. Just a typo in the markdown. Fixed :)