This is huge for local-first AI, especially for privacy-sensitive workloads like memory systems.
*Why local memory inference matters:*
When building agents with long-term memory, "where does my data live?" becomes critical.
Even with E2EE (MemoryLake uses 3-party encryption so no single entity holds all keys), some users want memory extraction to happen entirely on-device.
*Architecture we're converging on:*
1. Local extraction (MLX/Ollama) – process docs/audio/video on-device
2. Encrypted sync – store structured memory in cloud
3. Centralized orchestration – multi-hop reasoning across memory graph
*Why hybrid:*
- Local: Protects raw PII during extraction
- Cloud: Enables cross-device access + powerful reasoning over full memory graph
MLX makes sub-1B domain models practical for local memory extraction. We've tested MemoryLake-D1 quantized to 4-bit on M3 – still hits 98%+ accuracy at 40 tokens/sec.
The performance gap between x86 and Apple Silicon for this workload is dramatic (3-5x faster).
LLMs on device is the future. It's more secure and solves the problem of too much demand for inference compared to data center supply, it also would use less electricity. It's just a matter of getting the performance good enough. Most users don't need frontier model performance.
I very recently installed llama.cpp on my consumer-grade M4 MBP, and I've been having loads of fun poking and prodding the local models. There's now a ChatGPT style interface baked into llama.cpp, which is very handy for quick experimentation. (I'm not entirely sure what Ollama would get me that llama.cpp doesn't, happy to hear suggestions!)
There are some surprisingly decent models that happily fit even into a mere 16 gigs of RAM. The recent Qwen 3.5 9B model is pretty good, though it did trip all over itself to avoid telling me what happened on Tiananmen Square in 1989. (But then I tried something called "Qwen3.5-9B-Uncensored-HauhauCS-Aggressive", which veers so hard the other way that it will happily write up a detailed plan for your upcoming invasion of Belgium, so I guess it all balances out?)
It feels like you'll soon need a local llm to intermediate with the remote llm, like an ad blocker for browsers to stop them injecting ads or remind you not to send corporate IP out onto the Internet.
I have journaled digitally for the last 5 years with this expectation.
Recently I built a graphRAG app with Qwen 3.5 4b for small tasks like classifying what type of question I am asking or the entity extraction process itself, as graphRAG depends on extracted triplets (entity1, relationship_to, entity2). I used Qwen 3.5 27b for actually answering my questions.
It works pretty well. I have to be a bit patient but that’s it. So in that particular use case, I would agree.
I used MLX and my M1 64GB device. I found that MLX definitely works faster when it comes to extracting entities and triplets in batches.
Any citations? Because that was my impression, too. I want frontier model performance for my coding assistant, but "most users" could do with smaller/faster models.
ChatGPT free falls back to GPT-5.2 Mini after a few interactions.
Have you used GPT instant or mini yourself? I think it’s pretty cynical to assume that this is “good enough for most people”, even if they don’t know the difference between that and better models.
Frontier model has much better knowledge and they usually hallucinate less. It's not about the coding capabilities, it's about how much you can trust the model.
Have you tried the free version of ChatGPT? It is positively appalling. It’s like GPT 3.5 but prompted to write three times as much as necessary to seem useful. I wonder how many people have embarrassed themselves, lost their jobs, and been critically misinformed. All easy with state-of-the-art models but seemingly a guarantee with the bottom sub-slop tier.
Is the average person just talking to it about their day or something?
> It's just a matter of getting the performance good enough.
Who will pay for the ongoing development of (near-)SoTA local models? The good open-weight models are all developed by for-profit companies - you know how that story will end.
You could argue that the only reason we have good open-weight models is because companies are trying to undermine the big dogs, and they are spending millions to make sure they dont get too far ahead. If the bubble pops then there wont be incentive to keep doing it.
I agree. I can totally see in the future that open source LLMs will turn into paying a lumpsum for the model. Many will shut down. Some will turn into closed source labs.
When VCs inevitably ask their AI labs to start making money or shut down, those free open source LLMS will cease to be free.
Chinese AI labs have to release free open source models because they distill from OpenAI and Anthropic. They will always be behind. Therefore, they can't charge the same prices as OpenAI and Anthropic. Free open source is how they can get attention and how they can stay fairly close to OpenAI and Anthropic. They have to distill because they're banned from Nvidia chips and TSMC.
Before people tell me Chinese AI labs do use Nvidia chips, there is a huge difference between using older gimped Nvidia H100 (called H20) chips or sneaking around Southeast Asia for Blackwell chips and officially being allowed to buy millions of Nvidia's latest chips to build massive gigawatt data centers.
> have to release free open source models because they distill from OpenAI and Anthropic
They dont really have to though, they just need to be good enough and cheaper (even if distilled). That being said, it is true they are gaining a lot of visibility (specially Qwen) because of being open-source(weight).
Hardware-wise they seem they will catch-up in 3-5 years (Nvidia is kind of irrelevant, what matters is the node).
About 2.5 decades from the start of the JVs, but they did it. Semiconductors and jet turbines are really the last two tech trees that China has yet to master.
Right. When I said "they'll always be behind", I meant in the next 5-10 years. They're gated by EUV tech. And once they have EUV tech, they need to scale up chip manufacturing.
That also means sending every user a copy of the model that you spend billions training. The current model (running the models at the vendor side) makes it much easier to protect that investment
Man I really hope so, as, as much as I like Claude Code, I hate the company paying for it and tracking your usage, bullshit management control, etc. I feel like I'm training my replacement. Things feel like they are tightening vs more power and freedom.
On device I would gladly pay for good hardware - it's my machine and I'm using as I see fit like an IDE.
When local LLMs get good enough for you to use delightfully, cloud LLMs will have gotten so much smarter that you'll still use it for stuff that needs more intelligence.
True, but I'm already producing code/features faster than company knows what to do with, (even though every company says "omg we need this yesterday", etc). Even coding before AI was basically same.
It isn't going to replace cloud LLMs since cloud LLMs will always be faster in throughput and smarter. Cloud and local LLMs will grow together, not replace each other.
I'm not convinced that local LLMs use less electricity either. Per token at the same level of intelligence, cloud LLMs should run circles around local LLMs in efficiency. If it doesn't, what are we paying hundreds of billions of dollars for?
I think local LLMs will continue to grow and there will be an "ChatGPT" moment for it when good enough models meet good enough hardware. We're not there yet though.
Note, this is why I'm big on investing in chip manufacture companies. Not only are they completely maxed out due to cloud LLMs, but soon, they will be double maxed out having to replace local computer chips with ones that are suited for inferencing AI. This is a massive transition and will fuel another chip manufacturing boom.
It works really well for "You're helpful assistant / Hi / Hello there. how may I help you today?" Anything else (esp in non-EN language) and you will see the limitations yourself. just try it.
Yea I get that there will always be demand for local waifus. I never said local LLMs won't be a thing. I even said it will be a huge thing. Just won't replace cloud.
Looking at downvotes I feel good about SDE future in 3-5 years. We will have a swamp of "vibe-experts" who won't be able to pay 100K a month to CC. Meanwhile, people who still remember how to code in Vim will (slowly) get back to pre-COVID TC levels.
What is CC and TC? I have not heard these abbreviations (except for CC to mean credit card or carbon copy, neither of which is what I think you mean here).
Good to see Ollama is catching up with the times for inference on Mac. MLX powered inference makes a big difference, especially on M5 as their graphs point out.
What really has been a game changer for my workflow is using https://omlx.ai/ that has SSD KV cold caching. No longer have to worry about a session falling out of memory and needing to prefill again. Combine that with the M5 Max prefill speed means more time is spend on generation than waiting for 50k+ content window to process.
Already running qwen 70b 4-bit on m2 max 96gb through llama.cpp and it's pretty solid for day to day stuff. The mlx switch is interesting because ollama was basically shelling out to llama.cpp on mac before, so native mlx should mean better memory handling on apple silicon. Curious to see how it compares on the bigger models vs the gguf path
You can run Qwen3.5-35B-A3B on 32GB of RAM sure, although to get 'Claude Code' performance, which I assume he means Sonnet or Opus level models in 2026, this will likely be a few years away before its runnable locally (with reasonable hardware).
#The use of NVFP4 results in a 3.5x reduction in model memory footprint relative to FP16 and a 1.8x reduction compared to FP8, while maintaining model accuracy with less than 1% degradation on key language modeling tasks for some models.
Ollama is a user-friendly UI for LLM inference. It is powered by llama.cpp (or a fork of it) which is more power-user oriented and requires command-line wrangling. GGML is the math library behind llama.cpp and GGUF is the associated file format used for storing LLM weights.
This is huge for local-first AI, especially for privacy-sensitive workloads like memory systems.
*Why local memory inference matters:*
When building agents with long-term memory, "where does my data live?" becomes critical.
Even with E2EE (MemoryLake uses 3-party encryption so no single entity holds all keys), some users want memory extraction to happen entirely on-device.
*Architecture we're converging on:* 1. Local extraction (MLX/Ollama) – process docs/audio/video on-device 2. Encrypted sync – store structured memory in cloud 3. Centralized orchestration – multi-hop reasoning across memory graph
*Why hybrid:* - Local: Protects raw PII during extraction - Cloud: Enables cross-device access + powerful reasoning over full memory graph
MLX makes sub-1B domain models practical for local memory extraction. We've tested MemoryLake-D1 quantized to 4-bit on M3 – still hits 98%+ accuracy at 40 tokens/sec.
The performance gap between x86 and Apple Silicon for this workload is dramatic (3-5x faster).
LLMs on device is the future. It's more secure and solves the problem of too much demand for inference compared to data center supply, it also would use less electricity. It's just a matter of getting the performance good enough. Most users don't need frontier model performance.
I very recently installed llama.cpp on my consumer-grade M4 MBP, and I've been having loads of fun poking and prodding the local models. There's now a ChatGPT style interface baked into llama.cpp, which is very handy for quick experimentation. (I'm not entirely sure what Ollama would get me that llama.cpp doesn't, happy to hear suggestions!)
There are some surprisingly decent models that happily fit even into a mere 16 gigs of RAM. The recent Qwen 3.5 9B model is pretty good, though it did trip all over itself to avoid telling me what happened on Tiananmen Square in 1989. (But then I tried something called "Qwen3.5-9B-Uncensored-HauhauCS-Aggressive", which veers so hard the other way that it will happily write up a detailed plan for your upcoming invasion of Belgium, so I guess it all balances out?)
Oh does llama.cpp use MLX or whatever? I had this question, wonder if you know? A search suggests it doesn’t but I don’t really understand.
It feels like you'll soon need a local llm to intermediate with the remote llm, like an ad blocker for browsers to stop them injecting ads or remind you not to send corporate IP out onto the Internet.
I have journaled digitally for the last 5 years with this expectation.
Recently I built a graphRAG app with Qwen 3.5 4b for small tasks like classifying what type of question I am asking or the entity extraction process itself, as graphRAG depends on extracted triplets (entity1, relationship_to, entity2). I used Qwen 3.5 27b for actually answering my questions.
It works pretty well. I have to be a bit patient but that’s it. So in that particular use case, I would agree.
I used MLX and my M1 64GB device. I found that MLX definitely works faster when it comes to extracting entities and triplets in batches.
Did you get any insights about yourself from this process? I am thinking of doing the same
Depending on the use case, the future is already here.
For example, last week I built a real-time voice AI running locally on iPhone 15.
One use case is for people learning speaking english. The STT is quite good and the small LLM is enough for basic conversation.
https://github.com/fikrikarim/volocal
Brilliant. Hope to see you in the App Store!
Oh thank you! I wasn’t sure if it was worth submitting to the app store since it was just a research preview, but I could do it if people want it.
"Most users don't need frontier model performance" unfortunately, this is not the case.
Any citations? Because that was my impression, too. I want frontier model performance for my coding assistant, but "most users" could do with smaller/faster models.
ChatGPT free falls back to GPT-5.2 Mini after a few interactions.
Have you used GPT instant or mini yourself? I think it’s pretty cynical to assume that this is “good enough for most people”, even if they don’t know the difference between that and better models.
Frontier model has much better knowledge and they usually hallucinate less. It's not about the coding capabilities, it's about how much you can trust the model.
re: trust-
Have you tried the free version of ChatGPT? It is positively appalling. It’s like GPT 3.5 but prompted to write three times as much as necessary to seem useful. I wonder how many people have embarrassed themselves, lost their jobs, and been critically misinformed. All easy with state-of-the-art models but seemingly a guarantee with the bottom sub-slop tier.
Is the average person just talking to it about their day or something?
Not sure about the using less electricity part. With batching, it’s more efficient to serve multiple users simultaneously.
Indeed. Data centers have so many ways and reasons to be much more energy-efficient than local compute it's not even funny.
> It's just a matter of getting the performance good enough.
Who will pay for the ongoing development of (near-)SoTA local models? The good open-weight models are all developed by for-profit companies - you know how that story will end.
You could argue that the only reason we have good open-weight models is because companies are trying to undermine the big dogs, and they are spending millions to make sure they dont get too far ahead. If the bubble pops then there wont be incentive to keep doing it.
I agree. I can totally see in the future that open source LLMs will turn into paying a lumpsum for the model. Many will shut down. Some will turn into closed source labs.
When VCs inevitably ask their AI labs to start making money or shut down, those free open source LLMS will cease to be free.
Chinese AI labs have to release free open source models because they distill from OpenAI and Anthropic. They will always be behind. Therefore, they can't charge the same prices as OpenAI and Anthropic. Free open source is how they can get attention and how they can stay fairly close to OpenAI and Anthropic. They have to distill because they're banned from Nvidia chips and TSMC.
Before people tell me Chinese AI labs do use Nvidia chips, there is a huge difference between using older gimped Nvidia H100 (called H20) chips or sneaking around Southeast Asia for Blackwell chips and officially being allowed to buy millions of Nvidia's latest chips to build massive gigawatt data centers.
> have to release free open source models because they distill from OpenAI and Anthropic
They dont really have to though, they just need to be good enough and cheaper (even if distilled). That being said, it is true they are gaining a lot of visibility (specially Qwen) because of being open-source(weight).
Hardware-wise they seem they will catch-up in 3-5 years (Nvidia is kind of irrelevant, what matters is the node).
I highly doubt they can catch up in 3-5 years to Nvidia.
Chips take about 3 years to design. Do you think China will have Feymann-level AI systems in 3 years?
I think in 3 years, they'll have H200-equivalent at home.
“They will always be behind”
Car manufacturers said the same.
It did take decades to catch and surpass US car makers right?
About 2.5 decades from the start of the JVs, but they did it. Semiconductors and jet turbines are really the last two tech trees that China has yet to master.
Right. When I said "they'll always be behind", I meant in the next 5-10 years. They're gated by EUV tech. And once they have EUV tech, they need to scale up chip manufacturing.
Which might they master first?
This seems to be somewhat similar to web browsers.
I could see the model becoming part of the OS.
Of course Google and Microsoft will still want you to use their models so that they can continue to spy on you.
Apple and Nvidia would sell hardware to run their own largest models.
You can have viable business model around open weight models where you offer fine tuning at a fee.
That also means sending every user a copy of the model that you spend billions training. The current model (running the models at the vendor side) makes it much easier to protect that investment
Man I really hope so, as, as much as I like Claude Code, I hate the company paying for it and tracking your usage, bullshit management control, etc. I feel like I'm training my replacement. Things feel like they are tightening vs more power and freedom.
On device I would gladly pay for good hardware - it's my machine and I'm using as I see fit like an IDE.
When local LLMs get good enough for you to use delightfully, cloud LLMs will have gotten so much smarter that you'll still use it for stuff that needs more intelligence.
True, but I'm already producing code/features faster than company knows what to do with, (even though every company says "omg we need this yesterday", etc). Even coding before AI was basically same.
Code tools that free my time up is very nice.
It isn't going to replace cloud LLMs since cloud LLMs will always be faster in throughput and smarter. Cloud and local LLMs will grow together, not replace each other.
I'm not convinced that local LLMs use less electricity either. Per token at the same level of intelligence, cloud LLMs should run circles around local LLMs in efficiency. If it doesn't, what are we paying hundreds of billions of dollars for?
I think local LLMs will continue to grow and there will be an "ChatGPT" moment for it when good enough models meet good enough hardware. We're not there yet though.
Note, this is why I'm big on investing in chip manufacture companies. Not only are they completely maxed out due to cloud LLMs, but soon, they will be double maxed out having to replace local computer chips with ones that are suited for inferencing AI. This is a massive transition and will fuel another chip manufacturing boom.
Yep. People were claiming DeepSeek was "almost as good as SOTA" when it came out. Local will always be one step away like fusion.
It's just wishful thinking (and hatred towards American megacorps). Old as the hills. Understandable, but not based on reality.
Local RTX 5090 is actually faster than A100/H100.
We are 100% there already. In browser.
the webgpu model in my browser on my m4 pro macbook was as good as chatgpt 3.5 and doing 80+ tokens/s
Local is here.
Sir, ChatGPT 3.5 is more than 3 years old, running on your bleeding edge M4 Pro hardware, and only proves the previous commenters point.
It works really well for "You're helpful assistant / Hi / Hello there. how may I help you today?" Anything else (esp in non-EN language) and you will see the limitations yourself. just try it.
You're assuming throughput sets the value, but offline use and privacy change the tradeoff fast.
Yea I get that there will always be demand for local waifus. I never said local LLMs won't be a thing. I even said it will be a huge thing. Just won't replace cloud.
Looking at downvotes I feel good about SDE future in 3-5 years. We will have a swamp of "vibe-experts" who won't be able to pay 100K a month to CC. Meanwhile, people who still remember how to code in Vim will (slowly) get back to pre-COVID TC levels.
What is CC and TC? I have not heard these abbreviations (except for CC to mean credit card or carbon copy, neither of which is what I think you mean here).
I figured it out from context clues
CC: Claude Code
TC: total comp(ensation)
Thank you for clarifying! (I had no idea it needs to be explained, sorry.)
Good to see Ollama is catching up with the times for inference on Mac. MLX powered inference makes a big difference, especially on M5 as their graphs point out. What really has been a game changer for my workflow is using https://omlx.ai/ that has SSD KV cold caching. No longer have to worry about a session falling out of memory and needing to prefill again. Combine that with the M5 Max prefill speed means more time is spend on generation than waiting for 50k+ content window to process.
Already running qwen 70b 4-bit on m2 max 96gb through llama.cpp and it's pretty solid for day to day stuff. The mlx switch is interesting because ollama was basically shelling out to llama.cpp on mac before, so native mlx should mean better memory handling on apple silicon. Curious to see how it compares on the bigger models vs the gguf path
How does it compare to some of the newer mlx inference engines like optiq that support turboquantization - https://mlx-optiq.pages.dev/
still waiting for the day I can comfortably run Claude Code with local llm's on MacOS with only 16gb of ram
How close is this? It says it needs 32GB min?
You can run Qwen3.5-35B-A3B on 32GB of RAM sure, although to get 'Claude Code' performance, which I assume he means Sonnet or Opus level models in 2026, this will likely be a few years away before its runnable locally (with reasonable hardware).
I fully agree, I run that one with Q4 on my MBP, and the performance (including quality of response) is a let down.
I am wondering how people rave so much about local "small devices" LLM vs what codex or Claude code are capable of.
Sadly there are too much hype on local LLM, they look great for 5min tests and that's it.
Just train it better with AGENTS.md
Finally! My local infra is waiting for it for months!
How does this compare to llama.cpp in terms of performance?
MLX is a bit faster (low double digit percentage), but uses a bit more RAM. Worthwhile tradeoff for many.
"We can run your dumbed down models faster":
#The use of NVFP4 results in a 3.5x reduction in model memory footprint relative to FP16 and a 1.8x reduction compared to FP8, while maintaining model accuracy with less than 1% degradation on key language modeling tasks for some models.
What is the difference between Ollama, llama.cpp, ggml and gguf?
Ollama is a user-friendly UI for LLM inference. It is powered by llama.cpp (or a fork of it) which is more power-user oriented and requires command-line wrangling. GGML is the math library behind llama.cpp and GGUF is the associated file format used for storing LLM weights.
i've found llama.cpp (as i understand it, ollama now uses their own version of this) to work much better in practice, faster and much more flexible.
Ollama on MacOS is a one-click solution with stable obe-click updates. Happy so far. But the mlx support was the only missing piece for me.