That's almost exactly my setup and I'm very happy with its performance.
I noticed recently that I started to prefer my local Qwen3.6 35B A3B and pi agent over Claude Code.
Both fail at different tasks, and Qwen more so than Claude.
But the way Qwen fails is much more straightforward. In writing tasks, Qwen's hallucinations and bullshitting are much easier to spot because it doesn't have the sleek vocabulary and wordsmithing skills to disguise its ignorance.
In coding tasks that Qwen can't solve, it often just goes into a tool-calling doom loop that the pi harness can catch, whereas Claude attempts ever more convoluted and creative things, making more and more of a mess that takes forever to clean up.
I think part of the story is that the tasks for which I use AI are fairly simple and maybe don't need a frontier model. But I wonder if "proper" developers have had similar experiences?
I keep finding more and more use cases for Q3.6 27b (same league), and it performs best when the answer to my question is already in the context.
The moment I'm trying something open-ended or ambitious, Claude/ChatGPT clearly take you to the goal quicker.
For things where there's a way to build a knowledge base, though, the local LLM can definitely be a true contender. Plus, with a big context and no worries about filling it over and over, you can get quite far.
I'm writing this literally in between cooking a pasta that the local LLM ordered the products for online. I've built a grocery shopping skill, so it roughly knows what I have in the fridge (loosely), my last 10 representative orders (general preferences plus rich info about shops and SKUs around me) and actual real-time in-stock info. The last part has been my personal pet peeve with every product that promised cooking ingredient delivery (that is not packaged specifically for that).
This is what has been promised to us by every big tech company with an agent, and now a local LLM has actually solved that for me fully.
It's also going to fail consistently. When calling Claude you don't know what version of the model you are talking to; it might be quantized due to load or have been patched.
This is true. The failure modes are simpler. And yes, the ceiling is lower as well. Smaller models' stability is lower over long sequences, and thus anything that needs a lot of CoT will be weaker. For example, I had a dumb lock + condvar with multiple defenses against lost wakeups in an N-producer, 1-consumer queue thing. Models generally need a lot of CoT before they realise they can switch it to a semaphore instead. Qwen typically isn't stable over such long CoTs and ends up adding more and more slop and band-aids, versus a larger model that outputs a large CoT and then realises it can swap 3 functions out for 2 lines if we use a semaphore.
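For anyone who hasn't hit this pattern, here is a minimal Python sketch of the semaphore version of an N-producer, 1-consumer queue. It is not the code from the comment above (that was never posted); the names `SemQueue` and `run_demo` are made up for illustration. The point it tries to show: a semaphore counts its releases, so a wakeup issued before the consumer starts waiting is kept as a permit instead of being lost.

```python
import threading
from collections import deque


class SemQueue:
    """Hypothetical N-producer / 1-consumer queue built on a counting semaphore."""

    def __init__(self):
        self._items = deque()
        self._lock = threading.Lock()              # protects the deque
        self._available = threading.Semaphore(0)   # one permit per queued item

    def put(self, item):
        with self._lock:
            self._items.append(item)
        # The "wakeup" is stored as a permit, so it cannot be lost
        # even if the consumer is not waiting yet.
        self._available.release()

    def get(self):
        # Blocks until at least one item has been queued.
        self._available.acquire()
        with self._lock:
            return self._items.popleft()


def run_demo(n_producers=4, per_producer=100):
    """Start n_producers threads and drain everything from a single consumer."""
    q = SemQueue()

    def producer(pid):
        for i in range(per_producer):
            q.put((pid, i))

    threads = [threading.Thread(target=producer, args=(p,)) for p in range(n_producers)]
    for t in threads:
        t.start()
    received = [q.get() for _ in range(n_producers * per_producer)]
    for t in threads:
        t.join()
    return received
```

With a condition variable you need a predicate loop plus care about who holds the lock at notify time; here `put` and `get` are two lines each, which is presumably the kind of simplification the comment is describing.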
80 tok/s with a 5080 + 3090 combo is wild. I’ve been working with a 4090 and two Tenstorrent p150 cards, and manage only about 30 tok/s utilizing all three for qwen3.6 27b q8. Guess I've got more optimization to do.
Would like to see the perf of their setup with and without mtp and ngram speculative decoding though, as well as parallel decode performance (once llamacpp mtp plays well with multiple slots).
Being in California, electricity alone makes this non-competitive with just paying for a cloud, though.
That’s the cost of using a new hardware provider. A single RTX Pro 6000 Blackwell Max-Q will do better than that and be much more usable. I have 2 running DS4 Flash at 160 tok/s with max num seqs 4.
Very interesting though, these Tenstorrent chips. Might get one to experiment with.
How is the software compatibility with the Tenstorrent cards? Are you stuck using vendor-supplied runtimes/models?
It's surprising how little these things come up given the price they go for
Potential specs:
NVIDIA GeForce RTX 5080: https://flopper.io/gpu/nvidia-geforce-rtx-5080-16gb
NVIDIA GeForce RTX 3090: https://flopper.io/gpu/nvidia-geforce-rtx-3090-24gb
I just bought a $25 Chinese 2x Oculink card and two Minisforum DEG1 docks, had some spare PSUs lying around, and just installed two cards on each. It works. I saw that there is also a 4x Oculink card, but I don't know if that will work, too.
Would you mind giving these a try and letting me know how they work for you? I’d imagine you would get better results and the latter will fit on a single GPU.
https://huggingface.co/easiest-ai-shawn/Qwen3.6-27B-ExCal-EX...
https://huggingface.co/easiest-ai-shawn/Qwen3.6-27B-ExCal-Mi...
Do be sure to use dflash and/or mtp for the draft:
https://huggingface.co/turboderp/Qwen3.6-27B-MTP-exl3
https://huggingface.co/turboderp/Qwen3.6-27B-DFlash-exl3
I bought two 3080/20GB cards and one of those MACHINIST X99 mainboards as well (one with two full x16 PCIe slots). Those boards come with a Xeon CPU included (for the PCIe lane support). It set me back 800 euros total (I had a spare PSU, SSD and memory in a drawer), and now I'm also happily running Qwen 3.6 Q8 (MTP) at 80 tk/s.
Good call, I really hesitated between the X570 and the X99, are you using P2P?
$ nvidia-smi topo -p2p r
GPU0 GPU1
GPU0 X CNS
GPU1 CNS X
i guess not, i use llama.cpp with:
--spec-draft-n-max 3 --spec-type draft-mtp --split-mode tensor --tensor-split 1,1
and my (gen) tk/s are between 60-80 tk/s
will test this uncensored model and ngram added as well this weekend
btw, I also set my power limit to 220 W per card (with nvidia-smi); that will cost you around 1 tk/s but save you a LOT of power and heat :)
CNS means "chipset not supported", and I doubt that is the case. Are you sure you are using the patched nvidia module? Run modinfo nvidia to check which one is loaded.
I'm using bazzite on my ai-rig just because it has the gpu-optimized things set up (also nvidia-open). From what I can find, P2P seems to be available only for the 90 versions of the nvidia rtx gpu line, not the 80 versions, and some versions of 50xx? (apparently the 5080?). Anyway, I downloaded that uncensored model and tweaked those kv settings etc. I'm still getting 60-80 tk/s, but I'm able to get my context up to 180224 now; it used to be 131072, which gave me some trouble, so this is already a win :)
I really like Qwen 3.6 27B Q8.
On Apple Silicon, with MLX-LM, I am getting 20 tok/s on a MacBook with an M5 Max. Not sure how it compares to llama.cpp performance.
In any case, while it is noticeably slower than this Nvidia RTX setup, being able to run such models on a laptop is wild. It does heat up my laptop rapidly, though.
I can understand the joy of running things yourself, and can also see the privacy aspect. However, I pay ~3$ per 1M/tokens for that model on Openrouter, and it's not even quantized. A refurbished 3090 and a 5080 will set you back well over 2k, not to mention the electricity to run them...
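To put rough numbers on that trade-off, here is a back-of-the-envelope Python sketch. The function name `break_even_tokens` is made up, and the inputs are assumptions: the ~3 per 1M tokens and ~2k hardware figures are from the comment above, the 600 W and 80 tok/s figures are reported elsewhere in this thread, and the 0.30 per kWh electricity price is a placeholder to replace with your own. It also treats all tokens as priced the same and assumes a single currency, which is a simplification.

```python
def break_even_tokens(hardware_cost, cloud_price_per_mtok, watts, tok_per_s, kwh_price):
    """Tokens after which buying hardware matches the cloud bill.

    Returns None when local electricity alone costs more per token than the
    cloud price, i.e. the hardware never pays for itself on cost.
    All inputs are assumptions supplied by the caller.
    """
    seconds_per_mtok = 1_000_000 / tok_per_s                 # time to generate 1M tokens
    kwh_per_mtok = (watts / 1000) * seconds_per_mtok / 3600  # energy for 1M tokens
    local_per_mtok = kwh_per_mtok * kwh_price                # electricity cost per 1M tokens
    saving_per_mtok = cloud_price_per_mtok - local_per_mtok
    if saving_per_mtok <= 0:
        return None
    return hardware_cost / saving_per_mtok * 1_000_000


# Figures from the thread plus an assumed electricity price of 0.30 per kWh.
tokens = break_even_tokens(2000, 3.0, 600, 80, 0.30)
```

Under these assumptions the break-even lands at roughly 840 million tokens, which at 80 tok/s is about four months of non-stop generation. Change the electricity price or the throughput and the answer moves a lot, so treat it as a way to plug in your own numbers, not a verdict.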
> I pay ~3$ per 1M/tokens for that model on Openrouter
I think the thing is, there's an unspoken "for now" at the end of that sentence and people running this locally are hedging against that "for now". Some people prefer to feel that they own the means rather than rent the means, even if the one they own is worse than the one they can rent. Especially with today's Fable news and the harsh realisation that the "for now" is dependent on very many unpredictable factors, where the one you have locally costs you capital today and a relatively predictable run-rate (made more predictable with on-prem solar for example), but should otherwise work predictably forever.
I'm not saying that you're wrong to do what you're doing, just that many people have their own lines in the sand where renting vs buying makes sense, and it doesn't only boil down to a rational (or irrational) financial decision.
You're treating open weight inference providers the same as proprietary ones. They're fundamentally different business models. Proprietary companies have an incentive to subsidize actual inference and training costs in order to gain market share. The few dozen or so companies selling Qwen models by the token on openrouter are in a commodities market.
If suddenly the CCP declared a total digital embargo on Alibaba's Qwen models or even if for some reason all of mainland China (and Singapore) was completely unreachable from the rest of the world, the dozen or so companies selling Qwen by the token elsewhere in the world could continue business as usual.
I don’t know anything about the open weight host business model. Do we know for certain that the folks selling inference by the token are really selling them in an upfront and profitable way? No subsidies from harvesting the info, to sell to the model trainers or anything like that?
I was thinking of user-side regulations as well, not only provider-side ones. I could imagine a world where a government rules that you may not use LLMs for anything, which would be much easier to get around if you have local means.
An R9700 is $1350 and can get 100 TPS running Qwen3.6-35B-A3B Q5 with 130k context window (with room to spare) after a bit of tuning of llamacpp-vulkan, but llamacpp's repository instability and lack of real versioning frustrate me.
In terms of electricity, if you aren't using it, even with all the VRAM loaded, at most you're wasting about 30 watts or so.
Prompt processing a large uncached context is annoying, which is why I forced a lower context window, but I don't know if it's any worse in performance than the cloud models I've used.
There's a niceness, to me, knowing I don't have to rent it anymore. If you rent it, the terms can change regularly.
"An R9700 is $1350 and can get 100 TPS running Qwen3.6-35B-A3B Q5 with 130k context window ..."
How would that change (improve) if you had two R9700s in a similar configuration?
Qwen 27b is a compute heavy dense model.
I've spent the past week trying to scheme a way to get affordable local inference of something useful (Qwen3.6-35B-A3B) for ~$500 and have come to the conclusion that it simply isn't viable. A pair of power-restricted P100s in a workstation gets close, but the workstations themselves are expensive and rare as hen's teeth (not to mention loud and large). I think early '27 will be when things open up as the hardware market unclenches and further strides are made in small capable models.
I think it's important to be able to do both so you can stay in control of the price to value created relationship.
Last year, some people were publishing about aider/ollama/OpenRouter [1], and now thankfully people are publishing all over about pi/qwen/llama.cpp/openrouter. It's widespread.
[1] https://alexhans.github.io/posts/aider-with-open-router.html
I use local models to explore, hosted models to refine. I somewhat envy those who can sustain local models (q8 120b+) running as a hobby.... for me, the practical path is a better SearXNG setup and knowing my routes forward.
> not to mention the electricity to run them...
And noise.
When they declare open models a 'security risk', his setup will be running, yours will not and even that 3090 will be way outside of your reach.
It’s a personal hobby project. Why should we care that this is how someone chooses to spend their free time and money? Lots of hobbies are expensive and pointless if you think of commercially available offerings. That’s why it’s a hobby and not a small business.
You are paying with your privacy ...
Yeah but they can also be used to play games and do other stuff.
Rtx 3090 24 gb set me back 390€ a year ago ( 2nd hand)
Was it still in good condition? That price makes me wonder if it was used for crypto mining, which can wear down the hardware.
Any sane crypto miner undervolted and underclocked their GPUs for efficiency's sake; if anything, they went through less wear than, say, regular gaming.
Openrouter doesn't give you access to the model's internals, i.e. complete control of logprobs, sampler stack, any PEFTs.
Openrouter fking sucks and I don't know why people here act like it's so great. Stop using it if you care about local AI and accept that the cost you'll pay per token is higher than what you'd pay via any cloud. That's the price for privacy, control, and better quality via inference-time optimizations that otherwise aren't available.
> Openrouter doesn't give you access to the model's internals, i.e. complete control of logprobs, sampler stack, any PEFTs.
Openrouter gives you access to whatever the inference provider gives. They're just the middleman. Many providers give logprobs if you ask; it's in their API. And yeah, no PEFT or LoRA, but that's an entirely different product. And some of the inference providers do that directly.
> Openrouter fking sucks and I don't know why people here act like it's so great. Stop using it if you care about local AI
But the whole point of openrouter is that you can run models by the token and you don't have to care about local AI? Sounds like you're more upset that people aren't making the same calculation on privacy and local control vs cost and ease of use.
I would have liked to see a bit more on the theory side of things, explaining optimal weight and inference splits, actual issues with existing drivers, etc instead of what’s essentially just a recipe.
Agreed. To put this in perspective, batch 1 token decode is bandwidth limited in theory.
Memory bandwidth of RTX 3090 is listed as 936GB/s. The post isn't fully clear on which model they used and how big it is, but even assuming it perfectly filled the 24GB of that GPU, 30tok/s means the achieved bandwidth is only 720GB/s. There's a bunch of room for improvement here even without MTP, and those improvements should largely stack with MTP.
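The arithmetic behind that estimate, as a small Python sketch. The function names are made up, and the model is the usual simplification for batch-1 decode of a dense model: every generated token reads all the weights once, so tokens per second times model size approximates the achieved memory bandwidth. The 24 GB, 30 tok/s and 936 GB/s figures are the ones quoted in the comment above; KV cache reads and MoE sparsity are ignored.

```python
def implied_bandwidth_gb_s(model_gb, tok_per_s):
    """Achieved memory bandwidth if each token reads all weights once."""
    return model_gb * tok_per_s


def bandwidth_ceiling_tok_s(model_gb, peak_gb_s):
    """Upper bound on batch-1 decode speed at the listed peak bandwidth."""
    return peak_gb_s / model_gb


achieved = implied_bandwidth_gb_s(24, 30)   # 720 GB/s
ceiling = bandwidth_ceiling_tok_s(24, 936)  # 39.0 tok/s
utilization = achieved / 936                # about 0.77 of the listed peak
```

So under these assumptions a 24 GB model tops out near 39 tok/s on a 3090 without speculative decoding, and 30 tok/s corresponds to about 77% of the listed bandwidth. A smaller model than the assumed 24 GB would mean the achieved bandwidth is lower still.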
I've been using https://spark-arena.com/leaderboard to glean this kind of information for DGX Spark, a sort of recipe book. The Nvidia forum has people talking about the things you wish to know. I see some of it on Discord/Reddit/et al, but it's less cohesive.
I've switched from using the spark as a way to run one model as best it can to running several support models for the md kb I'm working on
If I had an eGPU right now, I'd 100% be using Qwen
Could 2x RTX5080 work just as well?
2xRTX5080 would be awesome. You'd only be able to run a q6, which is already pretty good, but moreover you'd be able to use P2P and run Blackwell at full speed, which I can't.
It does come with one tiny little issue: it now draws 700W on full load. Just a single 5080 is enough to measurably heat up a room when loaded (320W draw at the wall on mine), and with that amount of power flowing through, you'd better have a good PSU and check your power plugs themselves; these are going to get HOT when your entire setup is basically drawing 1kW.
I am actually surprised by the power draw. The box itself idles at 20W, which already amazes me for a Ryzen; when computing, I barely pass the 600W bar, and as I am not really using it to vibecode an entire system, I don't even notice the spikes on the power monitor (Shelly + homeassistant).
Which "good quality PCIe 4 riser" did you buy?
This one: https://es.aliexpress.com/item/1005010123289822.html?spm=a2g...