For personal stuff I use forgecode with openrouter. Firstly, forgecode is a much better harness than Claude Code (IMHO).
Anyway, regarding the models, my experience is that there is not much difference in terms of quality, but the cost difference is insane. At least for how I use agents. Yesterday's example: I am developing a small DSL for search across complex technical documents. I wanted to add a small operator to it and thought I'd give Fable a spin. It burned through 13 USD, and while it delivered the solution, it wasn't objectively better than what Deepseek v4 did for 1.7 dollars (the exact same task, because I was curious).
For full disclosure, I ask agents for piecemeal stuff. Like in the DSL case, I designed the operators and then asked agents to implement them one by one. Probably if I asked them to design the whole thing starting from these complex documents, Fable would shine, but every time I try to give agents broader-scope tasks they burn through millions of tokens and generate questionable code, which I then have to spend time familiarizing myself with.
These models have open weights, but at the moment most flagship models are practically accessible only through third-party model providers. The main exception is models in the ~30B parameter range, which can still be run on consumer-grade GPUs. That said, even consumer GPUs have become increasingly expensive and difficult to justify in recent years.
I have been using deepseek v4 flash as my main model for everything ever since dwarf star came out. I run it on my M4 Max MacBook Pro with 128gb of memory. I run it usually as a server and connect to it over tailscale with my coding machine and use the Pi coding agent. It’s a big leap over using the Qwen models though it doesn’t have vision - so I still will run those when I use vision. GLM 4.7 flash was my previous go to for coding but I’ve completely switched to deepseek for all non-vision things.
I use glm5.1 plus pi with a few customized skills and am very happy with it. I hadn’t touched my Claude 5x plan for a couple of weeks but opened it back up in Claude code when fable was released and did a few tasks and still was happy to return to glm/pi.
I keep trying to switch to the Chinese models, but I keep finding myself asking Claude to fix their outputs. (Both functionality and style.) So I always end up switching back.[0]
I also keep trying GPT, which is quite solid. Very fast, great at debugging. But its code is often overly clever and hurts my brain.
(Maybe fixable with prompting. I tried and it helped the Chinese ones a bit. Just tell them to be elegant, like in the old image AI days "+good -bad"!)
For now I do still need my human brain to actually be able to make sense of the stuff, and Claude is the only one that consistently meets that requirement.
But I am hoping that one of these days, one of the Chinese labs figures out the special sauce :)
--
[0] (For smallish edits, though, I am having a great time with DeepSeek Flash. Practically unlimited AI on tap! How cool is that.)
Reading their modified license terms cracks me up, because they've basically remade the MIT license as MIT plus the one clause that BSD used to have, which didn't care about MAU or revenue: if you used it in a product, they basically asked you to 'advertise' them. Honestly, it's a reasonable request.
This is the cursor callout.
Don't make us shame you into disclosure
Ah is that what it is? I don't use Cursor, never saw it as being relevant to me, but would not surprise me.
I think deepseek has crossed the threshold for being on par with opus 4.6 and kimi is doing a great job in shipping velocity.
Personally, when I use open code or routers, I feel that beyond a certain level, the models don't make a huge difference to me. Except for expensive and mediocre models like Gemini. In that sense, Chinese models are pretty good. I usually write code in function or method units and then design and assemble them together.
GPT series models are more thorough and better, but I'm not sure if the difference is enormous. It seems to depend on the workflow, but in my opinion, if you are thorough enough, I wonder if there really is a big difference
In my experience, there's little difference between frontier models and SotA ~30B param models when it comes to implementing individual functions.
Once you have a coherent design (the hard part), you can feed it to a pretty small model and get basically the same quality.
They'll not one-shot, but they're faster and cheaper, so it still works out in your favor.
Plus you can do it locally...
I have a similar experience. However, when including code review, I think the GPT model is the most impressive
I really hope we stop using the term "Chinese models". It has an air of negative connotation. It's the equivalent of calling cars Japanese, which people used to do but which is now almost entirely meaningless. You just call them Toyota, Honda, Lexus etc.
You are right. I agree
I would really love to know if anyone has any experience with something like opencode + Kimi K2.6/2.7 now compared to Claude Code. What is better, what is worse, what is the cost comparison. I am currently paying $100 for the 5x Max plan, but Fable is running through the usage limits quite drastically and I cannot really say it's night and day compared to Opus. Also, I use this mostly for my side projects, so the $100 bill is quite noticeable. I definitely don't want to pay more.
I do have this experience. I've used Claude Code (with Opus mostly), and then switched to opencode (mostly with Kimi 2.6) for my personal projects; it's based on a couple months of use.
Claude Code is better. But Opencode + kimi 2.6 is workable, which is big. For bare code writing, if you know what exactly you want, most popular models are fine (deepseek, kimi, etc), it feels more or less the same as anthropic models.
At the same time, Opus seems to understand my intent way better than e.g. deepseek. I need to be much more precise with my prompts when using deepseek - it often goes in a wrong direction if I'm lazy. This results in a workflow which feels quite a lot different from Claude Code.
Kimi is in between - for me it brings back "lazy prompting" workflow, and I can trust its plans more than deepseek. It enables a workflow similar to Claude Code, it's workable, but it is a bit worse everywhere. Smaller context, a bit more errors, decisions are a bit worse, recommendations are a bit worse, debugging capabilities are a bit worse, etc.
On the usage side, the $100 Claude plan is actually great value. On paper, per-token kimi is way cheaper, but Claude subscriptions are heavily subsidized - you get many more tokens than $100 can buy you. So, in the end, opencode + kimi vs claude code could be of a similar cost, for similar usage patterns. Deepseek can be cheaper, and it has insanely cheap cached tokens, but experience may vary - depending on your habits, you may need to adjust how you work, coming from claude code.
I'd say for side projects something like $10 Opencode Go plan + $10 of extra DeepSeek v4 credits (e.g. on OpenRouter) can be very workable.
>At the same time, Opus seems to understand my intent way better than e.g. deepseek. I need to be much more precise with my prompts when using deepseek - it often goes in a wrong direction if I'm lazy. This results in a workflow which feels quite a lot different from Claude Code.
how much of that is Opus injecting prior conversations from memory?
Almost none of it, if you're using Claude Code. Until recently Claude only had the option of retaining memory across conversations for the desktop app.
I almost never use the desktop app, I have maybe 2-3 conversations over the last year that have nothing to do with my job. Opus (and now Fable) genuinely do seem to "understand" what you intend based off what you're explaining a lot better than other models I've tried.
Gemini gets close in some cases, but it falls over in the actual implementation sometimes. I haven't tried Kimi yet but MiMo isn't too shabby either.
I use Claude at work and Kimi for side projects. My org has LiteLLM and Kimi 2.5 enabled but it rarely works, so Claude and GPT are my main tools. I actually enjoy Kimi more as it feels like a dev in a job interview. Watching it reason through problems is a lot like how I tend to explain things during whiteboarding sessions. The number of times it says "wait" is just funny. Claude on the other hand is much more like an employee (or team of employees) that already knows they have the job. It doesn't do a ton of explanation up front. (You can dig into processes if you want.) It just goes along, asking questions only when it needs to... and then delivers a comprehensive report or plan. OpenCode is a better harness. I don't have a direct comparison on costs, as I haven't tried to do the exact same prompt on both models. I can say that I recently had Kimi generate a wrapper around libpq for the ZenC programming language: https://github.com/nobleach/zenc-postgres and it took about an hour or so and cost around 4 dollars.
For some reason I never had a good experience with Kimi (via OpenRouter) in OpenCode. It would only take a few turns for it to run off and mess something up. Terrible instruction following I’d say.
I use DeepSeek V4 Pro now, which works pretty well.
I am extremely happy with ohmypi, but you could use OpenCode or just keep using Claude Code!
DeepSeek-V4-Pro is adequate; use DS4-Flash for tasks or other small activities you’d use Haiku or Sonnet for. Go sign up with $10 prepaid.
OpenCode Go - go sign up with $5 for a month and use Qwen-3.7-Max for design/plan/architecture or difficult troubleshooting. Feels closer to Opus 3.6 or 3.7 than DeepSeek, closest I’ve found.
OpenAI Codex, $20 a month plan, use GPT-5.5 via API for the same design/plan/architecture/troubleshooting/author commits. (You can also pay $100 and cut and paste really difficult problems into chat with the GPT-5.5-Pro model.)
Xiaomi MiMo-2.5-Pro, find a friend to give you a $2 referral code, you get 72 cents free. Same pricing as DeepSeek. Somewhere between Sonnet and Opus, quite capable. Apply for the UltraSpeed beta too.
You can switch in and out from these models on the fly in OpenCode or ohmypi and simply find the one that feels best to you. I use CodexBar to watch consumption in near real time.
For a casual user or someone new to programming, Cursor’s $20 plan is an excellent start with Composer-2.5 and Composer-2.5-Fast. You get an API allowance too you can use to access Opus-4.x or GPT-5.5-Pro from OpenCode or ohmypi in addition to Cursor itself.
Finally, if you use Grok or Twitter, SuperGrok at $30 a month has a good vision model, which I used for automated testing of front ends. I’m migrating to locally-run Qwen-3-VL on a commodity Mac, though. If you’re less technical unreach makes hosting local models on a Mac easy.
If you have a powerful GPU like an RTX 5090, try Qwen-3.6 locally on that too. Use ollama or llama-swap which is fairly easy to use.
I have not tried new Kimi yet but we have been able to keep our costs at or below $200 a month per employee with a team of 3 professional developers, 1 graphic designer who uses a lot of Midjourney and Grok Imagine now driven from workflows she made herself in ohmypi, and 1 nontechnical user (account manager / project manager) who uses ohmypi to help her gather requirements and track implementation of them. With a tiny bit of effort we could get that number closer to $75 per employee per month.
I can only talk about GLM 5.1 which is roughly at sonnet 4 levels imo.
It's good and does most tasks I throw at it well, but it will fail at anything cognitive/complex. It gets stuck often. It costs ~$6 a month though.
This was my experience using GLM 5.1 in Claude Code but it works far better in OpenCode, I’d really like to understand why. I think it’s a bit stronger than Sonnet 4.6.
I use the oh-my-openagent planning system and haven’t used vanilla OpenCode enough to know how much that is contributing.
The answer is easy, CC is bug for bug optimized for Anthropic models. They don't even test it with other models, let alone provide support for all small compatibility quirks of different provider implementations.
On the other hand, Opencode, Pi agent and other open source tools offer much better support for all models, including open source ones.
The Kimi problem is it doesn’t follow instructions and goes off track often.
Other than that it’s pretty decent (for the price).
Sounds like it was distilled from Claude. I don't understand the appeal of an agent that does whatever it wants.
If you ask Claude in Chinese to introduce itself, it will claim it's Kimi :)
This. It will try to fix and refactor things that don’t need fixing because it gets stuck trying to solve the problem at hand.
I think there is some threshold after which "best" model doesn't matter, we are not that far from it. Fable now is really good, in a year or so, if Kimi catches up, even if Fable6 is much better, I think I will use kimi at 1/10th of the price.
I said that about opus 4.5 at the time, thinking "this is so good, in 6-12 months the Chinese models will be as good and cheap, I will use them", but I was wrong.. I pay premium for opus4.7/8 and Fable.
But at some point, it will just do the thing you want it to do, and then the race to the bottom will start.
Now that Chinese companies have access to some very good Fable tokens, I hope it speeds up the race.
price/token isn't the only thing relevant. if you have to ask the AI again, it'll cost you more than when it gets things right in the first place.
so better models may still be cheaper even if the price per token is higher.
yes, that is my point, but at some point, better is unmeasurable, and both the better and the not-as-good produce similar result, and then you pick the one with 1/10th of the price
Depending on who you are and how you use these models, we're already at this point
I was wondering how Anthropic and the like stay competitive when Opus ($5 / $25) is roughly 7x more expensive than Kimi K2.6 ($0.7 / $3.4) or other Chinese models, while being only marginally better.
My theory is that US enterprise just can't send data to Chinese and that's understandable, but is that "the moat"?
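For reference, the per-token gap implied by the list prices quoted in this comment can be checked directly. A quick sketch using only those four figures, assumed here to be USD per 1M input / output tokens:

```python
# List prices quoted in the comment above, as (input, output) in USD per 1M tokens.
opus = (5.00, 25.00)
kimi_k26 = (0.70, 3.40)


def price_ratio(expensive, cheap):
    """Return (input_ratio, output_ratio) for two (input, output) price pairs."""
    return expensive[0] / cheap[0], expensive[1] / cheap[1]


input_ratio, output_ratio = price_ratio(opus, kimi_k26)
print(f"input: {input_ratio:.1f}x, output: {output_ratio:.1f}x")
# prints "input: 7.1x, output: 7.4x"
```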
API token price is one thing, but subscriptions on Claude are a good value. Weirdly everyone says that Claude subscriptions are subsidized because of the API price, even though (1) no one actually knows Claude's cost of inference, and (2) Chinese providers are also able to provide cheap inference, so why do they think Claude can't?
I also wonder if Enterprises have deals for other API pricing that is not posted publicly, so all we see is a high API sticker price.
The moat right now is model performance and what that means for how many tokens and additional time you spend.
I say this as a relatively frequent user of Kimi models and generally a big fan. But on not-yet-gamed benchmarks like DeepSWE, Kimi K2.6 is beaten soundly by Claude Sonnet 4.6 ($3 / $15) and even slightly by GPT 5.4 Mini ($0.75 / $4.50).
There's no question Kimi models are very good for a lot of code tasks. They're the best quality open weight model. But to get similar overall outcomes as on Sonnet/Opus, on average you'll spend many more tokens and will have to do more managing of the model. You shouldn't look at price per token, you should look at how much you pay for the entire process.
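One way to make "how much you pay for the entire process" concrete is to fold retries and extra token usage into an expected cost per finished task. A minimal sketch, assuming independent retries: the prices are the list prices quoted earlier in the thread, but the success rates, token counts, and token overhead are made-up placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    input_price: float     # USD per 1M input tokens (quoted in the thread)
    output_price: float    # USD per 1M output tokens (quoted in the thread)
    success_rate: float    # HYPOTHETICAL: chance that one attempt is acceptable
    token_overhead: float  # HYPOTHETICAL: token multiplier relative to the baseline


def cost_per_attempt(m: Model, tokens_in: int, tokens_out: int) -> float:
    """USD for one attempt, scaling the baseline token counts by the model's overhead."""
    raw = tokens_in * m.input_price + tokens_out * m.output_price
    return m.token_overhead * raw / 1_000_000


def expected_cost_per_task(m: Model, tokens_in: int, tokens_out: int) -> float:
    """Expected USD spent until one attempt succeeds.

    With independent attempts the number of tries is geometric,
    so the expected number of attempts is 1 / success_rate.
    """
    return cost_per_attempt(m, tokens_in, tokens_out) / m.success_rate


opus = Model("Opus", 5.00, 25.00, success_rate=0.9, token_overhead=1.0)
kimi = Model("Kimi K2.6", 0.70, 3.40, success_rate=0.5, token_overhead=2.0)

for m in (opus, kimi):
    # 200k input / 20k output tokens per attempt is an arbitrary placeholder.
    cost = expected_cost_per_task(m, 200_000, 20_000)
    print(f"{m.name}: ${cost:.2f} per finished task")
```

With these placeholder inputs the cheaper model still comes out ahead on dollars; the ordering only flips once the combined penalty from retries and extra tokens exceeds the roughly 7x per-token price gap. The sketch deliberately leaves out the human time spent reviewing and re-prompting.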
I'm more interested in how much effort I have to put in, at least while I'm paying in the range of current subscriptions (so ~€100-€200 a month or so). If the prices go up much more than that I'll have to switch to caring more about token efficiency. But at current pricing the bottleneck is my attention, not model efficiency. As such, even a small improvement in model quality - and hence, a decrease in how much attention I have to spend on it - makes a big difference.
I'm not sure I would put too much weight on DeepSWE as a benchmark, given that GPT-5.4-mini ended up close to Opus 4.6 there.
Any benchmark is iffy and has weird results, but this is the best we got at the moment. Most people working with Opus and Kimi would likely tell you they're much further apart than the numbers that were quoted for Kimi K2.6, and DeepSWE seems to capture that gap better.
One major thing DeepSWE has going for it is that all other benchmarks quoted by MoonshotAI on this page are benchmarks that are gamed. The benchmark answers are public and part of each model's training data. This benchmark may still be iffy, but at least it's not gamed.
I think the perception is that it is not 'only marginally better'; whether or not you specifically agree, that perceived quality gap lets them differentiate on price.
I'd further say that there are probably enough rational actors running evals out there that the marginally better is not pure vibes for the cases where people are spending lots of money, but I only have direct line of sight to some of those eval suites. Maybe everyone is irrational and anthropic is exploiting that!
I think most people who've tried them both would tell you Anthropic's models are more than marginally better than Kimi. Kimi and the other open source models may score well on SWE-bench or whatever but the gap is noticeable IMHO once you actually try to use them.
I reckon right now the Enterprise concern is more FOMO around the AI wave and how to retrain or replace up to hundreds of thousands of employees. I don't think cost is the main concern right now.
But if AI doesn't quickly lead to large-scale replacement of workers as promised, I could definitely see the C-suites and their gaggle of consultants starting to ask questions about token pricing.
I think none of them having a de facto, high-quality, English-focused CLI is a big part of it. None of the Chinese models I've tried have worked well in open source CLIs. Granted, I've only tried a few, but still...
i use github copilot cli + openrouter + qwen 3.7 max and it's really much better than i expected (used to opus 4.7 at work)
I want Opus to be only marginally better, but I do mostly research engineering and its ability to not fuck up my projects is absent. Every time my credits lapse I let kimi and composer2.5 have some play and it’s basically just an excuse for me to keep playing computer, because when the oai/ant credits refresh I always need to spend hours recovering from the other models' misconceptions or boneheaded eng practices. Even when I only let them touch my web games…
> My theory is that US enterprise just can't send data to Chinese
Lots of US providers are hosting these “open source” models so doubt that’s the problem.
I think any new model that isn't demonstrably 20-30% more capable than Deepseek v4, yet is priced above Deepseek per token, is almost automatically relegated to a low-use model (maybe for planning).
Is Deepseek just eating cost or are people able to host their open models for comparable costs?
If openrouter is to be trusted, the cheapest offers that are not from Deepseek itself are:
- twice as expensive on the output (1.52 vs 0.87)
- six times as expensive on the input (0.33 vs 0.05)
https://openrouter.ai/deepseek/deepseek-v4-pro?sort=price#pr...
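For anyone who wants the exact multipliers, here is a quick sketch using only the four prices quoted above. The units are an assumption on my part (OpenRouter usually lists USD per million tokens), and the variable and function names are made up for illustration.

```python
# Prices as quoted in the comment above.
# Assumption: units are USD per million tokens.
deepseek = {"input": 0.05, "output": 0.87}
cheapest_third_party = {"input": 0.33, "output": 1.52}


def price_ratios(other, baseline):
    """Return how many times more expensive `other` is than `baseline`."""
    return {kind: other[kind] / baseline[kind] for kind in baseline}


ratios = price_ratios(cheapest_third_party, deepseek)
for kind, ratio in ratios.items():
    print(f"{kind}: {ratio:.2f}x")
```

With those numbers the input side comes out at about 6.6x and the output side at about 1.75x, so "twice" is rounding up a little.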
Other people are hosting it in the same order of magnitude. Xiaomi recently matched DeepSeek’s pricing.
They focused on caching and other optimizations.
Likely CCP-subsidized
This maps to what I'm seeing in practice. The gap between demo and production is consistently underestimated, especially around error handling and edge cases.
I am still very new to the open-weight/source models. If anyone is using them full-time, I’d really love to hear about the setup and how they perform, as I am considering moving my org off Anthropic products.
Anecdotal, but here's my experience.
For personal stuff I use forgecode with openrouter. Firstly, forgecode is a much better harness than Claude Code (IMHO).
Anyway, regarding the models, my experience is that there is not much difference in terms of quality, but the cost difference is insane. At least for how I use agents. Yesterday's example is the following: I am developing a small DSL for search across complex technical documents. I wanted to add a small operator to it and thought I'd give Fable a spin. It burned through 13 USD, and while it delivered the solution, it wasn't objectively better than what Deepseek v4 did for 1.7 dollars (same exact task, because I was curious).
For full disclosure, I ask agents for piecemeal stuff. Like in the DSL case, I designed the operators and then asked agents to implement them one by one. Probably if I asked it to design the whole thing starting from these complex documents, Fable would shine, but every time I try to give agents broader-scope tasks they burn through millions of tokens and generate questionable code, which I then have to spend time familiarizing myself with.
These models have open weights, but at the moment most flagship models are practically accessible only through third-party model providers. The main exception is models in the ~30B parameter range, which can still be run on consumer-grade GPUs. That said, even consumer GPUs have become increasingly expensive and difficult to justify in recent years.
You can definitely go above 30B on consumer hardware – 2x gpus, spark, mac, half byte quants etc.
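A rough back-of-envelope for why quantization moves the ceiling: the weights alone take roughly parameter count times bits per weight, divided by 8, in bytes. The sketch below is only that estimate. It ignores the KV cache, activations, and the per-block scale overhead real quant formats add, and the function name is hypothetical.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes).

    Ignores KV cache, activations, and quantization metadata, so real
    memory use will be higher than this number.
    """
    return params_billion * bits_per_weight / 8


for bits in (16, 8, 4):
    print(f"30B at {bits}-bit: ~{weight_memory_gb(30, bits):.0f} GB")
```

So a 30B model is about 60 GB at 16-bit but about 15 GB at 4-bit ("half byte"), and by the same arithmetic a 70B model at 4-bit is around 35 GB of weights, which is where the 2x GPU or large unified-memory setups come in.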
I have been using deepseek v4 flash as my main model for everything ever since dwarf star came out. I run it on my M4 Max MacBook Pro with 128gb of memory. I run it usually as a server and connect to it over tailscale with my coding machine and use the Pi coding agent. It’s a big leap over using the Qwen models though it doesn’t have vision - so I still will run those when I use vision. GLM 4.7 flash was my previous go to for coding but I’ve completely switched to deepseek for all non-vision things.
Qwen 3.6 seems to be the strongest local model; it works OK on an RTX 5090 or a > 32GB Mac.
I use glm5.1 plus pi with a few customized skills and am very happy with it. I hadn’t touched my Claude 5x plan for a couple of weeks but opened it back up in Claude code when fable was released and did a few tasks and still was happy to return to glm/pi.
Better than Qwen3.6-35B-A3B-8bit ?
When I tried glm I found it way, way slower (omlx as runtime)
I keep trying to switch to the Chinese models, but I keep finding myself asking Claude to fix their outputs. (Both functionality and style.) So I always end up switching back.[0]
I also keep trying GPT, which is quite solid. Very fast, great at debugging. But its code is often overly clever and hurts my brain.
(Maybe fixable with prompting. I tried and it helped the Chinese ones a bit. Just tell them to be elegant, like in the old image AI days "+good -bad"!)
For now I do still need my human brain to actually be able to make sense of the stuff, and Claude is the only one that consistently meets that requirement.
But I am hoping that one of these days, one of the Chinese labs figures out the special sauce :)
--
[0] (For smallish edits, though, I am having a great time with DeepSeek Flash. Practically unlimited AI on tap! How cool is that.)
Benchmark geometric mean
- GPT-5.5: 62.7%
- Opus 4.8: 62.2%
- Kimi K2.7 Code: 56.3%
- Kimi K2.6: 48.2%
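In case the "geometric mean" part is unclear: it is the n-th root of the product of the per-benchmark scores, so one weak benchmark drags the aggregate down more than it would in a plain average. A minimal sketch follows; the two scores are made-up illustration values, not anything from the numbers above.

```python
from statistics import geometric_mean, mean

# Hypothetical per-benchmark scores in percent, purely for illustration.
# These are NOT real results for any model listed above.
scores = [25.0, 100.0]

geo = geometric_mean(scores)   # (25 * 100) ** (1 / 2)
arith = mean(scores)           # (25 + 100) / 2

print(f"geometric mean:  {geo:.1f}%")
print(f"arithmetic mean: {arith:.1f}%")
```

Here the geometric mean is 50% while the arithmetic mean is 62.5%, which is the usual reason for picking it when aggregating across benchmarks with very different score ranges.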
How is 2.7 a thing _now_ ? it's not even mentioned on moonshot's webpage..
It's not 2.7. It's 2.7-Code, and it's 2.6 token-optimised for coding.
https://platform.kimi.ai/docs/guide/kimi-k2-7-code-quickstar...
insanely great!