Recently after noticing how quickly limits are consumed and reading others complaints about same issue on reddit I was wondering how much about this is real error or bug hidden somewhere and how much it's about testing what threshold of constraining limits will be tolerated without cancelling accounts. Eventually, in case of "shit hits the fan" situation it can be always dismissed by waving hands and apologizing (or not) about some abstract "bug".
The lack of transparency and accountability behind all of this is incredible in my perception.
This feels a lot like the same playbook we’re seeing with dynamic pricing in retail, just applied to compute instead of products. You never really know what you’re getting, and the rules shift under you.
What makes it worse is the lack of transparency. If there were clear, hard limits, people could plan around it. Instead it’s this moving target that makes it impossible to trust for real work.
At some point it stops feeling like a bug and starts feeling like a pricing experiment on users.
Over reliance on LLMs is going to become such a disaster in a way no one would have thought possible. Not sure exactly what, who, when, or where.. Just that having your entire product or repo dependent on a single entity is going to lead to some bad times…
For a second I hoped you were gonna comment on how LLMs are going to rot out our skillset and our brains. Like some people already complaining they "have to think" when ChatGPT or Claude or Grok is down.
There's so many different models, from hosted to local and there's almost no switching cost as most of them are even api compatible or supported by one of the gateways (Bifrost, LiteLLM,...).
There's many things to worry about but which LLM provider you choose doesn't really lock you in right now.
Contrary to the popular opinion here, there are other services beyond Claude Code. These usage limits might even prompt (har har) people to notice that Gemini is cheaper and often better.
I don't get this pov, maybe b/c I'm not a heavy Claude Code user, just a dabbler. Any LLM tool that can selectively use part of a code base as part of the input prompt will be useful as an augmentation tool.
Note the word "any." Like cloud services there will be unique aspects of a tool, but just like cloud svc there is a shared basic value proposition allows for migration from one to another and competition among them. If Gemini or OpenAI or Ollama running locally becomes a better choice, I'll switch without a care.
Subscription sprawl is likely the more pressing issue (just remembered I should stop my GH CoPilot subscription since switching to Claude).
I find Claude code to be a token hog. No matter how confidently the papers say context rot is not an issue I find curating context to be highly important to output quality. Manually managing this in the Claude Webui has helped with my use cases more than freely tossing Claude code at it. Likely I am using both "wrong" but the way I use it is easier for me to reason about and minimize context rot.
One reddit user reverse engineered the binary and found that it was a cache invalidation issue.
They are doing some hidden string replacement if the claude code conversation talks about billing or tokens. Looks like that invalidates the cache at that point.
If that string appears anywhere in the conversation history, I think the starting text is replaced, your entire cache rebuilds from scratch.
Anecdotally when Claude was error 500'ing a few days ago, its retries would never succeed, but cancelling and retrying manually worked most of the time.
I cancelled my pro plan last month. I was using Claude as my daily driver. In fact had the API plan also and topped it with $20 more. So it was around $40 each month. Starting from December last year it has been like this. When sessions could last a couple of hours with some deep boilerplate and db queries etc. to architecture discussion and tool selection. Slowly the last two months it just gets over. One prompt and few discussions as to why this and not that and it is done.
This has been verified as a bug. Naturally, people should see some refunds or discounts, but I expect there won't be anything for you unless you make a stink.
Yesterday (pro plan) I ran one small conversation in which Claude did one set of three web searches, a very small conversation with no web search, and I added a single prompt to an existing long conversation. I was shocked to see after the last prompt that I had somehow hit my limit until 5:00pm. This account is not connected to an IDE or Code, super confusing.
There's a weird 'token anxiety' you get on these platforms. And you basically don't know how much of this 'limit' you may consume at any time. And you actually don't even know what the 'limit' is or how it's calculated. So far, people have just assumed Anthropic will do the kind thing and give you more than you could ever use...
This reminds me of the early days of cell phones. Limits everywhere and you paid for it by the kilobyte. Think at one point I was paying 45c per text message. I hope this gets better and we do not need gigawatt datacenters to do this stuff.
I'm finishing my annual paid Pro Gemini plan, so I'm on the free plan for Claude and I asked one (1) single question, which admittedly was about a research plan, using the Sonnet 4.6 Extended thinking model and instantly hit my limit until 2 PM (it was around 8 or 9 AM).
Just a shockingly constrained service tier right now.
claude automatically enabled "extra usage" on my pro account for me (I had it disabled) and the total got to $49 extra before I noticed. I sent an email asking wtf but I don't expect much.
I'm guessing their newer models are taking way more compute than they can afford to give away. The biggest challenge of AI will eventually be, how to bring down how much compute a powerful model takes. I hope Claude puts more emphasis into making Haiku and Sonnet better, when I use them via JetBrains AI it feels like only Opus is good enough, for whatever odd reason.
I get the same. Work has shifted to being agentic first - and whenever I use anything other than Claude Opus it seems that the model easily gets lost spinning its wheels on even the simplest query - especially with some of our more complex codebases, whereas Opus manages to not only reason adequately about the codebase, but also can produce decent quality code/tests in fairly short order.
Oddly though, when using at home I'm using Sonnet via the standard chat interface and that, whilst it will produce substandard code in its output is still reasonably capable - even in more niche tasks. Granted though that my personal projects are far simpler than the codebase I handle at work.
Anthropic went about this in a really dishonest way. They had increased demand, fine, but their response was to ban third-party clients (clients they were fine with before), and to semi-quietly reduce limits while keeping the price the same.
Unilaterally changing the deal to give customers less for the same price should not be legal, but companies have slowly boiled the frog in such a way that now we just go "welp, it's corporations, what can you do", and forget that we actually used to have some semblance of justice in the olden days.
When asking it to write a http library which can decode/parse/encode all three versions of it the usage limit of the day gets hit with one sentence. In the pro plan. Even when you hand it a library which does hpack/huffmann.
i just refuse to use openai/google/anthropic subscriptions, i only use open source models with ZDR tokens.
- i like privacy in my work, and i share when i wish. somehow we accepted that our prompts and work may be read and moderated by employees. would you accept people moderating what you write in excel, google docs, apple pages?
- i want a consistent tool, not something that is quantised one day, slow one day, a different harness one day, stops randomly.
- unless i am missing something, the closed source models are too slow for me to watch what they are doing. i feel comfortable with monitoring something, usually at about 200-300tps on GLM 5. above that it might even be too fast!
Its a question of price, quality and other factors.
If my company pays for it, i do not care.
If i have a hobby project were it is about converting an idea in my spare time in what i want, i'm happily paying 20$. I just did something like this on the weekend over a few hours. I really enjoy having small tools based on single html page with javascript and json as a data store (i ask it to also add an import/export feature so i can literaly edit it in the app and then save it and commit it).
For the main agent i'm waiting for like the one which will read my emails and will have access tos ystems? I would love a local setup but just buying some hardware today costs still a grant and a lot of energy. Its still sign cheaper to just use a subscription.
Not sure what you mean though regarding speed, they are super fast. I do not have a setup at home which can run 200-300 tps.
You are not crazy, you are just waking up from the SaaS delusion. We somehow allowed the industry to convince us that paying $20/month to rent volatile compute, have our proprietary workflows surveilled, and get throttled mid-thought is an 'upgrade'. The pendulum is swinging violently back to local-native tools. Deterministic, privately owned, unmetered—buying your execution layer instead of renting it is the only way to build actual leverage.
i would recommend getting an API account on fireworks, this is ZDR and typically the fastest provider.
otherwise check the list of providers on openrouter and you can see the pricing, quantisation, sign up directly rather than via a router. ensure to get caching prices, do not get input/output API prices.
GLM 5 is a frontier model, Kimi 2.5 is similar with vision support, Minimax M2.7 is a very capable model focused on tool calling.
If you need server side web search, you could use the Z AI API directly, again ZDR; or Friendli AI; or just install a search mcp.
For the harness opencode is the normal one, it has subagents and parallel tool calling; or just use claude code by pointing it at the anthropic APIs of various providers like fireworks.
If you want to still use APIs, I like OpenRouter because I can use the same credits across various models, so I'm not stuck with a single family of models. (Actually, you can even use the proprietary models on OpenRouter, but they're eye-wateringly expensive.)
Otherwise you should look into running e.g. Qwen3.5-35B-A3B or Qwen3.5-27B on your own computer. They're not Opus-level but from what I've heard they're capable for smaller tasks. llama.cpp works well for inference; it works well on both CPU and GPUs and even split across both if you want.
Recently after noticing how quickly limits are consumed and reading others complaints about same issue on reddit I was wondering how much about this is real error or bug hidden somewhere and how much it's about testing what threshold of constraining limits will be tolerated without cancelling accounts. Eventually, in case of "shit hits the fan" situation it can be always dismissed by waving hands and apologizing (or not) about some abstract "bug".
The lack of transparency and accountability behind all of this is incredible in my perception.
This feels a lot like the same playbook we’re seeing with dynamic pricing in retail, just applied to compute instead of products. You never really know what you’re getting, and the rules shift under you.
What makes it worse is the lack of transparency. If there were clear, hard limits, people could plan around it. Instead it’s this moving target that makes it impossible to trust for real work.
At some point it stops feeling like a bug and starts feeling like a pricing experiment on users.
Over reliance on LLMs is going to become such a disaster in a way no one would have thought possible. Not sure exactly what, who, when, or where.. Just that having your entire product or repo dependent on a single entity is going to lead to some bad times…
For a second I hoped you were gonna comment on how LLMs are going to rot out our skillset and our brains. Like some people already complaining they "have to think" when ChatGPT or Claude or Grok is down.
Oh well.
There's so many different models, from hosted to local and there's almost no switching cost as most of them are even api compatible or supported by one of the gateways (Bifrost, LiteLLM,...).
There's many things to worry about but which LLM provider you choose doesn't really lock you in right now.
> on a single entity
Contrary to the popular opinion here, there are other services beyond Claude Code. These usage limits might even prompt (har har) people to notice that Gemini is cheaper and often better.
I don't get this pov, maybe b/c I'm not a heavy Claude Code user, just a dabbler. Any LLM tool that can selectively use part of a code base as part of the input prompt will be useful as an augmentation tool.
Note the word "any." Like cloud services there will be unique aspects of a tool, but just like cloud svc there is a shared basic value proposition allows for migration from one to another and competition among them. If Gemini or OpenAI or Ollama running locally becomes a better choice, I'll switch without a care.
Subscription sprawl is likely the more pressing issue (just remembered I should stop my GH CoPilot subscription since switching to Claude).
So, like, GitHub then?
How can automatic slop-prevention be a disaster? It's a feature.
I find Claude code to be a token hog. No matter how confidently the papers say context rot is not an issue I find curating context to be highly important to output quality. Manually managing this in the Claude Webui has helped with my use cases more than freely tossing Claude code at it. Likely I am using both "wrong" but the way I use it is easier for me to reason about and minimize context rot.
This turned out to be a bug. https://x.com/om_patel5/status/2038754906715066444?s=20
One reddit user reverse engineered the binary and found that it was a cache invalidation issue.
They are doing some hidden string replacement if the claude code conversation talks about billing or tokens. Looks like that invalidates the cache at that point.
If that string appears anywhere in the conversation history, I think the starting text is replaced, your entire cache rebuilds from scratch.
So, nothing devious, just a bug.
https://xcancel.com/om_patel5/status/2038754906715066444
Nothing devious, but is Anthropic crediting users? In a sense, this is _like_ stealing from your customer, if they paid for something they never got.
That bug would only affect a conversation where that magic string is mentioned, which shouldn't be common.
Anecdotally when Claude was error 500'ing a few days ago, its retries would never succeed, but cancelling and retrying manually worked most of the time.
I cancelled my pro plan last month. I was using Claude as my daily driver. In fact had the API plan also and topped it with $20 more. So it was around $40 each month. Starting from December last year it has been like this. When sessions could last a couple of hours with some deep boilerplate and db queries etc. to architecture discussion and tool selection. Slowly the last two months it just gets over. One prompt and few discussions as to why this and not that and it is done.
After they force OpenCode to remove their Claude integration, and the insane token hogging, I also cancelled my subscription.
This has been verified as a bug. Naturally, people should see some refunds or discounts, but I expect there won't be anything for you unless you make a stink.
https://old.reddit.com/r/ClaudeCode/comments/1s7zg7h/investi...
I asked it to complete ONE task:
You've hit your limit · resets 2am (America/Los_Angeles)
I waited until the next day to ask it to do it again, and then:
You've hit your limit · resets 1pm (America/Los_Angeles)
At which point I just gave up
If this is reasonable or not is pretty hard to judge without any info on that "ONE" task.
Yesterday (pro plan) I ran one small conversation in which Claude did one set of three web searches, a very small conversation with no web search, and I added a single prompt to an existing long conversation. I was shocked to see after the last prompt that I had somehow hit my limit until 5:00pm. This account is not connected to an IDE or Code, super confusing.
Tool calls (particularly fetching for context) eats the context window heavily. I explicitly send MCP calls to sub agents because they are so “wordy”.
There's a weird 'token anxiety' you get on these platforms. And you basically don't know how much of this 'limit' you may consume at any time. And you actually don't even know what the 'limit' is or how it's calculated. So far, people have just assumed Anthropic will do the kind thing and give you more than you could ever use...
This reminds me of the early days of cell phones. Limits everywhere and you paid for it by the kilobyte. Think at one point I was paying 45c per text message. I hope this gets better and we do not need gigawatt datacenters to do this stuff.
I'm finishing my annual paid Pro Gemini plan, so I'm on the free plan for Claude and I asked one (1) single question, which admittedly was about a research plan, using the Sonnet 4.6 Extended thinking model and instantly hit my limit until 2 PM (it was around 8 or 9 AM).
Just a shockingly constrained service tier right now.
Free is free. Want more, fork over money.
claude automatically enabled "extra usage" on my pro account for me (I had it disabled) and the total got to $49 extra before I noticed. I sent an email asking wtf but I don't expect much.
I'm guessing their newer models are taking way more compute than they can afford to give away. The biggest challenge of AI will eventually be, how to bring down how much compute a powerful model takes. I hope Claude puts more emphasis into making Haiku and Sonnet better, when I use them via JetBrains AI it feels like only Opus is good enough, for whatever odd reason.
I get the same. Work has shifted to being agentic first - and whenever I use anything other than Claude Opus it seems that the model easily gets lost spinning its wheels on even the simplest query - especially with some of our more complex codebases, whereas Opus manages to not only reason adequately about the codebase, but also can produce decent quality code/tests in fairly short order.
Oddly though, when using at home I'm using Sonnet via the standard chat interface and that, whilst it will produce substandard code in its output is still reasonably capable - even in more niche tasks. Granted though that my personal projects are far simpler than the codebase I handle at work.
Anthropic went about this in a really dishonest way. They had increased demand, fine, but their response was to ban third-party clients (clients they were fine with before), and to semi-quietly reduce limits while keeping the price the same.
Unilaterally changing the deal to give customers less for the same price should not be legal, but companies have slowly boiled the frog in such a way that now we just go "welp, it's corporations, what can you do", and forget that we actually used to have some semblance of justice in the olden days.
When asking it to write a http library which can decode/parse/encode all three versions of it the usage limit of the day gets hit with one sentence. In the pro plan. Even when you hand it a library which does hpack/huffmann.
please tell me if i'm crazy.
i just refuse to use openai/google/anthropic subscriptions, i only use open source models with ZDR tokens.
- i like privacy in my work, and i share when i wish. somehow we accepted that our prompts and work may be read and moderated by employees. would you accept people moderating what you write in excel, google docs, apple pages?
- i want a consistent tool, not something that is quantised one day, slow one day, a different harness one day, stops randomly.
- unless i am missing something, the closed source models are too slow for me to watch what they are doing. i feel comfortable with monitoring something, usually at about 200-300tps on GLM 5. above that it might even be too fast!
Its a question of price, quality and other factors.
If my company pays for it, i do not care.
If i have a hobby project were it is about converting an idea in my spare time in what i want, i'm happily paying 20$. I just did something like this on the weekend over a few hours. I really enjoy having small tools based on single html page with javascript and json as a data store (i ask it to also add an import/export feature so i can literaly edit it in the app and then save it and commit it).
For the main agent i'm waiting for like the one which will read my emails and will have access tos ystems? I would love a local setup but just buying some hardware today costs still a grant and a lot of energy. Its still sign cheaper to just use a subscription.
Not sure what you mean though regarding speed, they are super fast. I do not have a setup at home which can run 200-300 tps.
You are not crazy, you are just waking up from the SaaS delusion. We somehow allowed the industry to convince us that paying $20/month to rent volatile compute, have our proprietary workflows surveilled, and get throttled mid-thought is an 'upgrade'. The pendulum is swinging violently back to local-native tools. Deterministic, privately owned, unmetered—buying your execution layer instead of renting it is the only way to build actual leverage.
The first hit is free.
I literally ran out of tokens on the antigravity top plan after 4 new questions the other day (opus). Total scam. Not impressed.
What is the best way to get start with open weight models? And are they a good alternative to Claude Code?
i would recommend getting an API account on fireworks, this is ZDR and typically the fastest provider.
otherwise check the list of providers on openrouter and you can see the pricing, quantisation, sign up directly rather than via a router. ensure to get caching prices, do not get input/output API prices.
GLM 5 is a frontier model, Kimi 2.5 is similar with vision support, Minimax M2.7 is a very capable model focused on tool calling.
If you need server side web search, you could use the Z AI API directly, again ZDR; or Friendli AI; or just install a search mcp.
For the harness opencode is the normal one, it has subagents and parallel tool calling; or just use claude code by pointing it at the anthropic APIs of various providers like fireworks.
If you want to still use APIs, I like OpenRouter because I can use the same credits across various models, so I'm not stuck with a single family of models. (Actually, you can even use the proprietary models on OpenRouter, but they're eye-wateringly expensive.)
Otherwise you should look into running e.g. Qwen3.5-35B-A3B or Qwen3.5-27B on your own computer. They're not Opus-level but from what I've heard they're capable for smaller tasks. llama.cpp works well for inference; it works well on both CPU and GPUs and even split across both if you want.
We offer multiple SOA models at https://portal.neuralwatt.com at very generous pricing since we have options to bill per kWh instead of per token. Recipes for your favorite tools here: https://github.com/neuralwatt/neuralwatt-tools
Just install ollama.
And no, they're not as capable as SOTA models. Not by far.
However they can help reduce your token expenditure a lot by routing them the low-hanging fruit. Summaries, translations, stuff like that.