The fact that management signed off on measuring AI use through token usage shows how incompetent management really is, including in allegedly technical companies like Amazon. Tokenmaxxing was an entirely expected and rational response. IOW: measure employees in stupid ways and you're going to get stupid behaviour as a consequence.
Depends on what they're trying to incentivise.
It's quite possible they aren't trying to measure performance but are literally just trying to increase token consumption to feed the bubble and hype.
Plus, pressured employees may find new, unique use cases for AI.
It's like if your goal is inflation: you give out tons of money, and as long as it's spent, you achieve your goal.
I would guess they are trying to maximize training data
If I was being rewarded for using more tokens, I would feed LLM output back into the model. That's probably not very useful training data.
One argument I have heard in favour of this is that management knew this would be a side effect, but that it's more important to have people engage with AI as much as possible simply to explore what is actually possible. You are effectively knowingly wasting money in the expectation that you might learn something useful that will be more valuable in the long run.
My questions for that approach are: Why treat AI as a special technology that needs enterprise-scale exploration to come up with a useful application? And why not take the alternative approach of identifying the subset of people who have indeed found solid uses and spread their best practices around?
The top-down approach to encouraging (mandating?) AI usage strikes me as infantilizing to the workers, who are perfectly capable of choosing which tools they use and when.
Management loves numbers because they’re the only things you can objectively compare as X > Y.
It makes for pretty charts, extrapolations, and projections.
It doesn't matter if the numbers are not particularly correct. As long as the data-gathering step can be justified, it'll do. Bonus points if making the number bigger is a good thing (vs. tracking something like the number of sev 1 issues).
Yes, but also because management is largely unqualified for the stuff they are hired to manage. So they regress to numbers because they otherwise cannot participate in anything technical.
My current job is doing the exact same thing. My manager even showed me a tool with graphs showing token use and related metrics.
This is Matt Garman, the ultimate MBA. His bonus is surely tied to tokens-per-quarter, which is the 2026 equivalent of measuring engineers by lines of code...
This is why AWS has been bleeding good engineers for years. What is left is starting to look like Boeing after the McDonnell Douglas merger...
They took out a quarter of their documentation pages' limited real estate with AI doc shorts that nobody asked for, nobody needs, and can't disable.
Goodhart's law in action.
Or maybe they plan to review how effective high-usage engineers have been next cycle, and the tokenmaxxers will get bitten in the ass when they have little to show for all their wasted tokens? Performance metrics can, and do, change on a dime, and tokenmaxxing seems short-sighted when management can look at old logs.
Saw a good joke on Twitter about it. Something like:
"You spent $23, over the $20 food limit. Be more careful next time. You spent $600 on tokens, $200 more than the average. Congratulations!"
https://x.com/vasuman/status/2053956365052240263
> whoever spent $600 on Anthropic last night, great job leveraging AI! But to the person who spent $23 on Uber Eats please remember our limit for food is $20 per meal
I swear the industry is being Garry Tanned.
Senior management let our localisation staff go. Now they want us to use AI to translate. They still want manual review.
We use GitHub Copilot at work; we get a measly 300 requests, with budget to go over if necessary. Opus 4.7 or GPT 5.5 would eat all of those up in a day. Are we supposed to be using more than the allotted amount? Does management see that as a good thing? Or is it best to stick within the allocated amount? Who knows? Management are playing games everywhere, it seems.
Requests are such a weird metric. We have a token limit via Copilot (unless I'm misunderstanding our setup), and most of my "features" burn 1 to 2% of my token limit per month on 4.7. But I don't admin our plan, and I'm unsure what we actually get. VS Code just gives me a percentage-of-tokens-remaining metric.
One of the weirder things about all this is how arbitrary and non-objective the billing structure seems. It's one of the reasons I'm happy to use it at work but won't ever personally subscribe. It's so opaque.
It's not just with AI, though. It's who they get their advice from. One of my friends was complaining to me about his company's management: apparently someone in management discovered that PostgreSQL is a really good database and free, so they authorised the IT department to migrate their application from Oracle Cloud to PostgreSQL, as it will "save a lot of money" (true, but...). However, they aren't willing to shell out for commercial solutions (like EnterpriseDB, which would still be a lot cheaper than Oracle), and are insisting that the team also recreate "all and every" feature that Oracle DB has, is used by their application, and is lacking in PostgreSQL - after all, "If Oracle can do it, why can't you!?"
Wow... How (in)competent is his management??? "If Oracle can do it"... in 25 years with 1k devs...
We've raised, trained, hired and promoted generations of business people who push utter nonsense, understand nothing but optimizing for bad metrics, and orient solely around short term results. It's hard to look beyond modern corporate America when looking for causes of the fall in our living standards. This AI tokenmaxxing nonsense is just another rung on the same ladder to hell we've been on for decades.
How do you burn 300 requests in a day? From my Copilot usage, Opus consumes surprisingly few requests to do a lot of stuff. It isn't billed by token but by prompt or something.
I guess you need automation for that. Run Claude with cron to find vulnerabilities, suggest and implement improvements, automatically dig through the backlog.
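Something like this, even. A minimal sketch: the prompts and file names are hypothetical, and it only assumes the `claude` CLI's headless `-p` mode, scheduled from crontab:

    // burn-tokens.ts - run nightly from crontab, e.g.:
    //   0 2 * * *  cd /path/to/repo && npx tsx burn-tokens.ts
    import { execFileSync } from "node:child_process";

    // Hypothetical chores; each is a full agentic session's worth of tokens.
    const chores: string[] = [
      "Audit this repo for likely vulnerabilities and write findings to AUDIT.md.",
      "Suggest and implement one small, safe refactor on a new branch.",
      "Read BACKLOG.md and draft an implementation plan for the oldest item.",
    ];

    for (const chore of chores) {
      // `claude -p` runs a single non-interactive prompt and exits.
      execFileSync("claude", ["-p", chore], { stdio: "inherit" });
    }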
300 prompts isn't that unreasonable to hit on a heavy day? And Opus has a significant multiplier as well
Opus 4.7 has a 7.5x multiplier when it's used from Copilot. Falling back to 4.6 it's only 3.5x
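Taking those figures at face value, the arithmetic shows how fast a 300-request allowance goes:

    300 requests ÷ 7.5 requests per Opus 4.7 prompt = 40 prompts
    300 requests ÷ 3.5 requests per Opus 4.6 prompt ≈ 86 prompts

Forty prompts is a single heavy day of agentic back-and-forth.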
If you are using subagents for asynchronous work, you can burn through 300 requests in a workday easily.
I work at Amazon (standard disclaimer: just sharing my own experience, not an official spokesperson, etc.)
I can't say that this isn't happening, but at least in the parts of the company I get visibility into, what the article describes isn't my experience. There is a lot of interest in using GenAI, but people are mostly getting kudos for creative uses of GenAI, not just for raw amount of tokens. For most scaled GenAI efforts, there is a lot of focus on output metrics (accuracy, number of findings, number of things fixed, and so on).
Amazon is a massive company; your single experience is worse than an anecdote because there is no way to verify it.
What we can verify is how Amazon already treats workers: they will surveil anyone within their systems regardless of the futility of said surveillance. Why are we supposed to not believe they're using LLM systems as a means to further control their expensive employees, keeping them from unionizing or seeking out solidarity with fellow workers? All LLMs do is give tyrannical managers more power to hold over other workers; said workers are forced to engage in self-alienation for fear of losing their jobs, or forced to do meaningless work, as that is what's being tracked (and what LLMs excel at producing).
Hardly a good proposition for any worker.
I'm sorry, but I fully do not believe you. This is a company that fires workers for taking too long a bathroom break, where said workers piss in bottles for fear of getting fired, and you're going "hey guys, it's not too bad. Only some workers get whipped, others don't!"
Thanks for the inside insight.
I'm surprised how few comments are written with the prior that Amazon managers aren't stupid or uninformed about how incentives work.
My guess would be that someone created the leaderboard without a lot of consultation with managers, and that some employees feel a competitive urge to try to "win" the leaderboard by burning tokens.
Your comment is the equivalent of stating that Jeff Bezos and Andy Jassy do not really know their employees are carrying around urine bottles.
> There is a lot of interest in using GenAI, but people are mostly getting kudos around creative uses for GenAI,
LOL, I'd imagine even Amazon HR would show little restraint in showering such praise.
It is damn fascinating to see just how many (big, serious) organizations are creating unnecessary internal strife over this.
One of my favorite heuristics/quotes applies here: "no matter how good the strategy, occasionally consider the result."
Want to know if AI is working for your org? Ask yourself/employees to "show me the result." That requires judgment and taste (is the result something of value, or just the appearance of work having been done), but it will also save you a ton of stress and disappointment later.
Once you have a score, you have a game. Once you have a game, people will do whatever it takes to win.
I don't think you need to win this, you just need to not be near the bottom of the board. But just in case, I spam tokens like it's the Chuck E Cheese roulette game.
I was thinking about this recently. I tend to run my AI at low context because the documentation states that models degrade with higher context usage.
However I see tons of people on LinkedIn with ways of backing up context, not wanting to lose context, etc.
This seems like another way the system is being misused. Higher context usage also uses more tokens. I suspect you also get worse (and slower) output than with a dense, detailed context.
I think there are two motivations that get blurred pretty quickly:
a) you find a particular context that executes well and want to preserve parts of it or not have to repeat explanations
b) you want to continue a session so you don't have to rebuild the context from scratch
I think A is something where it's totally reasonable to preserve pieces as part of like a prompt library or equivalent, or directory-specific agent files, that kind of thing.
I think B is much more likely to lead to problems if you do it over a long time, but it can be pretty useful for getting the last drop of juice out of the metaphorical orange.
I think the antipattern (that I've done myself, admittedly) is swapping between different restored contexts for different tasks or roles - at that point you should be either converting it to more durable documentation if warranted, or curating it more specifically than "restore the entire context" even if it's just one-off.
I think the answer for both cases is supposed to be finishing a "good" session with "based on what you've learned about this project, please update the CLAUDE.md/AGENTS.md/README.md files."
Ideally that replaces the back-and-forth cycle of "it's this, no it's that, it's that for reasons XYZ" with a single ingestible blob that gets the agent up to speed.
I've actually had mixed results with that without some manual curation - sometimes by the time a session has gone on for a while (heaven forbid it go through multiple compactions), the agent has so much extraneous/incorrect context for docs that it can't write documentation effectively.
Sometimes it's better to dump context incrementally, reinitialize the agent with a subset of the context, or manually prime it, then ask it to write documentation as a focused task.
Yeah, I also agree that A is good in many cases.
I think the more you anthropomorphize it the more it feels like "but I don't want to have to start all over getting it up to speed, this instance already knows all the important stuff."
If every exchange is treated as an independent query/response then it's much easier to see how cutting out the fluff using a combination of its summaries and your own helps stay focused.
People who don't code (management, leadership) think AI will 10x the company, but it's really a 40-60% boost. And engineers have to feign adopting these tools for fear of layoffs.
> 40-60% boost
Where? What industry, what kind of projects? The only one where I can imagine it to be true is vulnerability research, and I imagine all the low-hanging fruit will be picked soon.
Mine, easily. Senior (near staff) level embedded engineering.
It will spin up a boilerplate U-Boot or BSP config no problem. I still go in and manually check and add peripherals, but Opus 4.7 is terrifyingly smart.
Need to modify or add a new peripheral? It's there, no problem. Or in a bare-metal project, I can point it at an STM32 CubeMX starter repo and ask for a feature (set up the ADC on pins 4 and 7, ask me for parameters) and it's just done. I do in a day what would probably take me two.
It doesn't help me with reviewing others' work, or planning (I maintain that these are manual tasks). So yeah, I agree with the 40-60%. The parts of my job it helps, it really helps.
> I can point it at an STM32 cubemx starter repo and ask for a feature
My experience is it will attempt to read from the wrong memory block, resulting in garbage. But that was a while ago, so maybe LLMs have gotten better.
The AI labs have all released at least 3 new models each since December; things move very quickly.
Yeah, the industry has no issues selling $5 bills for $1. Why is this a good thing for society again? That the public subsidizes VC to no shared gains?
Any typical web backend or frontend kind of thing. So like, not systems code.
40% boost for smart engineers, for now.
People churning out slop is slowing me down and the full effects of it won't be felt for a while.
Yesterday, I had my first experience of a mid-level dev stuck on a problem, coming to me with Codex and Copilot summaries of what those tools thought the problem was, which turned out to be completely off-base.
Codex was pretty sure something was wrong with the response object being returned by the endpoint in question. It turned out there was a conversion method applied to the endpoint response, which mutated its input. This method had been running w/o problems for a while, until the dev put it in a useEffect. At this point, React dev mode's policy of rendering everything twice kicked in, which caused the second pass through the conversion method to fail on the now-mutated input object.
Codex never even hinted that the conversion method mutating the input could be a problem, nor anything about React dev mode rendering everything twice (specifically to catch problems like this). Apparently, neither of those came up much in its training data.
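A minimal sketch of the shape of that bug (hypothetical names, not his actual code): a converter that mutates its input is fine when each payload passes through once, then blows up once it lands in a useEffect that dev-mode StrictMode runs twice.

    import { useEffect, useState } from "react";

    type ApiResponse = { items: string[] | null };

    // Stand-in for the conversion method: returns a view model but also
    // mutates its input as a side effect.
    function toViewModel(resp: ApiResponse): string[] {
      const items = resp.items!;   // second call: already null
      resp.items = null;           // the hidden mutation
      return items.map((s) => s.toUpperCase()); // throws on null
    }

    function Widget({ resp }: { resp: ApiResponse }) {
      const [rows, setRows] = useState<string[]>([]);
      useEffect(() => {
        // Under <React.StrictMode> in development, effects run twice on
        // mount. The second run sees the mutated resp and toViewModel
        // throws - the double invocation exists to surface bugs like this.
        setRows(toViewModel(resp));
      }, [resp]);
      return <ul>{rows.map((r) => <li key={r}>{r}</li>)}</ul>;
    }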
My point is that this dev seems to have lost, in a few short months of writing everything with Codex, the ability to trace an error from its source (the error trace was being swallowed in a Codex-written catch block that spit out a generic error message). He was completely stuck and just kept doubling down on trying to get Codex to solve the problem, even checking with Copilot as a backup. I'm not optimistic about where this is headed.
Are you sure he was capable of debugging it before?
Yes, eventually. Largely because he would have written all the code that got to that point and had a mental model of the entire flow instead of it being a gray box.
The new bottleneck for development at work is code reviews. Devs are creating whole features that would take months in only a couple of weeks, but code-reviewing that is a slow, painful process.
The bottleneck at my work for development was already code review before LLMs.
This is why I'm not that excited about vibe coding. The bottleneck has always been understanding what the heck is going on.
In my view you should 1) use AI as a tool to help you learn and 2) write boilerplate you could have easily written yourself. Getting it to think for you is counterproductive (at least until it replaces us entirely).
It's not really 60%. It accelerates a lot of code creation and saves some time on admin tasks. That is it.
Amazon is big and inconsistent enough that "somewhere in Amazon, <XYZ> is occurring" is statistically true, no matter how nutso-sounding your <XYZ>.
“Show me the incentive and I'll show you the outcome.”
― Charlie Munger
Would that make chasing perverse outcomes in the corporate environment the Munger Games?
When I was at Amazon, I suggested that promotion to L7 people manager should require having that quote tattooed in reverse on your forehead, so that you saw it in the mirror every day. Every time some mandate would come down from on high, it was clear that nobody had thought of the second-order effects, malicious compliance, or just outright gaming.
I joked about this on HN a few weeks ago and I find it funny that we ended up here already. Goodhart's Law in action.
I have mixed thoughts on this. These thoughts are my own. On the one hand, it's objectively silly to pretend like we've solved the age-old problem of measuring developer productivity. Metric-obsessed leadership can be intolerable and counterproductive, and it's a good way to paint yourself into a corner, undervaluing your best talent and overvaluing your mediocre talent.
That said, I'm kind of having a blast using CC in corporate with all the connectors available at our disposal, and I'm baffled by how little some of my coworkers know about what's available and what the capabilities are. So it's clear that perhaps some encouragement is prudent for those who are slower to embrace new technologies, but I'm not sure token-counting and tokenmaxxing are the answer.
Could you list some of the capabilities you use that bring value besides "summarize my email"?
Not OP, but within Amazon we have pretty good connectors around integrating with our task system (so you can pretty easily ask your GenAI tool "look up the next item in our sprint board, let me know if you have any clarifying questions, but otherwise start implementing it"). We have decent integration with internal wiki and search systems, so it's easier now to figure out the best Amazon way to do some coding task. And Amazon being a big doc-writing company, there are lots of great tools for helping improve all phases of writing.
Yes, we can crawl our entire internal documentation via LLM. Want to know if someone is already working in the space of your latest idea? Ask Claude, it hits the internal search APIs and finds docs and references directly relevant to your query. There are a lot of separate document stores so this took a lot of effort previously. I can also query Slack, Outlook, etc. I don’t understand the cynicism in your comment.
That is a "summarize my wiki". A nice search feature.
I can tell they are surely not the only ones.
Everyone I talk to nowadays has KPIs tied to AI usage in their performance evaluation.
The most important skill is to not stand out of the crowd. This is how you survive in the Soviet Union, in the army, and clearly also at tech companies.
Quite a good point.
Corporate emails asking "why are you not using the <insert-llm> paid plan???" came very, very rapidly. So naturally, everybody started using it blindly so that the dashboard metrics are all high.
It's astonishing how society forgets.
I, too, can easily use more tokens to achieve the same task. I can give worse prompts. I can fail to make it clear to the tools where to find the information they need. I can ask them to think hard when they don't need to, and tell them not to think when they do need to. I can give vague, open-ended instructions. I can generate code that sucks and throw it away.
If I do all of this, do I get a promotion?
Even if I'm in the middle of using the AI seriously and then want to rename a variable, I can't do that myself because it'll confuse the AI, so I'll tell it to rename. That seems pretty wasteful.
I wish I could do some tokenmaxxing at my company. The only plan available is maxed out for the month after a few days of serious work, but the AI “experts” are declaring that nobody needs that much. It’s really frustrating to constantly have to juggle quota and lower models. All this while the declared goal is to reach 50% of code written by AI.
When did FT become Business Insider?
I have an FT subscription and they keep moving toward this kind of narrative-first reporting to get clicks. It's no longer a believable paper.
Business Insider would say "tokenmaxxing is pure promotional intelligence"
Hunger games in the age of AI - eliminate/automate your colleague's job, until a single software engineer is left (or two if aristocrats will see it as a good PR).
You can use Codex and Claude Code for most of the tasks that you would otherwise do manually:
Filing JIRA tickets and updates. Opening PRs, having AI review PRs. This will all use tokens.
No need to tokenmaxx; you will end up burning tokens with just regular AI usage.
Each day I send the AI on a fruitless mission like "summarize the entire codebase" while I do my actual work, which involves actually using the AI for real work. Wish I could disable the token cache to make it spend more.
This kind of thing is totally fine if it's being done (it's believable because Meta internally incentivized tokenmaxxing). When you're trying to change the behavior of a large number of people, only blunt instruments are available if you want to get quick outcomes. The edge cases where people Goodhart very hard are all right. You can just human-in-the-loop them away. The opportunity cost for most organizations of not moving to use AI tools as productivity enhancers is currently gauged by them (rightfully, in my opinion) to be too high to allow for osmotic adoption.
Most people watch sea changes come and go. They all have a story of how they "could have bought Bitcoin when it was $100" or whatever. In an org, you don't want to be telling the story of "we could have done that when nobody else had", so you incentivize adoption of the tool as hard as possible and hope that dipping feet in the water makes people want to swim. If you don't already have a culture of early adoption (and no large company can) then you have to use blunt incentives. I don't think anyone has demonstrated otherwise.
Measuring productivity via tokens is the modern day equivalent to doing it via number of commits or LOC
At least for some people I know it’s not necessarily because there’s pressure from leadership, but because it’s funny that the org spends like $15,000/mo writing HP fanfic or whatever
> They said the move reflected pressure to adopt the technology after Amazon introduced targets for more than 80 percent of developers to use AI each week, and earlier this year began tracking AI token consumption on internal leader boards.
Measuring token usage as a proxy for something beneficial to the company has got to be the single dumbest thing I have ever heard of in my entire software career.
It would be like some company in the dot-com era measuring employees' internet download traffic as a proxy for productivity or internet-pilledness.
Why not just reward employees based on who submits the largest expense claims? That might have some correlation to work too, right?!
In the corporate world it's impossible for any one person to tell what's going on across multiple domains due to the complexity. If I tell you the Zorbulon API is creating 30% more flargs (which is critical for Twiddle operation), I often just have to take your word for it.
Hell, I'm in the bowels of Google as an IC and it's hard to understand what adjacent teams are doing. Even harder for management that never gets their hands on anything.
So while you know engineers are probably bullshitting you with fake work, you can at least turn around and tell your supervisor the numbers. It's all a game of plausible deniability.
It's truly bonkers: the reverse of a budget. It is like rewarding the people who spend the most money.
A perfect doomsday machine. Over-using tokens gets your peers laid-off before yourself.
Another stupid meme-latching name. Don't normalize these *maxxing nonsense words and just use plain language. Let's see, maybe just say they were optimizing for token count?
I like it because it highlights the stupidity going on. Bullshit doesn't deserve a respectable name.
Reminds me of the managers that use 'lines of code added' as a metric
Measuring token usage as a productivity metric is like measuring keystrokes. Don't mind me, just over here rolling my face on the keyboard for an hour so I can take Friday off...
...except each keystroke has an associated cost, the sum of which may equal or exceed my salary.
Insert photo of the Simpsons drinking bird while Homer sleeps here.
What's nuts is how many intelligent people— people who would say "of course 'LOC written' is a terrible measure of developer productivity, of course only a dysfunctional company run by morons would do that"— have immediately bought into this. Amazon has token use mandates, I've heard Google has token use "leaderboards", friends at startups say they all get graded on tokens used. It's like watching your sensible, levelheaded friend go completely off the rails; collective madness.
Some people respond to incentives. The rest of us are just trying to do our jobs and will probably be fired and then later consumed by the basilisk. We are living in an age of extremophiles.
It's a test of practical intelligence.
> collective madness
mass hysteria perhaps?
There used to be a time when people died from dancing too much (from my understanding, which, hey, I can be wrong about, I usually am): https://en.wikipedia.org/wiki/Dancing_plague_of_1518
I think that although we like to consider ourselves smart and really intelligent, we run on biological machines and clocks that haven't changed much, evolutionarily, since 1518, or even since the times when we used to hunt and forage for that matter.
Seems to be a clear case of Goodhart's Law that states that "when a measure becomes a target, it ceases to be a good measure."
That's true, but I don't know if this one was ever a good measure in the first place.
People use AI differently and they can be equally productive with a variety of token usage quantities.
Also, different kinds of work are differently amenable to using AI.
Measuring tokens used can absolutely be useful for tracking things like cost, compute demand, and usage to negotiate a better contract, and so on.
Using it to grade people is, err, rather unwise.
I think we've found an extension of Goodhart's law: it makes bad measures even worse.
Can't you just wire your agent into a Python script and have it infinitely check its own work? That would hit the metrics, but do nothing useful.
Hell, throw a Tarot reading in the middle of the loop so the agent has non-deterministic behavior too.
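Roughly this, say. Sketched in TypeScript instead of Python; the only real assumption is the `claude` CLI's headless `-p` flag, everything else is made up:

    import { execFileSync } from "node:child_process";

    const TAROT = ["The Fool", "The Tower", "Wheel of Fortune", "Death"];

    // One headless prompt per call; every call is another pile of tokens.
    function ask(prompt: string): string {
      return execFileSync("claude", ["-p", prompt], { encoding: "utf8" });
    }

    let work = ask("Summarize the entire codebase in detail.");
    for (;;) {
      // The promised non-determinism: draw a card each pass.
      const card = TAROT[Math.floor(Math.random() * TAROT.length)];
      work = ask(`Tarot says "${card}". In that spirit, re-check this for errors and rewrite it:\n\n${work}`);
      // No exit condition: the deliverable is the token meter, not the output.
    }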
https://github.com/trailofbits/skills/tree/main/plugins/let-...
Amazon management wants to play five-dimensional chess? Play Balatro instead.
Imagine selling a product where companies are foaming at the mouth to increase their spend and pay you more money
It does not get any better than that
Jensen, Sam, Dario: https://i.imgur.com/AI7rtCY.jpeg
Vibecoded PPTs, docs, and frontends are an even bigger scam than crypto ever was. Of course people are getting sucked into it.
Someone pressuring you to do something at work gives off creep vibes.
Is using AI tools in the contract? If not, then what are they on about?
They're always pressuring me into "shipping" "features"
"someone pressuring you to do something at work" describes pretty much all jobs
Very very few jobs in the US give you a contract.
Think: the worst place I worked, you had to install an app like Time Bro and account for all 8 hours of the day; the app logged per minute/hour.
Would rather eat dirt than work at a place like that. Respect.
This reads more like it's a single employee's gripe than a real thing that's happening. They're not using the metrics in performance reviews, and it's a new AI tool that AWS probably wants legitimate usage data out of.
That said, if you can't figure out how to use AI in a software job you should look into it. Not using AI at this point is a lot like not using CAD as an architect.
It is being used in performance reviews, source: recent Amazon SWE.
They also use a bunch of dumb metrics like total PRs submitted, total comments made on PRs, etc. To the point that there are multiple heavily used internal tools to game these metrics, e.g., auto-commenting LGTM on any approved PR, thus making the metrics even worse than they would have been before.
> Amazon has told employees that the AI token statistics would not be used in performance evaluations.
> Managers are discouraged from using token use to measure performance, according to a person familiar with the matter.
Like CAD and architects, if you're not using LLMs while coding it's an issue, but Amazon is very clear that this isn't an official metric. I would believe managers know how many tokens you're using, but it sounds like they just interviewed a disgruntled employee who didn't like AI and published it.
> Not using AI at this point is a lot like not using CAD as an architect.
Does CAD software regularly generate an incorrect design that results in a catastrophic failure of the building?
I think it’s real. I’m at a huge SV tech company and at least half the people here are “token maxing”.
AI is genuinely useful for many tasks. But 2x or greater business value from engineering orgs isn't it. And even if it were, businesses are terrible at measuring value added on an individual basis.
What they can measure though is token use. I’ve heard the same thing from other large companies my friends work for.
It’s bad enough that I’ve moved a significant amount of money out of US large-cap stocks.
"They're not using the metrics in performance reviews" means almost nothing. It doesn't mean managers at every level are not frequently looking at those numbers. Anyone from Amazon will tell you how much "hint" they get from management about using those tools.
>Not using AI at this point is a lot like not using CAD as an architect.
You should have asked AI to come up with a better analogy.
Amazon has far more roles than just software. PMs, FC area managers, managers - if your job involves writing anything you're expected to be using AI in some capacity.
We can tell they are using AI
I have not been using AI since the beginning, and nothing has changed for me. I have only watched my coworkers and the industry get dimmer, and get faster at getting dimmer. I have witnessed professionals become total amateurs and adopt "well, the AI generated this unreviewed report" as their basis of knowledge.
No thanks I’ll just watch y’all slip down the slope.
Agreed. AI usage seems to be mostly bragging on HN / LinkedIn
Apparently it's real. Meta has a tokenmaxxing leaderboard too.
"Wow, look at how fast employee # 2 is setting money on fire! Let's promote him!"
> That said, if you can't figure out how to use AI in a software job you should look into it. Not using AI at this point is a lot like not using CAD as an architect.
When LLMs are capable of actually doing a good job, then it might be like that. We are not there yet, and we may never be.
> They're not using the metrics in performance reviews
Heh. No need to be ashamed, I used to believe them when they lied to me like this too!
This makes me think of the tulip bubble. Using AI as much as possible just so people think you are productive is like buying tulips so that people think you're affluent.
A very poor look for management. They don't know what the heck they're doing.
yes
omg