What I find fascinating is that there is so little substance in this article about the quality of the produced code and the medium. Is the code documented and tested? Is it understandable and extendable? Is it secure? What language, framework, database was used? Author mentions judgement and taste - well, is the code tasteful? Will the model rearchitect the entire thing if I ask it to add new functionality, spending another 9.5h in tokens?
Anecdote: I fed Fable some models I’ve been hand verifying (basically, I sketch out a scenario for Opus to model, it builds it, I ask it to show me the math, I correct it, we iterate like this, then I double check its code to make sure the math matches the model logic). Fable found almost every error I found, and then had some interesting suggestions for additional variables.
It also burned through my usage quota like a late-90s Hummer.
Now for the best question: what's your ROI here?
What are people working on that they see such a substantial difference between Mythos and Opus? I'd say I'm working with advanced stuff and more often than not DeepSeek is more than enough. Why is everybody a genius in here?
We see the same thing when new laptops are announced and every employee all of a sudden needs to upgrade, despite the fact that 90% of people would be able to make do with a Macbook Neo.
What it feels like to work with Fable:
> Switched to Opus 4.8: Fable 5 has safety measures that flag messages on most cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Send feedback or learn more.
> It worked for nine and a half hours.
> Again, it wasn’t perfect. As an expert, I was able to spot some errors and omissions (some as a result of the design I had asked for) that I had the AI correct
That's the bit that stuck out to me - that's longer than I would expect to work on a problem in a day or even expect to go back & fix the output of something that has a core reward loop of hours.
My customers are currently clamoring to push down my agent response times from 85 seconds down to below the 20s mark.
At the same time, it is very dissonant to see the industry heading towards hour+ long workflows with an agent.
In Claude's defense (and I cannot believe I'm defending it), I know no single dev who could create what it did (Concord), from a 19-page design document, in 9.5 working hours.
We're gonna go back to the days where our bosses ask why we're just sitting around, but instead of saying "compiling," we'll just say, "waiting for Claude."
This. I get told things like "you can't build all that on your own?" I've had Claude poop out full feature web apps in under 30 minutes, to a spec. Was it perfect? No, but sometimes even in a simple setup phase you can burn 15 minutes to some obscure setup step that's failing. I cannot just code nonstop at 900WPM or whatever ridiculous speed, and poop out an entire full feature web app, with maybe a few bugs here or there. If you can, come show me, I'll gladly have you race against my Claude prompting capabilities.
Will Claude's code be perfect in one shot? Probably not. Will it get you 80 to 90% of the way there with your chosen design patterns in under a few hours? Absolutely.
For the rare uninitiated:
https://xkcd.com/303/
> At the same time, it is very dissonant to see the industry heading towards hour+ long workflows with an agent.
At this point, pay me significantly more, and I'll do it.
Work duration is also not that valuable of a measure, you're usually better off defining the process yourself in code and having that delegate chunks of work to the models. The only real issue there is that it's harder to take advantage of the providers' subscription discounts, but on the other hand it's easier to do your own model routing, and there's no way I've seen for the normal chatbots to maintain coherence on streams of work measured in days and weeks.
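To make "defining the process yourself in code" concrete, here is a minimal sketch, not anyone's actual setup. It assumes each model can be wrapped as a plain prompt-to-text callable; `Task`, `route`, `run_pipeline`, the `difficulty` score, and the `"small"`/`"large"` keys are all hypothetical names, and the models are stubs so it runs without any provider API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A "model" is just a callable from prompt to text. Real provider clients
# would need a wrapper to fit this shape; nothing here calls a real API.
Model = Callable[[str], str]


@dataclass
class Task:
    name: str
    prompt: str
    difficulty: int  # 1 (easy) to 5 (hard), assigned by you, not by a model


def route(task: Task, hard_threshold: int = 4) -> str:
    """Pick a model key with a fixed rule: hard tasks go to the large model."""
    return "large" if task.difficulty >= hard_threshold else "small"


def run_pipeline(tasks: List[Task], models: Dict[str, Model]) -> Dict[str, str]:
    """Run tasks in order and record which model handled each one."""
    results: Dict[str, str] = {}
    for task in tasks:
        key = route(task)
        results[task.name] = f"[{key}] {models[key](task.prompt)}"
    return results


if __name__ == "__main__":
    # Stub models (assumption) standing in for real clients.
    models = {
        "small": lambda p: f"quick answer to: {p}",
        "large": lambda p: f"careful answer to: {p}",
    }
    tasks = [
        Task("rename", "rename variable x to count", difficulty=1),
        Task("schema", "design the migration for the orders table", difficulty=5),
    ]
    for name, out in run_pipeline(tasks, models).items():
        print(name, "->", out)
```

The point of the shape is that the task list and the routing rule live in your code, so the state of a multi-day stream of work is whatever you persist from `results`, not whatever a chat session happens to remember.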
I think we hit the sigmoid back when the Qwen models were released. By properly structuring my project, I can point it at any extension I want and get it going for 30 minutes to extend whatever. It can't effectively do 'god mode' on all the code, but being a mindful observer and code "professional" I don't need more than what fits in 128GB of VRAM.
I'm amazed we're so far into SOTA bloat that the Chinese will kill once they start etching silicon with these models.
My Opus 4.8 regularly works for 10+ minutes on a single non-trivial coding request.
Your Opus 4.8? Is it now usual to refer to LLMs like that?
That's pretty tame, if you want to be disturbed check out r/MyBoyfriendIsAI
Isn't it common to refer to all software like that? "Let me look at my JIRA", "I can't find anything using my Outlook's search function", "My Powerpoint is acting up today", "My browser just crashed" are all sentences I might say during a normal work day.
Depends on the demographic, I think. It also tells you a surprising amount about how the brain of the person uttering it works.
There are people that almost feel physical pain if something is unnecessarily incorrect.
+ That if the mental model of something is accurate, it is actually _more_ work to say something that is incorrect than just saying the correct thing.
Better than "The JIRA", "The Google", or "The Spotify".
You don't have your Opus 4.8? I got mine yesterday!
This is what he built:
https://isochronic-passage-chart.netlify.app/
Doesn’t work too well on mobile but looks interesting
Would it be possible for Mythos to make the space bar scroll the pages on your website properly?
Seems to be hijacked by the video of some game they generated. :(
I think Qwen 3.7-Plus is better at reasoning than Mythos, and I've used both for quite a while.
Would love to see samples of the kinds of prompts you use with both. I sometimes wonder if the specific wording is the secret sauce, I have very few issues with Opus / Claude, but when I try premier GPT models, I get weird output from what I've grown to expect with Claude.
Mollick runs the Generative AI Lab at Wharton, with all the corporate sponsors.
He is a professor but sadly also an AI shill. He should switch to advertising washing powder.
So...no engagement with the substance? Not even to explain why it is that this is not a useful description or test of capabilities? Ok.
I would like to see it do something useful, like converting pytorch to golang.
Why not get a plan from Anthropic and get that done yourself? Probably is going to cost you as much as a coffee.
Hot damn - is that the floor of what you consider useful?
This newfangled car thing is useless. It can't even properly shoe a horse.
Instead of attacking the author, please respond to the content of the article. That is the HN way, and it leads to more substantive and interesting discussions.
I just can't stand this type of fawning language.
More Mythos Marketing.
The mythos of Mythos is marketing.
[flagged]
It is not a sponsored article and he writes one of these every time a new model releases. Why would a professor at Wharton need to write sponsored Substack articles?
"I don't care who the IRS sends I am not paying taxes!"