Ok here are the flow related links. This was about 1.5 years ago when I was trying to figure out burnout and it turned out flow (or lack thereof) was closely related.
* https://youtu.be/VbUFMYs0kXQ?si=xiNw4ZFlla8k-p7w The person who gives this talk (Rian Doris) has a good newsletter that I still read. I just checked their website and it has gone into full commercial mode, so YMMV.
* https://www.ted.com/talks/elizabeth_gilbert_your_elusive_creative_genius
* https://www.amazon.com/dp/0465074871
* https://www.betterup.com/blog/meaning-of-personal-values
I'm in the same boat and I'm not a fan of the current way of working with agents, but I think tooling is what needs to catch up.
So, I actually decided to try to tackle it myself and worked some months (full time) on it.
https://beolis.com is the result of that, it's a local cli in a kanban board style with a remote server to keep the team on track (I've been using it myself for some time and actually started to ask some friends to use it just yesterday -- feedback very welcome, I still wanted to do some additional things before asking more people to use it, but oh well, I'm a fan of building in public anyways and it's probably better to have feedback sooner rather than later).
The main point there is that you work mostly in the ticket description (your own spec) and the plan (the spec as the agent sees it, generated with a custom workflow), and then have another custom workflow implement it (you can choose how you want it -- https://beolis.com/blog/post/custom-coding-workflows has some info on what I'm using myself).
As a result, at least for me, I do spend more time immersed in a flow state, although that's while writing the specs and reviewing code. In some cases, when things get more complicated, it's more work to write the spec in a way the agent can work with than to just dive into the code, so going into "code" mode is something I still have to do; agents are definitely not perfect.
I guess I'm lacking in docs on how to use it effectively. I plan to create a video next week and post it on the blog, so if you're interested, keep an eye on it ;)
Just yesterday I tried to find an annoying and persistent bug in the communication between a Lyrion Media Server and my player.
I used OpenCode's native Big Pickle AI, and at first it was a pain: it gave me new code, I had to start the player and test the controls in the server's web GUI, report the errors back, and so forth. It tried a lot, but never found the real cause.
Then I got tired and told it to use Playwright to control the browser and test by itself. After some hangs that I had to stop manually, it did it all by itself and finally fixed the bug.
I had to increase the agent's steps setting in the config, but that was it.
While it was fixing the bug, I surfed the web and kept an eye on it, but it did everything on its own. Impressive.
I am currently in the process of launching my AI teams platform that I've been working on since at least January. It's https://PersonaStack.ai. I'm doing it without VC money and all by myself. I've used over 110B tokens so far building it.
You get some amazing results with teams of AIs if you do it right. The key is to control behavior with what integrations and responsibilities each agent has. That way they naturally adapt, delegate, fact check each other, and generally act more autonomously.
This is already running the automated news site ainews.personastack.ai complete with social media posts 100% automated.
It also runs the issue triage, coding, reviews, and releases for the Kuberhealthy open source CNCF project, which is another thing of mine.
I don't think the next step is really smarter models. It's how we make the models more effective, and teams, when done right, net the best results I've seen.
Hoping to get noticed here soon, but it's extremely hard to do solo I'm finding.
Very impressive, especially since you did it solo. The website looks great and explains everything in detail.
Can you elaborate more about its development? How much do 110B tokens equate to in $$$? What LLM did you prefer most during development? Any suggestions for other solo developers trying to launch their LLM-built product?
Doesn't this go directly against what the author is asking about? You're much less likely to enter flow state if you have a team of AI agents which are supposed to be autonomous.
I'm in the middle of it so don't have any conclusions for you, but I started mucking with building my own cli coding app and there are _tons_ of levers available that aren't apparent from claude code or codex.
Including altering the turn concept. I think it is still ultimately call and response, but instead of everything being a quarter note you can get a little closer to a beat you like.
Same here. Every time I come up with a complicated way to personalize my workflow I end up finding out there's already a better way to do it.
The only thing that I consistently do is create a simple html dashboard with a to-do list I can guide claude code with while rendering progress somewhat graphically. I love the levers but it's kinda the opposite of the flow in question.
I'd love to have an interface where I could have several conversations with an LLM, with common context, but each separate and anchored to a different place in the code.
I have a custom harness that runs in a macOS VM. It has e-mail and its own accounts. I assign it tasks in Linear, it does them and spins up PRs for me to review. This works pretty well, generally. I have to spend time writing stories and doing code review, but I don’t have to follow its (their — I have 3 of them) every move.
I'm trying to do the same amount of work faster, not do work in parallel or agent orchestration. I'm not against letting the model go off and do things on its own; that has its time and place.
But if I can do something in 15 minutes instead of 1 hour without the annoying prompt-response loop, without the feeling that there could be blind spots, and while keeping all (or at least most) of the context in my head, that's a bigger win than spinning up 5 agents to do different things.
One of the things I've been talking about with my senior developers is how the bottleneck has shifted even more dramatically to human code understanding vs code generation. AI is still not suitable for generating production-grade code without a human checking it (yet), but it can produce a huge amount of code for humans to check. We've been experimenting with having AI find better ways of communicating what is in a change at different abstraction levels, for example by always generating diagrams showing what it did. The idea is that anything that speeds up human understanding of changes addresses the core bottleneck of the whole process.
using tools like claude code and codex constantly boosts our dopamine, making hyperflow impossible. these days, most engineers work on multiple projects simultaneously to satisfy their dopamine receptors.
I think the value right now is to focus less on external orchestration, if at all. trust the (current best) model to do it better than anything you bolt on to the harness. focus your energy on providing clearer specs. I think the optimal spec is a disambiguated (through liberal use of the AskUserQuestion tool) statement of (1) intent, (2) input/output contracts, (3) constraints, and (4) preconditions. focus on that and get out of the model's way. I think of it like this: imagine a person who was not as smart as you was trying to tell you how to do a task. would you want more verbosity and step-by-step instructions, or would you want them to just cut to the chase (i.e., what are you trying to do, what are the obstacles, I'll let you know if I have questions)?
also let the model verify itself. don't give it an objective that is vague; give it clear exit criteria for goals and let it loop until it gets there
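To make the four-part spec and the "loop until the exit criteria pass" idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: `Spec`, `run_until_done`, and the `attempt` callback (a stand-in for whatever actually calls the model) are hypothetical names, not part of any harness mentioned in this thread.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Spec:
    """Intent, contracts, constraints, preconditions, plus checkable exit criteria."""
    intent: str
    io_contracts: List[str]
    constraints: List[str]
    preconditions: List[str]
    # Each exit criterion is a named check that returns True once it is satisfied.
    exit_criteria: Dict[str, Callable[[], bool]] = field(default_factory=dict)

    def to_prompt(self) -> str:
        sections = [
            ("Intent", [self.intent]),
            ("Input/output contracts", self.io_contracts),
            ("Constraints", self.constraints),
            ("Preconditions", self.preconditions),
            ("Exit criteria", list(self.exit_criteria)),
        ]
        lines: List[str] = []
        for title, items in sections:
            lines.append(f"## {title}")
            lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)


def run_until_done(spec: Spec, attempt: Callable[[str, List[str]], None],
                   max_rounds: int = 5) -> int:
    """Call attempt(prompt, failing_criteria) until every exit criterion passes.

    Returns the number of rounds used; raises if the budget runs out.
    """
    failures: List[str] = []
    for round_no in range(1, max_rounds + 1):
        attempt(spec.to_prompt(), failures)
        failures = [name for name, check in spec.exit_criteria.items() if not check()]
        if not failures:
            return round_no
    raise RuntimeError(f"exit criteria still failing: {failures}")


# Hypothetical stand-in for a model turn: each call "fixes" one more test.
state = {"tests_passing": 0}


def fake_model_turn(prompt: str, failures: List[str]) -> None:
    state["tests_passing"] += 1


spec = Spec(
    intent="Make the parser accept trailing commas",
    io_contracts=["parse(str) -> list[str]"],
    constraints=["no new dependencies"],
    preconditions=["existing test suite is green"],
    exit_criteria={"3 new tests pass": lambda: state["tests_passing"] >= 3},
)
print(run_until_done(spec, fake_model_turn))
```

With the fake turn above, the loop stops on the third round, when the single exit criterion first reports success. The point of the shape is that the stopping condition lives in code the model cannot argue with, not in the prompt.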
so much of the orchestration scaffolding seems like massive technical debt
oddly, I do the opposite of a lot of conventional advice when it comes to models. I use no memory; I think there is something similar to context rot when everything is stored. I like creating markdown files as memory that the model can grep if needed. I also haven't found a real use for hooks yet. I have tried, but they always seem to get in the way. skills on the other hand are very undervalued. they are so much more powerful than many realize. I used to think agents were where the power was. I think it's actually skills. agents are really for context preservation. skills are what increase capabilities
I'm not even talking about quantity of items in memory, I mean dilution of intent. I really love a model with a clean slate and only the items it needs. I fear the memory guides the model in areas that might not be what I want with the current prompt
progressive disclosure is a big one. you can make context available but it is only loaded when needed. like lazy loading for prompt engineering. skills are to be used to instruct the model how to do something specific that is not in its training data. like how to access my proprietary system, how to interface with a custom program. you can embed templates in skills, you can embed code that executes in skills and only the output is loaded into context. skills expand capabilities, agents constrain context
(constraining context is a very good thing btw, don't mean to imply that agents are somehow inferior to skills)
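The "lazy loading for prompt engineering" idea can be sketched in a few lines of Python. This is not how any particular agent implements skills; `Skill`, `SkillRegistry`, and the two example skills are hypothetical, and the only point is that one-line descriptions are always listed while full bodies are loaded on first use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Skill:
    name: str
    description: str              # short, always listed in context
    load_body: Callable[[], str]  # full instructions, fetched only on use


class SkillRegistry:
    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}
        self.loaded: List[str] = []  # names whose bodies entered the context

    def register(self, skill: Skill) -> None:
        self._skills[skill.name] = skill

    def index(self) -> str:
        """Cheap listing the model always sees."""
        return "\n".join(f"{s.name}: {s.description}" for s in self._skills.values())

    def use(self, name: str) -> str:
        """Pull one skill's full body into context on demand."""
        body = self._skills[name].load_body()
        self.loaded.append(name)
        return body


registry = SkillRegistry()
registry.register(Skill("billing-api", "How to call the in-house billing system",
                        lambda: "POST to the internal endpoint with a signed token ..."))
registry.register(Skill("report-template", "Weekly report layout",
                        lambda: "# Weekly report\n## Highlights\n..."))

print(registry.index())             # both one-line descriptions
print(registry.use("billing-api"))  # only this body is loaded
print(registry.loaded)
```

Because `load_body` is a callable, it could just as well run a script and return only its output, which matches the "code that executes in skills and only the output is loaded" point above.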
It feels like everyone and their grandma is building an agent orchestrator at the moment, but I'm not hearing a lot of success stories. The fact that Anthropic and OpenAI haven't laid off all their software engineers already is probably a sign that orchestration breaks down somewhere. I suspect it's just a more elaborate way of burning tokens. I'm still interested in experimenting though.
I'm currently rolling out Matt Pocock's Sandcastle project so that I can have those brakes removed. What will be left is just the grilling(/wayfinding).
My current flow heavily relies on Matt Pocock's Skills and Sandcastle project.
I find them highly valuable in practice: grilling(/wayfind) into a spec and extract issues. Those live in Linear projects.
I'm pointing my Sandcastle set-up at such Linear projects (or loose issues), which results in an MR.
Currently at the point of self-improving the prompts and Sandcastle set-up with a retrospective pass of the logs.
I don't think you would expect to get into a flow state if you were intermittently directing another (human) programmer to do work, and you shouldn't expect to with LLM-driven coding either. Perhaps you are best off finding ways to extend the length of time the LLM can work without prompting, then using that downtime to focus on other tasks that will help you guide it better the next time you need to prompt it.
I feel the opposite. Creating a DTO or wiring up a CQRS command takes me out of the flow. And while I enjoy a good refactoring, it would be nice if I could just have it refactor code in the background while I'm still working in the same file.
Well, it always depends on your environment. In my case, nothing forces me to heavily use AI, so my workflow is kind of the old way, but with less hassle.
- Do your thinking alone. (AI part: search, understanding)
- Speccing. (AI part: search, understanding, completing some text)
- Coding like the old days. (AI part: search, understanding, code examples)
- Okay, now I have a good idea of how my feature is going to work
- Look for fluff code and delegate it to AI to write/review it.
- Focus on the part of the code I want to have fun doing.
- Review.
- Repeat.
It’s slower than the approach of doing specs and letting AI do the rest, while focusing your role only on code review. However, I’m more in control of what I build, I can explain what I built better than anyone else, and I build up my knowledge. (Also, I have fewer problems, because less code haha)
Will I go for the full agentic way? Maybe, but I will find a way to slow it down so I can stay in control.
I created a small PI extension that always watches relevant directories and answers me in place, without switching context, or using a chat interface. Still experimenting but I like it.
I've built a couple of things in the past few months that have leaned heavily on an LLM as my programmer. Mainly Claude Code, but occasionally Codex too. It's a different way to produce. I spend more time doing something like plain-text feature mapping: simple .md files, good flow and creativity. Then once I'm happy with it, I pass it off to the dev team (Claude) to code up and integrate. I feel like I'm flowing in the part of the process I always was. But the buzz of getting something working is gone. It's more like the slow satisfaction of getting something useful at the end.
The fundamental problem I keep seeing across all harnesses is the use of the exact same UX afforded by a git-based backend. If we want to stay in flow, the LLM's edit backend would have to be based on something like CRDTs to handle simultaneous edits.
My current approach which I've been testing on two MVPs with what I would call 'moderate success' (but hey, actual success!)
3 tier, philosophy-spec-design. Increasing detail. Design files include db model explanations and pseudocode/function headers - that level of detail.
For each thing I need to change, I have a prompt ready to go that asks the agent to follow about 5 steps, and it outputs a 'reviewfile' with details of what it thinks about the thing I posited. I review its output. I have another prompt ready to then get an agent to generate a taskfile and update the design documentation. The taskfile explains in great detail what has changed and what needs to be implemented. I review the taskfile and the git diffs of the design doc changes. Finally an agent implements the taskfile. I review all changed code and commit.
It gets there, but still definitely misses some stuff. It's very adequate for an MVP, I'm finding.
Edit: this seems to only work with Opus. Sonnet can't do it (maybe I'm just lucky and Opus is seriously compensating for an awful approach?)
I've been working on inverting the control theory for the agent loop. Instead of the user initiating everything, the agent runs automatically in the background and calls the user for feedback as part of tool use. The end game for me is to get rid of the chat interface altogether and move back toward async email and other messaging channels. The chatbot UI as a means of driving the business always felt like a temporary stepping stone / clever demo.
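As a rough illustration of that inversion, here is a minimal Python sketch in which asking the human is just another tool call and unanswered tasks are parked instead of blocking the loop. `HumanInbox`, `tick`, and the example tasks are hypothetical; a real version would sit on top of email or another messaging channel rather than in-memory state.

```python
import queue
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set, Tuple


@dataclass
class Question:
    task: str
    text: str


class HumanInbox:
    """Stand-in for an async channel such as email."""

    def __init__(self) -> None:
        self.outgoing: "queue.Queue[Question]" = queue.Queue()
        self._answers: Dict[str, str] = {}
        self._asked: Set[str] = set()

    def ask(self, task: str, text: str) -> Optional[str]:
        """Tool the agent calls; returns the answer if one has arrived, else None."""
        if task in self._answers:
            return self._answers.pop(task)
        if task not in self._asked:      # do not send the same question twice
            self._asked.add(task)
            self.outgoing.put(Question(task, text))
        return None

    def reply(self, task: str, answer: str) -> None:
        """Called when the human gets around to answering."""
        self._answers[task] = answer


def tick(tasks: List[str], inbox: HumanInbox,
         needs_human: Callable[[str], bool]) -> Tuple[List[str], List[str]]:
    """One background pass: finish what can be finished, park the rest."""
    done: List[str] = []
    parked: List[str] = []
    for task in tasks:
        if needs_human(task) and inbox.ask(task, f"How should I handle {task!r}?") is None:
            parked.append(task)          # keep working on everything else
            continue
        done.append(task)
    return done, parked


inbox = HumanInbox()
risky = lambda t: "drop" in t
done, parked = tick(["rename module", "drop table users"], inbox, risky)
print(done, parked)

inbox.reply("drop table users", "archive it first")
print(tick(parked, inbox, risky))
```

The first pass completes the safe task and queues exactly one question; the second pass, after the reply, completes the parked task. The user is only interrupted when the agent decides it needs input.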
I think there are 10-100x productivity gains lurking in here. It is very expensive for a human to reserialize their mental state into a prompt each time a task needs working on. An agent can do this ~instantly and with high frequency 24/7. The higher the rate of evaluation the less change has to be dealt with between any two iterations. So, the likelihood that a given iteration needs human help goes down as you increase the rate of evaluation per unit of wall clock time. Tighter and faster control loops tend to require less severe corrective measures than slow and sloppy ones.
This is the most plausible reason for so many tokens in the future. I can actually see a million tokens per second making sense. I have a pretty good idea how I'd approach this if I actually had access to this kind of infrastructure. 1Mtok/s is baby tier in terms of raw information theory. The politics of employing a system like this are far more terrifying to me than any technological aspects. Humans really like having control over things, even when that control is pure downside for the business.
I build a lot of my own tools, to suit exactly how I want to work. Obviously, having a little thinky guy in the computer to do most of the busy work of making new tools accelerates that, but tools that make the LLMs suit me also accelerate my general work.
Some stuff I've built:
https://github.com/swelljoe/tandem - Tandem is a sysadmin buddy that travels with you over ssh. Just a wrapper over tmux and claude code (or whatever agent you like), it opens two panes in tmux, one with an ssh session to one of the hundreds of devices I maintain, and one with a local Claude Code configured to use a local work space and instructed via CLAUDE.md/AGENTS.md to use tmux to interact with the remote machine. I built it because a lot of my coworkers were installing Claude Code on our robots and authenticating there to get help with robot troubles, and that felt bad. This allows them to keep all sensitive stuff locally and still get help troubleshooting directly on the device. I happen to find it useful, sometimes, too.
https://github.com/swelljoe/nelson - Nelson is a fancy Ralph loop for security bug hunting that I built to help audit my own software. It's also grown to include a benchmark suite I'm using to figure out which models are worth using for security work. I've published some of those benchmark results, and have a few hundred hours/dollars worth of new ones to publish this weekend. Turns out the benchmarking is more interesting, so that's gotten more attention than the bug-hunting side, but the benchmarks inform how the bug-hunting side works, and I added multi-model/multi-pass scans and de-dupe features recently because I found that letting models have a couple bites at the apple increases discovery, and there are bugs that only some models catch, and it's not always the top model that finds them. There's some overlap, but also some divergence. This research has also led me to start working on a harness for security auditing tasks; giving the agent tools and project structure data to lift detection and reduce false positives.
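The multi-model de-dupe idea can be illustrated with a short Python sketch. This is not Nelson's code: `Finding`, `merge_findings`, the model names, and the choice of (path, line, kind) as the de-dupe key are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set


class Finding(NamedTuple):
    path: str
    line: int
    kind: str  # e.g. "sql-injection"


def merge_findings(runs: Dict[str, List[Finding]]) -> Dict[Finding, Set[str]]:
    """Map each distinct finding to the set of models that reported it."""
    merged: Dict[Finding, Set[str]] = defaultdict(set)
    for model, findings in runs.items():
        for finding in findings:
            merged[finding].add(model)
    return dict(merged)


runs = {
    "model-a": [Finding("auth.py", 42, "sql-injection"),
                Finding("upload.py", 10, "path-traversal")],
    "model-b": [Finding("auth.py", 42, "sql-injection"),
                Finding("session.py", 7, "weak-random")],
}
merged = merge_findings(runs)

# Findings that only one model caught: the "divergence" worth a second look.
unique = {f: models for f, models in merged.items() if len(models) == 1}
print(len(merged), len(unique))
```

Here the two runs yield three distinct findings, two of which were caught by only one model. An exact-match key is the simplest possible choice; real reports would likely need fuzzier matching, since two models rarely describe the same bug identically.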
https://github.com/swelljoe/flar - FLAR is the Fast Light Agent Restrictor. It bubblewraps an agent so it is quite safe to use agents on your local machine, even with `--dangerously-skip-permissions` (which makes agents more fun to use). The sandbox feature found in most agents is porous and can be expanded by the agent harness itself. Similarly, if the agent introduces a supply chain attack into your code and runs it before you get a chance to audit/review it in a PR or run it through an SBOM dependency checker, the blast radius is exactly the project directory and the credentials/history of the one agent. (Whereas, without flar, the blast radius is your whole .ssh, github creds, all agent creds, your keyring, whatever secrets are in your home, etc.) This one is new. Just made it because I was talking about how I always put agents in VMs because I don't trust them. Someone suggested `srt` (https://github.com/anthropic-experimental/sandbox-runtime) and I like the idea but I don't like how complicated and huge and JavaScript it is. You can read and understand the entirety of `flar` in one sitting. Anyway, to break out of "prompt/response", you have to skip permissions, or call it via `claude -p` or API with tasks to perform. Nelson does the latter and `flar` does the former.
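For readers who have not used bubblewrap, here is a minimal Python sketch of the kind of `bwrap` command line that confines a process to one project directory. It is not taken from `flar` and may differ from it in every detail; `bwrap_argv`, the example paths, and the choice of what to hide are assumptions. It only builds the argument list and executes nothing.

```python
import shlex
from typing import List, Sequence


def bwrap_argv(project_dir: str, home: str, agent_cmd: Sequence[str],
               extra_rw: Sequence[str] = ()) -> List[str]:
    """Build a bubblewrap command line; later mounts overlay earlier ones."""
    argv = [
        "bwrap",
        "--ro-bind", "/", "/",               # whole filesystem, read-only
        "--dev", "/dev",
        "--proc", "/proc",
        "--tmpfs", "/tmp",
        "--tmpfs", home,                     # hide ~/.ssh, keyrings, other creds
        "--bind", project_dir, project_dir,  # the project stays writable
    ]
    for path in extra_rw:                    # e.g. the one agent's own state dir
        argv += ["--bind", path, path]
    argv += [
        "--unshare-all",
        "--share-net",                       # the agent still needs the network
        "--die-with-parent",
        "--chdir", project_dir,
        *agent_cmd,
    ]
    return argv


cmd = bwrap_argv(
    project_dir="/home/me/src/myproject",    # hypothetical paths
    home="/home/me",
    agent_cmd=["claude", "--dangerously-skip-permissions"],
    extra_rw=["/home/me/.claude"],
)
print(shlex.join(cmd))
```

The ordering carries the design: the home directory is replaced with an empty tmpfs first, and only then are the project directory and the single agent's state directory bound back in, so nothing else under home is visible inside the sandbox.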
That's not to mention all the side projects and other stuff I've been able to make a lot of progress on.
The biggest one is finishing https://venturous.app/ (or, at least I made it do what I most wanted it to do, which is provide map overlays of US public lands and mobile data provider coverage so I can find cool places to camp free while staying connected). This is a re-implementation of an old defunct app called FreeRoam that I absolutely loved when I traveled full-time. I built half of it over several months by hand, and then Claude helped finish it in a few weekends and holidays. I'll get Claude to help build the mobile apps someday.
I’m writing a JSX templating language — to manage context, branching, etc automatically. You hand it a spec/existing work and it automatically applies a recipe.
So far that’s been much nicer for anything large or complex, because I was spending all my time on context piping.
YES!
It's still very wip, I spent a couple of weekends on it so far, but I'm working on a harness that eschews autonomy and instead aims to work as a pair programming partner. Key to that are distinct "driver" and "navigator" modes, with the capacity to flip between them rapidly.
https://gitlab.com/philbooth/opair
(not really usable yet, but after tomorrow's session I expect to be developing opair in opair, which is mildly exciting)
Love this idea and will be following closely! I've wanted a pair programmer style interaction for a while now. Something closer to VSCode's Copilot inline conversations and FIM, but where it's continuously watching what I'm doing and ruminating on suggestions.
I'm building "workboxes" to work on my startup. It helps me develop features insanely fast. A workbox is a simple worktree-in-a-sandbox per feature. I have a simple front end where I can launch new workboxes: I input a prompt (a documented grilling session) and it creates a branch, a PR, and starts an opencode coding session on an e2b sandbox based on a custom template with the app's monorepo. Each workbox has a public https endpoint so I can manually test the web app after the coding session is complete. At any point I can either approve the PR, send a follow-up prompt, or connect to the opencode session for more control.
I think my next step is to perform the grilling session inside the front end; currently I perform it in my terminal and then paste it into the front end.
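The worktree-per-feature part of that setup can be sketched in Python as follows. This is only a guess at the shape, not the commenter's code: `slugify`, `worktree_command`, the `workbox/` branch prefix, and the directory layout are hypothetical, and the sandbox, PR, and public endpoint steps are left out entirely.

```python
import re
from pathlib import PurePosixPath
from typing import List


def slugify(title: str) -> str:
    """Lowercase, and collapse anything that is not a letter or digit into '-'."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def worktree_command(repo: str, feature: str, base: str = "main") -> List[str]:
    """argv for `git worktree add -b <branch> <path> <base>`; nothing is executed."""
    slug = slugify(feature)
    branch = f"workbox/{slug}"
    # Put each worktree in a sibling directory of the main checkout.
    path = PurePosixPath(repo).parent / "workboxes" / slug
    return ["git", "-C", repo, "worktree", "add", "-b", branch, str(path), base]


print(worktree_command("/srv/app", "Add OAuth login!"))
```

Passing the returned list to `subprocess.run(cmd, check=True)` would create the branch and its own checkout, so several features can be in progress at once without stepping on each other's working directory.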
Is it similar to how Claude Code Web works? It generates a cloud container, clones your repo, works on whatever you want (preferably something specific), and then generates a branch and a PR.
Not being able to enter flow state is a very interesting observation. I've felt it too to the extent that I went down a whole new rabbit hole of what it means to be in flow state. Let me know if anybody here wants to know more, happy to post some links.
To answer your question - I discuss the approach with Claude Code (e.g., should I implement my own ACT model in JAX or PyTorch, Python or Rust or Julia, etc.). Then write the initial part of the code myself. Opening up a blank vscode is a simple joy of life I refuse to give up :-) I'll ask Claude for advice if I get stuck, it will helpfully offer to write that code for me, I obstinately decline. Eventually, I'll get bored of some minutiae or other, at which point I'll ask Claude to complete just that part of it.
I'd be interested in the rabbit hole of flow state. Also with regards to the dopamine rewards of solving a bug as motivation.
Sometimes using an LLM can assist with these, and sometimes it can feel like cheating myself out of a good thing, and I'm not entirely sure where the borders are. It could also be related to a sense of ownership or pride in one's work and seeing the value in doing quality work.
I'd love to have some links please :)
Yes, like many others I've been experimenting a lot. What I've got so far is a harness-of-harnesses, i.e., a harness which sits on top of Claude Code, Codex or OpenCode. I still use Claude Code or Codex directly for the initial planning of features, to investigate issues, and for small fixes, but whenever there's something even just a bit complex to do, I use my second-level harness.
Summarizing it a lot, what it does is:
* help you make better plans
* split plans into iterations, in a module-aware way for projects which have strict modularity (for now I'm doing this specifically with TypeScript and dependency cruiser) - this helps a lot when a project becomes complex
* ask an agent to implement an iteration, and then programmatically run a lot of checks after each iteration - not just regression tests, but also checks against project principles and conventions
* when possible, automatically fix deviations; when not possible, raise them to myself for an end-of-plan review
In this way, instead of having to constantly be engaged with the chat interface, with all the shorter or longer wait times which break my flow, I spend a lot of highly focused time during initial planning and final review. A plan implementation can go on for hours, and the various anchoring mechanisms added to the tool keep drift to a minimum.
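The implement, check, auto-fix, escalate cycle described in the list above might look roughly like this in Python. It is a sketch under stated assumptions, not the commenter's tool: `run_plan`, `PlanReport`, and the callback signatures are hypothetical, and the checks are plain functions standing in for regression tests and convention checks.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# A check inspects one iteration and returns a deviation message, or None if clean.
Check = Callable[[str], Optional[str]]


@dataclass
class PlanReport:
    completed: List[str] = field(default_factory=list)
    for_review: List[str] = field(default_factory=list)  # end-of-plan review queue


def run_plan(iterations: List[str],
             implement: Callable[[str], None],
             checks: Dict[str, Check],
             autofix: Callable[[str, str], bool]) -> PlanReport:
    report = PlanReport()
    for step in iterations:
        implement(step)  # hand the iteration to the underlying agent
        for name, check in checks.items():
            problem = check(step)
            if problem is None:
                continue
            # Try an automatic fix, then re-run the same check to confirm it.
            if autofix(step, problem) and check(step) is None:
                continue
            report.for_review.append(f"{step}: {name}: {problem}")
        report.completed.append(step)
    return report


# Toy workspace: one deviation is fixable, the other is not.
deviations = {"iter-1": "unused import", "iter-2": "crosses module boundary"}


def fix(step: str, problem: str) -> bool:
    if problem == "unused import":
        deviations.pop(step)
        return True
    return False


report = run_plan(["iter-1", "iter-2"], lambda step: None,
                  {"conventions": deviations.get}, fix)
print(report.for_review)
```

In the toy run, the unused import is fixed silently and only the module-boundary violation reaches the review queue, which is the property that lets a plan run for hours without a human in the chat.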
At some point I'm planning to release this tool as open source. As this is the result of months of trial and error, dogfooding, and vibecoding on the tool itself, the codebase is chaotic and the UI is still full of experiments I've mostly abandoned, and I'm not used to releasing stuff in this state. But perhaps, in this brave new world, I should just do it and see what happens?
I think the fundamental aspect of flow is that it requires a high amount of cognitive engagement. Most of the time you're just not getting that from interacting with an LLM because the process is relatively passive. There are also forced breaks while it does its internal CoT, which break flow.
I think a lot of people get a sort of novelty effect when first interacting with an LLM which can feel superficially like flow, but it's different in that it eventually wanes and what really happens in practice is you're encouraged to disengage and this makes it almost impossible to get into a true flow state.
The risk here I think is that if you get humans disengaging from the task at hand, there's a higher chance of bugs being introduced. You might move slightly faster in the short term but be forced to hit the brakes in the medium/long term.
If you like videos, I saw an interesting one yesterday about systems thinking and software as an ecosystem, particularly with AI. It's more of an overview, but it gives some insight into where we might be able to experiment with different ways of working. It's more focused on teams and companies than individual developers, but I think it could be applied to the single dev.
"Software engineering at the tipping point" https://www.youtube.com/watch?v=2n41YjR5QfU
Something I'm thinking about and doing a bit of experimentation with is using LLMs to write specialist higher level code.
Rather than ask them to write web-apps in webby languages with open source frameworks etc, providing a very fixed, on-rails development process where everything is abstracted away. Accept that it'll be less powerful, but take the trade-off that it'll hopefully be faster and produce much more controllable software.
Concrete example, why do we let the LLM choose a database, schema, migration procedure, library, etc. We could decide to only support one database, enforce schema design (such as every table containing access control), enforce a migration process, enforce a library, even do schema design in a fixed config file rather than arbitrary DDL. Same for auth, deployments, even UI.
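To make the "fixed config instead of arbitrary DDL" idea concrete, here is a rough sketch. Everything in it is an assumption for illustration: the config shape, the `owner_id` access-control column, and the allowed type list are made up, not taken from any real framework.

```python
# Columns the (hypothetical) framework adds to every table, including access control.
REQUIRED_COLUMNS = {"id": "INTEGER PRIMARY KEY", "owner_id": "INTEGER NOT NULL"}
# The only column types the LLM is allowed to pick from.
ALLOWED_TYPES = {"INTEGER", "TEXT", "REAL", "BOOLEAN", "TIMESTAMP"}


def validate_schema(schema):
    """Return a list of problems; an empty list means the config is acceptable."""
    errors = []
    for table, columns in schema.items():
        for name, col_type in columns.items():
            if col_type not in ALLOWED_TYPES:
                errors.append(f"{table}.{name}: type {col_type!r} is not allowed")
            if name in REQUIRED_COLUMNS:
                errors.append(f"{table}.{name}: column is managed by the framework")
    return errors


def to_ddl(schema):
    """Generate CREATE TABLE statements; the LLM never writes DDL directly."""
    errors = validate_schema(schema)
    if errors:
        raise ValueError("; ".join(errors))
    statements = []
    for table, columns in schema.items():
        cols = {**REQUIRED_COLUMNS, **columns}  # enforced columns always come first
        body = ", ".join(f"{name} {col_type}" for name, col_type in cols.items())
        statements.append(f"CREATE TABLE {table} ({body});")
    return statements


print(to_ddl({"notes": {"body": "TEXT", "pinned": "BOOLEAN"}}))
print(validate_schema({"notes": {"body": "JSONB"}}))
```

The LLM only edits the dictionary, so a bad type or an attempt to redefine the access-control column fails before anything touches the database.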
This sounds a bit like Ruby on Rails including Hotwire? Even has the “on-rails development” in the name, schema design in a config, migrations, etc.
Though some frontend decisions are a bit more open
I'm in the same boat and I'm not a fan of the current way agents work, but I think tooling is what needs to catch up.
So, I actually decided to try to tackle it myself and worked some months (full time) on it.
https://beolis.com is the result of that, it's a local cli in a kanban board style with a remote server to keep the team on track (I've been using it myself for some time and actually started to ask some friends to use it just yesterday -- feedback very welcome, I still wanted to do some additional things before asking more people to use it, but oh well, I'm a fan of building in public anyways and it's probably better to have feedback sooner rather than later).
The main point there is that you work mostly in the ticket description (your own spec) and the plan (the spec as the agent sees it, generated with a custom workflow), and then have another custom workflow implement it (you can choose how you want it -- https://beolis.com/blog/post/custom-coding-workflows has some info on what I'm using myself).
As a result, at least for me, I do spend more time immersed in a flow state (although I'm in that state while writing the specs and reviewing code; when things get more complicated, it's sometimes more work to write the spec in a way the agent can work with than to just dive into the code, so going into "code" mode is something I still have to do. Agents are definitely not perfect).
I guess I'm lacking in docs on how to effectively use it. I have plans to create a video next week and post it in the blog, so, if you're interested, keep track of it ;)
Just yesterday I tried to find an annoying and persistent bug in the communication between a Lyrion Media Server and my player. I used Opencode's native Big Pickle AI, and at first it was a pain in the neck: it gave me new code, I had to start the player and test the control in the server's web GUI, report the errors back, and so forth. It tried a lot, but never found the real cause.
Then I got tired and told it to use Playwright to control the browser and test by itself. After some hangs that I had to stop manually, it did everything by itself and finally fixed the bug. I had to increase the agents' steps setting in the config, but that was it. While it was fixing the bug, I surfed the web and kept an eye on it, but it did everything on its own. Impressive.
I am currently in the process of launching my AI teams platform that I've been working on since at least January. It's https://PersonaStack.ai. I'm doing it without VC money and all by myself. I've used over 110B tokens so far building it.
You get some amazing results with teams of AIs if you do it right. The key is to control behavior with what integrations and responsibilities each agent has. That way they naturally adapt, delegate, fact check each other, and generally act more autonomously.
This is already running the automated news site ainews.personastack.ai complete with social media posts 100% automated.
It also runs the issue triage, coding, reviews, and releases for the Kuberhealthy open source CNCF project, which is another thing of mine.
I don't think the next step is really smarter models. It's how we make the models more effective, and teams, when done right, net the best results I've seen.
Hoping to get noticed here soon, but it's extremely hard to do solo I'm finding.
Very impressive, especially since you did it solo. The website looks great and explains everything in detail.
Can you elaborate more about its development? How much do 110B tokens equate to in $$$? What LLM did you prefer most during development? Any suggestions for other solo developers trying to launch their LLM-built product?
Doesn't this go directly against what the author is asking about? You're much less likely to enter flow state if you have a team of AI agents which are supposed to be autonomous.
Maybe project managers will finally get to experience flow states?
I'm in the middle of it so don't have any conclusions for you, but I started mucking with building my own cli coding app and there are _tons_ of levers available that aren't apparent from claude code or codex.
Including altering the turn concept. I think it is still ultimately call and response, but instead of everything being a quarter note you can get a little closer to a beat you like.
Same here. Every time I come up with a complicated way to personalize my workflow I end up finding out there's already a better way to do it.
The only thing that I consistently do is create a simple html dashboard with a to-do list I can guide claude code with while rendering progress somewhat graphically. I love the levers but it's kinda the opposite of the flow in question.
I'd love to have an interface where I could have several conversations with an LLM, with common context, but each separate and anchored to a different place in the code.
I have a custom harness that runs in a macOS VM. It has e-mail and its own accounts. I assign it tasks in Linear, it does them and spins up PRs for me to review. This works pretty well, generally. I have to spend time writing stories and doing code review, but I don’t have to follow its (their — I have 3 of them) every move.
A) spec driven development
B) opinionated skills that use GitHub tickets, merge gates and execution of ticket graphs
Just want to add:
I'm trying to do the same amount of work faster, not do work in parallel or agent orchestration. I'm not against letting the model go off and do things on its own; that has its time and place.
But if I can do something in 15 minutes instead of 1 hour without the annoying prompt response loop, without the feeling that there could be blind spots, and while keeping all of the context (or at least most) in my head. That's a bigger win than spinning up 5 agents to do different things.
One of the things I've been talking about with my senior developers is how the bottleneck has shifted even more dramatically to human code understanding vs code generation. AI is still not suitable for generating production-grade code without a human checking it (yet), but it can produce a huge amount of code for humans to check. We've been experimenting with having AI find better ways of communicating what is in a change at different abstraction levels, for example by always generating diagrams showing what it did. The idea is that anything that speeds up human understanding of changes addresses the core bottleneck of the whole process.
using tools like claude code and codex constantly boosts our dopamine, making hyperflow impossible. these days, most engineers work on multiple projects simultaneously to satisfy their dopamine receptors.
I think the value right now is to focus less on external orchestration, if at all. trust the (current best) model to do it better than anything you bolt on to the harness. focus your energy on providing clearer specs. I think the optimal spec is a disambiguated (through liberal use of the AskUserQuestion tool) statement of 1) intent, 2) input/output contracts, 3) constraints and 4) preconditions. focus on that and get out of the model's way. I think of it like this: imagine a person who was not as smart as you was trying to tell you how to do a task. would you want more verbosity and step by step instructions, or would you want them to just cut to the chase (i.e., what are you trying to do, what are the obstacles, I'll let you know if I have questions)?
also let the model verify itself. don't give it an objective that is vague, give it clear exit criteria for goals and let it loop until it gets there. so much of the orchestration scaffolding seems like massive technical debt
oddly, I do the opposite of a lot of conventional advice when it comes to models. I use no memory, I think there is something similar to context rot when everything is stored. I like creating markdown files as memory that the model can grep if needed. I also haven't found a real use for hooks yet, I have tried but they always seem to get in the way. skills on the other hand are very undervalued. they are so much more powerful than many realize. I used to think agents were where the power was. I think it's actually skills. agents are really for context preservation. skills are what increase capabilities
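a rough sketch of what "clear exit criteria and let it loop" could look like outside any particular harness. `run_until_done`, the criteria names, and the fake agent step are all hypothetical; in practice the step would be a model call and the criteria would run real tests or linters.

```python
def run_until_done(step, criteria, state, max_rounds=10):
    """Call step() until every named criterion passes or the round budget runs out.

    Returns (rounds_used, names_of_criteria_still_failing).
    """
    rounds = 0
    while True:
        failing = [name for name, passed in criteria.items() if not passed(state)]
        if not failing or rounds == max_rounds:
            return rounds, failing
        step(state, failing)  # the model only sees what is still failing
        rounds += 1


# Explicit, checkable exit criteria instead of a vague objective.
criteria = {
    "tests pass": lambda s: s["tests_failing"] == 0,
    "lint clean": lambda s: s["lint_errors"] == 0,
}


def fake_agent_step(state, failing):
    # Stand-in for a model turn: fixes one failing test per round, and lint at once.
    if "tests pass" in failing:
        state["tests_failing"] -= 1
    if "lint clean" in failing:
        state["lint_errors"] = 0


print(run_until_done(fake_agent_step, criteria, {"tests_failing": 3, "lint_errors": 1}))
```

the `max_rounds` cap is the only scaffolding here; everything else is just the criteria themselves.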
I'm not even talking about quantity of items in memory, I mean dilution of intent. I really love a model with a clean slate and only the items it needs. I fear the memory guides the model in areas that might not be what I want with the current prompt
progressive disclosure is a big one. you can make context available but it is only loaded when needed. like lazy loading for prompt engineering. skills are to be used to instruct the model how to do something specific that is not in its training data. like how to access my proprietary system, how to interface with a custom program. you can embed templates in skills, you can embed code that executes in skills and only the output is loaded into context. skills expand capabilities, agents constrain context
(constraining context is a very good thing btw, don't mean to imply that agents are somehow inferior to skills)
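a small sketch of the progressive disclosure idea, assuming a made-up `SkillRegistry` (this is not how any specific harness implements skills). only the one-line summaries are always in context; a skill body, or the output of code embedded in a skill, is produced when the model asks for it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Skill:
    name: str
    summary: str                  # one line, always present in the context
    load_body: Callable[[], str]  # full instructions or script output, produced on demand


class SkillRegistry:
    def __init__(self, skills):
        self._skills = {s.name: s for s in skills}
        self.loaded = []  # names whose bodies were actually pulled into context

    def index(self):
        """Cheap listing that is always included in the prompt."""
        return "\n".join(f"- {s.name}: {s.summary}" for s in self._skills.values())

    def use(self, name):
        """Load one skill body only when the model asks for it."""
        self.loaded.append(name)
        return self._skills[name].load_body()


def count_open_orders():
    # Stand-in for code embedded in a skill: it runs outside the context window
    # and only its short output is returned.
    orders = [{"id": 1, "open": True}, {"id": 2, "open": False}, {"id": 3, "open": True}]
    return f"open orders: {sum(o['open'] for o in orders)}"


registry = SkillRegistry([
    Skill("deploy", "how to deploy to the internal cluster",
          lambda: "1. run the build\n2. tag the release\n3. notify the channel"),
    Skill("orders", "query the proprietary order system", count_open_orders),
])
print(registry.index())
```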
It feels like everyone and their grandma is building an agent orchestrator at the moment, but I'm not hearing a lot of success stories. The fact that Anthropic and OpenAI haven't laid off all their software engineers already is probably a sign that orchestration breaks down somewhere. I suspect it's just a more elaborate way of burning tokens. I'm still interested in experimenting though.
I'm currently rolling out Matt Pocock's Sandcastle project so that I can have those brakes removed. What will be left is just the grilling(/wayfinding).
My current flow heavily relies on Matt Pocock's Skills and Sandcastle project. I find them highly valuable in practice: grilling(/wayfind) into a spec and extract issues. Those live in Linear projects. I'm pointing my Sandcastle set-up at such Linear projects (or loose issues), which results in an MR.
Currently at the point of self-improving the prompts and Sandcastle set-up with a retrospective pass of the logs.
I don't think you would expect to get into a flow state if you were intermittently directing another (human) programmer to do work, and you shouldn't expect to with LLM-driven coding either. Perhaps you are best finding out ways to extend the length of time where the LLM can work without prompting, then use that downtime to focus on other tasks that will help you to guide it better the next time you need to prompt it.
I feel the opposite. Creating a DTO or wiring up a CQRS command takes me out of the flow. And while I enjoy a good refactoring, it would be nice if I could just have it refactor code in the background while I'm still working in the same file.
Well, it always depends on your environment. In my case, nothing forces me to heavily use AI, so my workflow is kind of the old way, but with less hassle.
- Do your thinking alone. (AI part: search, understanding)
- Speccing. (AI part: search, understanding, completing some text)
- Coding like the old days. (AI part: search, understanding, code examples)
- Okay, now I have a good idea of how my feature is going to work
- Look for fluff code and delegate it to AI to write/review it.
- Focus on the part of the code I want to have fun doing.
- Review.
- Repeat.
It’s slower than the approach of doing specs and letting AI do the rest, while focusing your role only on code review. However, I’m more in control of what I build, I can explain what I built better than anyone else, and I build up my knowledge. (Also, I have fewer problems, because less code, haha.)
Will I go the full agentic way? Maybe, but I will find a way to slow it down so I can stay in control.
I created a small PI extension that always watches relevant directories and answers me in place, without switching context, or using a chat interface. Still experimenting but I like it.
https://github.com/piqoni/pi-piqo
Computers are like a bicycle for the mind.
LLM AI is like Uber for the mind.
I've built a couple of things in the past few months that have leaned heavily on an LLM as my programmer. Mainly Claude Code, but occasionally Codex also. It's a different way to produce. I spend more time doing something like plain-text feature mapping: simple .md files, good flow and creativity. Then once I'm happy with it, I pass it off to the dev team (Claude) to code up and integrate. I feel like I'm flowing in the part of the process I always was. But the buzz of getting something working is gone. It's more like the slow satisfaction of getting something useful at the end.
I keep a TODO file where I just write my ideas in free text, and every once in a while I tell claude "I updated the TODO file".
This is basically like queueing up prompts.
I wish Claude Code had a thing like that builtin. Like a "user ideas scratchpad".
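For what it's worth, the "scratchpad as prompt queue" idea is easy to approximate by hand. A minimal sketch, assuming the TODO file is one idea per line and that you keep the previously seen version around; `new_ideas` and `as_prompt` are hypothetical helpers, not a Claude Code feature.

```python
def new_ideas(previous, current):
    """Return lines present in the current TODO text but not in the previous one."""
    seen = {line.strip() for line in previous.splitlines()}
    fresh = []
    for line in current.splitlines():
        idea = line.strip()
        if idea and idea not in seen:
            fresh.append(idea)
            seen.add(idea)  # also drops duplicates within the new text
    return fresh


def as_prompt(ideas):
    """Turn the queued ideas into a single follow-up prompt."""
    return "I updated the TODO file. New items:\n" + "\n".join(ideas)


before = "- dark mode\n"
after = "- dark mode\n- export to CSV\n\n- undo for bulk delete\n"
print(as_prompt(new_ideas(before, after)))
```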
The fundamental problem I keep seeing across all harnesses is the use of the exact same UX afforded by a git-based backend. If we want to stay in flow, the LLM's edit backend would have to be based on something like CRDTs to handle simultaneous edits.
My current approach which I've been testing on two MVPs with what I would call 'moderate success' (but hey, actual success!)
3 tier, philosophy-spec-design. Increasing detail. Design files include db model explanations and pseudocode/function headers - that level of detail.
For each thing I need to change, I have a prompt ready to go that asks the agent to follow about 5 steps, and it outputs a 'reviewfile' with details of what it thinks about the thing I posited. I review its output. I have another prompt ready to then get an agent to generate a taskfile + update the design documentation. The taskfile explains in great detail what has changed and what needs to be implemented. I review the taskfile and the git diffs of the design doc changes. Finally an agent implements the taskfile. I review all changed code and commit.
It gets there, but still definitely misses some stuff. It's very adequate for a MVP I'm finding.
Edit: this seems to only work with Opus. Sonnet can't do it (maybe I'm just lucky and Opus is seriously compensating for an awful approach?)
I've been working on inverting the control theory for the agent loop. Instead of the user initiating everything, the agent runs automatically in the background and calls the user for feedback as part of tool use. The end game for me is to get rid of the chat interface altogether and move back toward async email and other messaging channels. The chatbot UI as a means of driving the business always felt like a temporary stepping stone / clever demo.
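A toy sketch of that inversion, under stated assumptions: `tick`, the `state` dictionary, and `demo_work` are hypothetical, and the "outbox" stands in for email or any other async channel. The loop keeps evaluating every task that is not blocked, and asking the human is just one possible outcome of a work step.

```python
def tick(state, work):
    """One evaluation pass over all pending tasks that are not waiting on a human."""
    for task in list(state["pending"]):
        answer = state["answers"].get(task)
        if task in state["outbox"] and answer is None:
            continue  # question already sent; do not block the other tasks
        status, payload = work(task, answer)
        if status == "done":
            state["results"][task] = payload
            state["pending"].remove(task)
        else:  # status == "ask": payload is a question for the human
            state["outbox"][task] = payload


def demo_work(task, answer):
    # Stand-in for an agent step: one task needs a human decision, the other does not.
    if task == "rename-column" and answer is None:
        return "ask", "Rename in place or add a new column?"
    return "done", f"{task} finished ({answer or 'no input needed'})"


state = {"pending": ["bump-deps", "rename-column"], "answers": {}, "outbox": {}, "results": {}}
tick(state, demo_work)   # bump-deps completes, rename-column sends a question
tick(state, demo_work)   # nothing to do: still waiting for the reply
state["answers"]["rename-column"] = "add a new column"  # the reply arrives later
tick(state, demo_work)
print(state["results"])
```

In a real system `tick` would run on a timer or on incoming messages rather than being called by hand.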
I think there are 10-100x productivity gains lurking in here. It is very expensive for a human to reserialize their mental state into a prompt each time a task needs working on. An agent can do this ~instantly and with high frequency 24/7. The higher the rate of evaluation the less change has to be dealt with between any two iterations. So, the likelihood that a given iteration needs human help goes down as you increase the rate of evaluation per unit of wall clock time. Tighter and faster control loops tend to require less severe corrective measures than slow and sloppy ones.
This is the most plausible reason for so many tokens in the future. I can actually see a million tokens per second making sense. I have a pretty good idea how I'd approach this if I actually had access to this kind of infrastructure. 1Mtok/s is baby tier in terms of raw information theory. The politics of employing a system like this are far more terrifying to me than any technological aspects. Humans really like having control over things, even when that control is pure downside for the business.
Related question: are there any close-to-gpt5.5/opus-level good autocompletion models?
when it comes to autocomplete, the harness matters more than the model
I build a lot of my own tools, to suit exactly how I want to work. Obviously, having a little thinky guy in the computer to do most of the busy work of making new tools accelerates that, but tools that make the LLMs suit me also accelerate my general work.
Some stuff I've built:
https://github.com/swelljoe/tandem - Tandem is a sysadmin buddy that travels with you over ssh. Just a wrapper over tmux and claude code (or whatever agent you like), it opens two panes in tmux, one with an ssh session to one of the hundreds of devices I maintain, and one with a local Claude Code configured to use a local work space and instructed via CLAUDE.md/AGENTS.md to use tmux to interact with the remote machine. I built it because a lot of my coworkers were installing Claude Code on our robots and authenticating there to get help with robot troubles, and that felt bad. This allows them to keep all sensitive stuff locally and still get help troubleshooting directly on the device. I happen to find it useful, sometimes, too.
https://github.com/swelljoe/nelson - Nelson is a fancy Ralph loop for security bug hunting that I built to help audit my own software. It's also grown to include a benchmark suite I'm using to figure out which models are worth using for security work. I've published some of those benchmark results, and have a few hundred hours/dollars worth of new ones to publish this weekend. Turns out the benchmarking is more interesting, so that's gotten more attention than the bug-hunting side, but the benchmarks inform how the bug-hunting side works, and I added multi-model/multi-pass scans and de-dupe features recently because I found that letting models have a couple bites at the apple increases discovery, and there are bugs that only some models catch, and it's not always the top model that finds them. There's some overlap, but also some divergence. This research has also led me to start working on a harness for security auditing tasks; giving the agent tools and project structure data to lift detection and reduce false positives.
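To illustrate the multi-model/multi-pass plus de-dupe idea in general terms (this is not Nelson's code), here is a minimal sketch. The model names, the canned findings, and the choice of `(file, line, kind)` as the de-dupe key are all assumptions; real de-duplication would likely need fuzzier matching.

```python
from collections import defaultdict


def merge_findings(scan, models, passes, target):
    """Run every model several times and map each unique finding to the models that found it."""
    found_by = defaultdict(set)
    for model in models:
        for pass_no in range(passes):
            for finding in scan(model, target, pass_no):
                found_by[finding].add(model)  # duplicates collapse on the tuple key
    return dict(found_by)


# Canned results standing in for real scans, keyed by (model, pass number).
CANNED = {
    ("big-model", 0): [("auth.py", 41, "sql-injection")],
    ("big-model", 1): [("auth.py", 41, "sql-injection"), ("upload.py", 12, "path-traversal")],
    ("small-model", 0): [("session.py", 88, "weak-random")],
    ("small-model", 1): [("auth.py", 41, "sql-injection")],
}


def fake_scan(model, target, pass_no):
    return CANNED.get((model, pass_no), [])


merged = merge_findings(fake_scan, ["big-model", "small-model"], 2, "demo-repo")
for finding, models in sorted(merged.items()):
    print(finding, sorted(models))
```

In the canned data, one bug only shows up on the larger model's second pass and another is only caught by the smaller model, which is the kind of divergence described above.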
https://github.com/swelljoe/flar - FLAR is the Fast Light Agent Restrictor. It bubblewraps an agent so it is quite safe to use agents on your local machine, even with `--dangerously-skip-permissions` (which makes agents more fun to use). The sandbox feature found in most agents is porous and can be expanded by the agent harness itself. Similarly, if the agent introduces a supply chain attack into your code and runs it before you get a chance to audit/review it in a PR or run it through an SBOM dependency checker, the blast radius is exactly the project directory and the credentials/history of the one agent. (Whereas, without flar, the blast radius is your whole .ssh, github creds, all agent creds, your keyring, whatever secrets are in your home, etc.) This one is new. Just made it because I was talking about how I always put agents in VMs because I don't trust them. Someone suggested `srt` (https://github.com/anthropic-experimental/sandbox-runtime) and I like the idea but I don't like how complicated and huge and JavaScript it is. You can read and understand the entirety of `flar` in one sitting. Anyway, to break out of "prompt/response", you have to skip permissions, or call it via `claude -p` or API with tasks to perform. Nelson does the latter and `flar` does the former.
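For readers who haven't used bubblewrap, here is a rough sketch of the kind of `bwrap` invocation that limits the blast radius to a project directory plus one agent's state. It is not `flar`'s actual implementation; `sandbox_argv` and the example paths are hypothetical, and the function only builds the argument list rather than running anything.

```python
import os
import shlex


def sandbox_argv(project_dir, agent_state_dir, command, home=None):
    """Build a bwrap command line that hides $HOME except for two writable paths."""
    home = home or os.path.expanduser("~")
    project_dir = os.path.abspath(project_dir)
    agent_state_dir = os.path.abspath(agent_state_dir)
    return [
        "bwrap",
        "--unshare-all",        # new namespaces for everything...
        "--share-net",          # ...except the network, which the agent needs
        "--die-with-parent",
        "--ro-bind", "/", "/",  # the system is visible but read-only
        "--dev", "/dev",
        "--proc", "/proc",
        "--tmpfs", "/tmp",
        "--tmpfs", home,        # empty home: no ~/.ssh, keyring, or other agents' creds
        "--bind", agent_state_dir, agent_state_dir,  # this one agent's creds/history
        "--bind", project_dir, project_dir,          # the only writable project
        "--chdir", project_dir,
        *command,
    ]


argv = sandbox_argv("/home/me/proj", "/home/me/.claude",
                    ["claude", "--dangerously-skip-permissions"], home="/home/me")
print(shlex.join(argv))
```

Order matters: the empty tmpfs is mounted over the home directory first, and the two writable binds are layered on top of it afterwards.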
That's not to mention all the side projects and other stuff I've been able to make a lot of progress on.
The biggest one is finishing https://venturous.app/ (or, at least I made it do what I most wanted it to do, which is provide map overlays of US public lands and mobile data provider coverage so I can find cool places to camp free while staying connected). This is a re-implementation of an old defunct app called FreeRoam that I absolutely loved when I traveled full-time. I built half of it over several months by hand, and then Claude helped finish it in a few weekends and holidays. I'll get Claude to help build the mobile apps someday.
> but I haven't been able to enter flow state like I can when I hand write code.
Fixing that for you.
I haven't been able to enter flow state like I can when I write code.
Prompts are a higher-level programming language.
You are the bottleneck.
Why should AI be limited to human time? Is a mountain? A galaxy?
I’m writing a JSX templating language — to manage context, branching, etc automatically. You hand it a spec/existing work and it automatically applies a recipe.
So far that’s been much nicer for anything large or complex, because I was spending all my time on context piping.