A quick clarification on intent, since “browser automation” means different things to different people:
This isn’t about making scripts smarter or replacing Playwright/Selenium. The problem I’m exploring is reliability: how to make agent-driven browser execution fail deterministically and explainably instead of half-working when layouts change.
Concretely, the agent doesn’t just “click and hope”. Each step is gated by explicit post-conditions, similar to how tests assert outcomes:
Python example:

----
ready = runtime.assert_(
    all_of(url_contains("checkout"), exists("role=button")),
    "checkout_ready",
    required=True,
)
----
If the condition isn’t met, the run stops with artifacts instead of drifting forward. Vision models are optional fallbacks, not the primary control signal.
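To make the "stop with artifacts" part concrete, here's a minimal sketch of the control flow; the names (`gate`, `AssertionFailed`, the runtime accessors, the artifact fields) are illustrative, not the actual API:

----
# Illustrative sketch only: these names are not the real API.
import json
import time


class AssertionFailed(Exception):
    """Raised when a required post-condition does not hold."""


def gate(runtime, predicate, label):
    """Check a post-condition; on failure, persist artifacts and stop the run."""
    if predicate(runtime):
        return True
    artifacts = {
        "label": label,
        "url": runtime.current_url(),            # assumed accessor
        "dom_snapshot": runtime.dom_snapshot(),  # assumed accessor
        "failed_at": time.time(),
    }
    with open(f"artifact_{label}.json", "w") as f:
        json.dump(artifacts, f, indent=2)
    raise AssertionFailed(f"post-condition '{label}' not met; artifacts saved")
----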
Happy to answer questions about the design tradeoffs or where this approach falls short.
It is interesting subject matter; I am working on something similar. But the descriptions are quite terse. Maybe I just failed to glean:
* When you "run a WASM pass", how is that generated? Do you use an agent to do the pruning step, or is it deterministic?
* Where do the "deterministic overrides" come from? I assume they are generated by the verifier agent?
The WASM pass is fully deterministic: it’s just code running in the page to extract and prune post-rendered elements (roles, geometry, visibility, layout, etc.), with no agent involved in the Chrome extension.
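For a sense of what "extract and prune" means downstream, here's a minimal sketch (Python for consistency with the example above; the record fields are assumptions, not the actual schema):

----
# Assumed record shape and a purely rule-based prune; no model in the loop.
from dataclasses import dataclass
from typing import List


@dataclass
class ElementRecord:
    role: str
    x: float
    y: float
    width: float
    height: float
    visible: bool
    in_viewport: bool


def prune(elements: List[ElementRecord]) -> List[ElementRecord]:
    """Keep only elements that are visible, have area, and sit in the viewport."""
    return [
        e for e in elements
        if e.visible and e.in_viewport and e.width > 0 and e.height > 0
    ]
----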
The “deterministic overrides” aren’t generated by a verifier agent either; they’re runtime rules that kick in when assertions or ordinality constraints are explicit (e.g. “first result”). The verifier just checks outcomes — it doesn’t invent actions. AI agents are non-deterministic by nature, and we don’t want to introduce that into the verification layer, which stays predicate-only.
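A minimal sketch of what such an override can look like for explicit ordinality, assuming element records with geometry like the ones above (the names here are made up):

----
# Hypothetical ordinality override: an explicit ordinal in the step text is
# resolved by a fixed rule over extracted geometry, never by the agent.
import re

ORDINALS = {"first": 0, "second": 1, "third": 2}


def resolve_ordinal(step_text, candidates):
    """Return the nth candidate in reading order if the step names an ordinal."""
    match = re.search(r"\b(first|second|third)\b", step_text.lower())
    if not match:
        return None  # no explicit ordinal: fall back to normal resolution
    # Reading order: top-to-bottom, then left-to-right, from extracted geometry.
    ranked = sorted(candidates, key=lambda e: (e.y, e.x))
    index = ORDINALS[match.group(1)]
    return ranked[index] if index < len(ranked) else None
----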
I took a look at the quickstart with aim of running this locally and found that an API key is needed for the importance ranking.
What exactly is importance ranking? Does the verification layer still exist without this ranking?
Does the browser expose its accessibility tree instead of the raw DOM element tree? The accessibility tree should be enough; I mean, it's all that's needed for vision-impaired customers, and technically the AI agent _is_ a vision-impaired customer. For fair usage, try the accessibility tree.
The accessibility tree is definitely useful, and we do look at it. The issue we ran into is that it’s optimized for assistive consumption, not for action verification or layout reasoning on dynamic SPAs.
In practice we’ve seen cases where the AX tree is incomplete, lags hydration, or doesn’t reflect overlays or grouping accurately, and it doesn’t support ordinality queries well. That’s why we anchor on the post-rendered DOM plus geometry and then verify outcomes explicitly, rather than relying on any single representation.
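As a concrete example of the overlay case, here's a rough geometry check (again with assumed record fields, not our actual code) that flags a target that is present in the tree but mostly covered by a modal:

----
# Rough sketch with assumed fields: post-render geometry used to detect a
# target that exists but is mostly covered by an overlay.
def overlap_fraction(target, other):
    """Fraction of the target's box covered by another element's box."""
    left = max(target.x, other.x)
    top = max(target.y, other.y)
    right = min(target.x + target.width, other.x + other.width)
    bottom = min(target.y + target.height, other.y + other.height)
    if right <= left or bottom <= top:
        return 0.0
    return (right - left) * (bottom - top) / (target.width * target.height)


def is_occluded(target, overlays, threshold=0.5):
    """Treat the target as not actionable if an overlay covers most of it."""
    return any(overlap_fraction(target, o) >= threshold for o in overlays)
----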
Great point about the accessibility tree @joeframbach. The "vision-impaired customer" analogy is spot on - if an interface works for screen readers, it should work for AI agents.
What I find most compelling about this approach is the explicit verification layer. Too many browser automation projects fail silently or drift into unexpected states. The Jest-style assertions create a clear contract: either the step definitively succeeded or it didn't, with artifacts for debugging.
This reminds me of property-based testing - instead of hoping the agent "gets it right," you're encoding what success actually looks like.
Thanks — that’s exactly our motivation. The key shift for us was moving from “did the agent probably do the right thing?” to “can we prove the state we expected actually holds.”
The property-based testing analogy is a good one — once you make success explicit, failures become actionable instead of mysterious.
You realize you are responding to a brand new account posting an obviously AI-generated response?
Slop shit discussing slop shit.