Defence attorney: "Judge, I object"
Judge: "On what grounds?"
Defence attorney: "On whichever grounds you find most compelling"
Judge: "I have sustained your objection based on speculation..."
This post could be an entire political campaign against AI and its danger to humankind and the jobs of BILLIONS
Quick summary of how dumb and dangerous generative AI can be.
How so? Care to elaborate?
Is the LLM an expensive way to solve this? Would a more predictive model type be better? Then the LLM summarizes the PR and the model predicts the likelihood of needing to update the doc?
Does using an LLM help avoid the cost of training a more specific model?
An LLM does not understand what "user harm" is. This doesn't work.
Well, it's all about linguistic relativism, right? If you can define "user harm" in terms of things it does understand, I think you could get something that works
We kept asking LLMs to rate things on 1-10 scales and getting inconsistent results. Turns out they're much better at arguing positions than assigning numbers, which makes sense given their training data. The courtroom structure (prosecution, defense, jury, judge) gave us adversarial checks we couldn't get from a single prompt. Curious if anyone has experimented with other domain-specific frameworks to scaffold LLM reasoning.
Experimented very briefly with a mediation (as opposed to a litigation) framework but it was pre-LLM and it was just a coding/learning experience: https://github.com/dvelton/hotseat-mediator
Cool write-up of your experiment, thanks for sharing. Would be interesting to see how results from one framework (mediation, whose goal is "resolution") differ from the other (litigation, whose goal is, basically, "truth/justice").
That's really cool! That's actually the standpoint we started with. We asked what a collaborative reconciliation of document updates would look like. However, the LLMs seemed to get "swayed" or show "bias" very easily. This brought up the point about needing an adversarial element. Even then, context engineering is your best friend.
You kind of have to fine-tune the objectives for each persona and how much context each is entitled to; that ensures an objective court proceeding where debates in both directions carry equal weight!
I love your point about incentivization. That seems to be a make-or-break element for a reasoning framework such as this.
The reasoning gains make sense but I am wondering about the production economics. Running four distinct agent roles per update seems like a huge multiplier on latency and token spend. Does the claimed efficiency actually offset the aggregate cost of the adversarial steps? Hard to see how the margins work out if you are quadrupling inference for every document change.
The funnel is the answer to this. We're not running four agents on every PR: 65% are filtered before review even begins, and 95% of flagged PRs never reach the courtroom. This is because we do think there's some value in a single agent's judgment, and the prosecutor gets to choose when to file charges and when not to.
Only ~1-2% of PRs trigger the full adversarial pipeline. The courtroom is the expensive last mile, deliberately reserved for ambiguous cases where the cost of being wrong far exceeds the cost of a few extra inference calls. Plus you can make token/model-based optimizations for the extra calls in the argumentation system.
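To make the shape of that funnel concrete, here's a rough sketch; every name, threshold, and check below is made up for illustration and isn't the actual pipeline:

```python
import random
from dataclasses import dataclass

@dataclass
class Charge:
    needs_update: bool
    confidence: float

# --- stand-ins for the real checks, just so the sketch runs ---
def touches_documented_surface(pr: dict) -> bool:
    # Stage 1 stand-in: cheap heuristics like path globs / diff size.
    return any(p.endswith((".py", ".md", ".proto")) for p in pr["files"])

def prosecutor_review(pr: dict) -> Charge:
    # Stage 2 stand-in: a single-agent judgment call.
    return Charge(needs_update=True, confidence=random.random())

def run_courtroom(pr: dict, charge: Charge) -> bool:
    # Stage 3 stand-in: prosecution/defense/jury/judge calls would go here.
    return charge.needs_update

def review_pr(pr: dict) -> tuple[bool, str]:
    # Stage 1: cheap filter; most PRs stop here.
    if not touches_documented_surface(pr):
        return False, "filtered before review"
    # Stage 2: single-agent judgment; the prosecutor decides whether to
    # "file charges". Most flagged PRs never go further.
    charge = prosecutor_review(pr)
    if charge.confidence >= 0.9:
        return charge.needs_update, "single-agent verdict"
    # Stage 3: the expensive adversarial courtroom, reserved for the
    # small ambiguous tail where being wrong costs more than extra inference.
    return run_courtroom(pr, charge), "courtroom verdict"

print(review_pr({"files": ["api/handlers.py", "tests/test_api.py"]}))
```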
If you do want a numeric scale, ask for a binary (e.g. true / false) and read the log probs.
(disclaimer: I work at Falconer)
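For example, a minimal sketch using the OpenAI chat completions API; the model name and the question are placeholders:

```python
import math
from openai import OpenAI

client = OpenAI()

def binary_confidence(question: str) -> float:
    """Ask for a strict true/false answer and convert the token
    log probs into a probability that the answer is 'true'."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": "Answer with exactly one word: true or false."},
            {"role": "user", "content": question},
        ],
        max_tokens=1,
        logprobs=True,
        top_logprobs=5,
    )
    # Top candidate tokens (with log probs) for the single answer token.
    top = resp.choices[0].logprobs.content[0].top_logprobs
    # Sum the probability mass assigned to 'true'-like tokens.
    return sum(math.exp(t.logprob) for t in top if t.token.strip().lower() == "true")

print(binary_confidence("Does this PR change behavior that the docs describe?"))
```

The nice part is you get a continuous score out of a binary question without ever asking the model to invent a number.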
You would think so! But that's only optimal if the model already has all the information in recent context to make an optimally informed decision.
In practice, this is a neat context-engineering trick: the different LLM calls in the "courtroom" have different context and can contribute independent bits of reasoning to the overall "case".
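A rough sketch of that kind of context partitioning; the role names and context slices here are hypothetical, just to show the idea of each call seeing different evidence:

```python
# Hypothetical: each courtroom role sees only its own slice of context,
# so its argument is an independent piece of reasoning rather than an echo.
ROLE_CONTEXT = {
    "prosecutor": ["pr_diff", "doc_sections_touched"],       # argues the doc is now stale
    "defense":    ["pr_description", "recent_doc_commits"],  # argues the doc still holds
    "jury":       ["prosecutor_argument", "defense_argument"],
    "judge":      ["jury_notes", "style_guide"],
}

def build_prompt(role: str, evidence: dict[str, str]) -> str:
    # Only pass the evidence keys this role is entitled to see.
    visible = {k: evidence[k] for k in ROLE_CONTEXT[role] if k in evidence}
    header = f"You are the {role} in a documentation review."
    body = "\n\n".join(f"## {k}\n{v}" for k, v in visible.items())
    return f"{header}\n\n{body}"

# Each role then gets its own LLM call with only its slice of the evidence.
evidence = {"pr_diff": "...", "pr_description": "...", "recent_doc_commits": "..."}
print(build_prompt("defense", evidence))
```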
That's the thing with documentation; there are hardly any situations where a simple true/false works. Product decisions have many caveats and evolving behaviors coming from different people. At that point, a numerical grading format isn't something we even want — we want reasoning, not ratings.