Defence attorney: "Judge, I object"
Judge: "On what grounds?"
Defence attorney: "On whichever grounds you find most compelling"
Judge: "I have sustained your objection based on speculation..."
This post could be an entire political campaign against AI and its danger to humankind and the jobs of BILLIONS
Quick summary of how dumb and dangerous generative AI can be.
How so? Care to elaborate?
Is the LLM an expensive way to solve this? Would a more predictive model type be better? Then the LLM summarizes the PR and the model predicts the likelihood of needing to update the doc?
Does using an LLM help avoid the cost of training a more specific model?
An LLM does not understand what "user harm" is. This doesn't work.
Well, it's all about linguistic relativism, right? If you can define "user harm" in terms of things it does understand, I think you could get something that works
We kept asking LLMs to rate things on 1-10 scales and getting inconsistent results. Turns out they're much better at arguing positions than assigning numbers, which makes sense given their training data. The courtroom structure (prosecution, defense, jury, judge) gave us adversarial checks we couldn't get from a single prompt. Curious if anyone has experimented with other domain-specific frameworks to scaffold LLM reasoning.
Experimented very briefly with a mediation (as opposed to a litigation) framework but it was pre-LLM and it was just a coding/learning experience: https://github.com/dvelton/hotseat-mediator
Cool write-up of your experiment, thanks for sharing. Would be interesting to see how results from one framework (mediation, whose goal is "resolution") differ from the other (litigation, whose goal is, basically, "truth/justice").
That's really cool! That's actually the standpoint we started with. We asked what a collaborative reconciliation of document updates would look like. However, the LLMs seemed to get "swayed" or show "bias" very easily. This brought up the point about needing an adversarial element. Even then, context engineering is your best friend.
You kind of have to fine-tune the objectives for each persona and how much context each is entitled to; that ensures an objective court proceeding where debates in both directions carry equal weight!
I love your point about incentivization. That seems to be a make-or-break element for a reasoning framework such as this.
The reasoning gains make sense but I am wondering about the production economics. Running four distinct agent roles per update seems like a huge multiplier on latency and token spend. Does the claimed efficiency actually offset the aggregate cost of the adversarial steps? Hard to see how the margins work out if you are quadrupling inference for every document change.
The funnel is the answer to this. We're not running four agents on every PR: 65% are filtered before review even begins, and 95% of flagged PRs never reach the courtroom. This is because we do think there's some value in a single agent's judgment, and the prosecutor gets to choose when to file charges and when not to.
Only ~1-2% of PRs trigger the full adversarial pipeline. The courtroom is the expensive last mile, deliberately reserved for ambiguous cases where the cost of being wrong far exceeds the cost of a few extra inference calls. Plus you can make token/model-based optimizations for the extra calls in the argumentation system.
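To make the shape of that funnel concrete, here's a rough sketch; every name, threshold, and check below is made up for illustration and isn't the actual pipeline:

```python
import random
from dataclasses import dataclass

@dataclass
class Charge:
    needs_update: bool
    confidence: float

# --- stand-ins for the real checks, just so the sketch runs ---
def touches_documented_surface(pr: dict) -> bool:
    # Stage 1 stand-in: cheap heuristics like path globs / diff size.
    return any(p.endswith((".py", ".md", ".proto")) for p in pr["files"])

def prosecutor_review(pr: dict) -> Charge:
    # Stage 2 stand-in: a single-agent judgment call.
    return Charge(needs_update=True, confidence=random.random())

def run_courtroom(pr: dict, charge: Charge) -> bool:
    # Stage 3 stand-in: prosecution/defense/jury/judge calls would go here.
    return charge.needs_update

def review_pr(pr: dict) -> tuple[bool, str]:
    # Stage 1: cheap filter; most PRs stop here.
    if not touches_documented_surface(pr):
        return False, "filtered before review"
    # Stage 2: single-agent judgment; the prosecutor decides whether to
    # "file charges". Most flagged PRs never go further.
    charge = prosecutor_review(pr)
    if charge.confidence >= 0.9:
        return charge.needs_update, "single-agent verdict"
    # Stage 3: the expensive adversarial courtroom, reserved for the
    # small ambiguous tail where being wrong costs more than extra inference.
    return run_courtroom(pr, charge), "courtroom verdict"

print(review_pr({"files": ["api/handlers.py", "tests/test_api.py"]}))
```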
If you do want a numeric scale, ask for a binary (e.g. true / false) and read the log probs.
(disclaimer: I work at Falconer)
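For example, a minimal sketch using the OpenAI chat completions API; the model name and the question are placeholders:

```python
import math
from openai import OpenAI

client = OpenAI()

def binary_confidence(question: str) -> float:
    """Ask for a strict true/false answer and convert the token
    log probs into a probability that the answer is 'true'."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": "Answer with exactly one word: true or false."},
            {"role": "user", "content": question},
        ],
        max_tokens=1,
        logprobs=True,
        top_logprobs=5,
    )
    # Top candidate tokens (with log probs) for the single answer token.
    top = resp.choices[0].logprobs.content[0].top_logprobs
    # Sum the probability mass assigned to 'true'-like tokens.
    return sum(math.exp(t.logprob) for t in top if t.token.strip().lower() == "true")

print(binary_confidence("Does this PR change behavior that the docs describe?"))
```

The nice part is you get a continuous score out of a binary question without ever asking the model to invent a number.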
You would think so! But that's only optimal if the model already has all the information in recent context to make an optimally informed decision.
In practice, this is a neat context-engineering trick: the different LLM calls in the "courtroom" have different context and can contribute independent bits of reasoning to the overall "case".
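A rough sketch of that kind of context partitioning; the role names and context slices here are hypothetical, just to show the idea of each call seeing different evidence:

```python
# Hypothetical: each courtroom role sees only its own slice of context,
# so its argument is an independent piece of reasoning rather than an echo.
ROLE_CONTEXT = {
    "prosecutor": ["pr_diff", "doc_sections_touched"],       # argues the doc is now stale
    "defense":    ["pr_description", "recent_doc_commits"],  # argues the doc still holds
    "jury":       ["prosecutor_argument", "defense_argument"],
    "judge":      ["jury_notes", "style_guide"],
}

def build_prompt(role: str, evidence: dict[str, str]) -> str:
    # Only pass the evidence keys this role is entitled to see.
    visible = {k: evidence[k] for k in ROLE_CONTEXT[role] if k in evidence}
    header = f"You are the {role} in a documentation review."
    body = "\n\n".join(f"## {k}\n{v}" for k, v in visible.items())
    return f"{header}\n\n{body}"

# Each role then gets its own LLM call with only its slice of the evidence.
evidence = {"pr_diff": "...", "pr_description": "...", "recent_doc_commits": "..."}
print(build_prompt("defense", evidence))
```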
That's the thing with documentation; there are hardly any situations where a simple true/false works. Product decisions have many caveats and evolving behaviors coming from different people. At that point, a numerical grading format isn't something we even want — we want reasoning, not ratings.