None of these tools perform particularly well and all lack context to actually provide a meaningful review beyond what a linter would find, IMO. The SOTA isn't capable of using a code diff as a jumping off point.
Also the system prompts for some of them are kinda funny in a hopelessly naive aspirational way. We should all aspire to live and breathe the code review system prompt on a daily basis.
I would argue they go far beyond linters now, which was perhaps not true even nine months ago.
To the degree you consider this to be evidence, in the last 7 days, the authors of a PR has replied to a Greptile comment with "great catch", "good catch", etc. 9,078 times.
I fully agree. Claude’s review comments have been 50% useful, which is great. For comparison I have almost never found a useful TeamScale comment (classic static analyzer). Even more important, half of Claude’s good finds are orthogonal to those found by other human reviewers on our team. I.e. it points out things human reviewers miss consistently and v.v.
TBH that sounds like TeamScale just has too verbose default settings. On the other hand, people generally find almost all of the lints in Clippy's [1] default set useful, but if you enable "pedantic" lints, the signal-to-noise ratio starts getting worse – those generally require a more fine-grained setup, disabling and enabling individual lints to suit your needs.
People more often say that to save face by implying the issue you identified would be reasonable for the author to miss because it's subtle or tricky or whatever. It's often a proxy for embarrassment
> To the degree you consider this to be evidence, in the last 7 days, the authors of a PR has replied to a Greptile comment with "great catch", "good catch", etc. 9,078 times.
I mean how far Rusts own clippy lint went before any LLMs was actually insane.
Clippy + Rusts type system would basically ensure my software was working as close as possible to my spec before the first run. LLMs have greatly reduced the bar for bringing clippy quality linting to every language but at the cost of determinism.
Not trying to sidetrack, but a figure like that is data, not evidence. At the very minimum you need context which allows for interpretation; 9,078 positive author comments would be less impressive if Greptile made 1,000,000 comments in that time period, for example.
Where one of the filter calls is unnecessary... and it caught that across a call boundary.
So, I'd say that AI code reviews are better than a linter. There's still things that it fusses about because it doesn't know the full context of the application and the tables that make certain guarantees about the data, or code conventions for the team (in particular the use of internal terms within naming conventions).
I had a similar review by AI except my equivalent of setSomeData was stateful and needed to be there in both places, the AI just didn't understand any of it.
Then again, I have a rough idea on how I could implement this check with some (language-dependent) accuracy in a linter. With LLM's I... just hope and pray?
Opus 4.5 catches all sorts of things a linter would not, and with little manual prompting at that. Missing DB indexes, forgotten migration scenarios, inconsistencies with similar services, an overlooked edge case.
Now I'm getting a robot to review the branch at regular intervals and poking holes in my thinking. The trick is not to use an LLM as a confirmation machine.
It doesn't replace a human reviewer.
I don't see the point of paying for yet another CI integration doing LLM code review.
AI code review to me is similar to AI code itself. It's good (and constantly getting better) at dealing with mundane things, like - is the list reversed correctly? Are you dealing with pointers correctly? Do you have off by 1 issues?
Where they suck is high level problems like - is the code actually solving the business problem? Is it using right dependencies? Does it fit into broader design?
Which is expected for me and great help. I'm more happy as a human to spend less time checking if you're managing lifecycle of the pointer correctly and focus on ensuring that code is there to do what it needs to do.
I don't know that I fully agree with that. I use Copilot for AI code review - just because it's built in to GitHub and it's easy - and I'd say results are variable, but overall decent.
Like anything else AI you need to understand what you're doing, so you need to understand your code and the structure of your application or service or whatever because there are times it will say something that's just completely wide of the mark, or even the polar opposite of what's actually the case. And so you just ignore the crap and close the conversation in those situations.
At the same time, it does catch a lot of bugs and problems that fall into classes where more traditional linters really miss the mark. It can help fill holes in automated testing, spot security issues, etc., and it'll raise PRs for fixes that are generally decent. Sometimes not but, again, in these cases you just close them and move on.
I'd certainly say that an AI code review is better than no code review at all, so it's good for a startup where you might be the only developer or where there are only one or two of you and you don't cross over that much.
But the point I actually wanted to get to is this: I use Copilot because it's available as part of my GitHub subscription. Is it the best? I don't know. Does it add value with zero integration cost to me? Yes. And that, I suspect, is going to make it the default AI code review option for many GitHub subscribers.
That does leave me wondering how much of a future there is for AI code review as a product or service outside of the hosting platforms like GitHub and Gitlab, and I have to imagine that an absolutely savage consolidation is coming.
I installed CodeRabbit for our reviews in GitLab and am pretty happy with the results, especially considering the low price ($15/user/mo I think).
It regularly finds problems, including subtle but important problems that human reviewers struggle to find. And it can make pretty good suggestions for fixes.
It also regularly complains about things that are possible in theory but impossible in practice, so we've gotten used to just resolving those comments without any action. Maybe if we used types more effectively it would do that less.
We pay a lot more attention to what CodeRabbit says than what DeepSource said when use used it.
GH Copilot is definitely far better than just a linter. I don't have examples to hand but one thing that's stood out to me is its use of context outside the changes in the diff. It'll pull in context that typically isn't visible in the PR itself, the sort of things that only someone experienced in the code base with good recall would connect the dots on (e.g. this doesn't conform to typical patterns, or a version of this is already encapsulated in reusable code, or there's an existing constant that could be used here instead of the hardcoded value you have).
> The SOTA isn't capable of using a code diff as a jumping off point.
Not a jumping off point, but I'm having pretty great results on a complicated fork on a big project with a `git diff main..fork > main.diff`, then load in the specs I keep, and tell it to review the diff in chunks while updating a ./review.md
It's solving a problem I created myself by not reviewing some commits well enough, but it's surprisingly effective at picking up interactions spread out over multiple commits that might have slipped through regardless.
Anecdotally, Claude Bug Bot has actually been super impressive in understanding non trivial changes. Like, today, it noted a race condition in a ~1000 line go change that go test -race didnt pick up. There are definitely issues though. For one, it's non deterministic, so you end up with half a dozen commits, with each run noting different issues. For a second, it tends to be quite in favour of premature optimisation. But over all, well worth it in my experience
Problem with Code Review is it is quite straightforward to just prompt it, and the frontier models, whether Opus or GPT5.2Codex do a great job at code-reviews. I don't need second subscription or API call when the first one i already have and focus on integration works well out of the box.
In our case, agentastic.dev, we just baked the code-review right into our IDE. It just packages the diff for the agent, with some prompt, and sends it out to different agent choice (whether claude, codex) in parallel. The reason our users like it so much is because they don't need to pay extra for code-review anymore. Hard to beat free add-on, and cherry on top is you don't need to read a freaking poems.
I've also noticed this explosion of code review tools and felt that there's some misplaced focus going on for companies.
Two that stood out to me are Sentry and Vercel. Both have released code review tools recently and both feel misplaced. I can definitely see why they thought they could expand with that type of product offering but I just don't see a benefit over their competition. We have GH copilot natively available on all our PRs, it does a great job, integrates very well with the PR comment system, and is cheap (free with our current usage patterns). GH and other source control services are well placed to have first-class code review functionality baked into their PR tooling.
It's not really clear to me what Sentry/Vercel are offering beyond what copilot does and in my brief testing of them didn't see noticeable difference in quality or DX. Feels like they're fighting an uphill battle from day one with the product choice and are ultimately limited on DX by how deeply GH and other source control service allow them to integrate.
What I would love to see from Vercel, which they feel very well placed to offer, is AI powered QA. They already control the preview environments being deployed to for each PR, they have a feedback system in place with their Vercel toolbar comments, so they "just" need to tie those together with an agentic QA system. A much loftier goal of course but a differentiator and something I'm sure a lot of teams would pay top dollar for if it works well.
Greptile is a great product and I hope you succeed.
However, I disagree that independence is a competitive advantage. If it’s true that having a “firewall” between the coding agent and review agent leads to better code, I don’t see why a company like Cursor can’t create full independence between their coding and review products but still bundle them together for distribution.
Furthermore, there might well be benefits to not being fully independent. Imagine if an external auditor was brought in to review every decision made inside your company. There would likely be many things they simply don’t understand. Many decisions in code might seem irrational to an external standalone entity but make sense in the broader context of the organization’s goals. In this sense, I’m concerned that fully independent code review might miss the forest for the trees relative to a bundled product.
Again, I’m rooting for you guys. But I think this is food for thought.
It is, but when a model/harness/tools/system prompts are the same/similar in the generator and reviewer fail in similar ways. Question: Would you trust a Cursor review of Claude-written code more, less, or the same as a Cursor review of Cursor-written code?
> Autonomy
Plenty of tools have invested heavily in AI-assisted review - creating great UIs to help human reviewers understand and check diffs. Our view is that code validation will be completely autonomous in the medium term, and so our system is designed to make all human intervention optional. This is possibly a unpopular opinion, and we respect the camp that might say people will always review AI-generated code. It's just not the future we want for this profession, nor the one we predict.
> Loops
You can invest in UX and tooling that makes this easier or harder. Our first step towards making this easier is a native Claude Code plugin in the `/plugins` command that let's Claude code do a plan, write, commit, get review comments, plan, write loop.
Independence is ridiculous - the underlying llm models are too similar on their training days and methodologies to be anything like independent. Trying different models may somewhat reduce the dependency, but all have read stack overflow, Reddit, and GitHub in their training.
It might be an interesting time to double down on automatically building and checking deterministic models of code which were previously too much of a pain to bother with. Eg, adding type checking to lazy python code. These types of checks really are model independent, and using agents to build and manage them might bring a lot of value.
> It is, but when a model/harness/tools/system prompts are the same/similar in the generator and reviewer fail in similar ways.
Is there empirical evidence for that? Where is it on an epistemic meter between (1) “it sounds good when I say it”, and (10) “someone ran evaluation and got significant support.”
“Vibes” (2/3 on scale) are ok, just honestly curious.
> Today's agents are better than the median human code reviewer
"...at catching issues and enforcing standards, and they're only getting better".
I took this to mean what good code review is is subjective. But if you clearly define standards and patterns for your code, your linter/automated tools/ AI code reviewer will always catch more than humans.
We used Greptile where I work and it was so bad we decided to switch to Claude. And even Claude isn’t nearly as good at reviewing as an experienced programmer with domain knowledge.
My experience is that Claude or others are good at pointing out things I will want to look at and then I can go review more thoroughly. So it's helped to some degree.
But like everything else with it, it tries to do too much.
What I want is a review "wizard" agent -- something that identifies the pieces I should look at, and takes me through them diff by diff asking me to read them, while offering its commentary ("this appears to be XX....") and letting me make my own.
This article has a catchy headline, but there's really no content to it. This is content marketing without content. It seems like every week on Hacker News, there's a dozen of these. All seemingly code reviewers, too. Keep it to LinkedIn.
My company just finished a several week review period of Greptile. Devs were split over the usefulness of the tool (compared to our current solution, Cursor). While Greptile did occasionally offer better insights than Cursor, it also exhibited strange behavior such as entirely overwriting PR descriptions with its own text and occasionally arguing with itself in the comments. In the end we decided to NOT purchase Greptile as there were enough "not quite there" issues that made it more trouble than worthwhile. I am certain, though, that the Greptile team will resolve all those problems and I wish them the best of luck!
Good code reviews are part of team's culture and it's hard to just patch it with an agent. With millions of tools it will be arms race between which one is louder about as many things as possible because:
- it will have higher chance at convincing the author that the issue was important by throwing more darts - something that a human wouldn't do because it takes real mental effort to go through an authentic review,
- it will sometimes find real big issue which reinforces the bias that it's useful
- there will always be tendency towards more feedback (not higher quality) because if it's too silent, is it even doing anything?
So I believe it will just add more round of back and forth of prompting between more people, but not sure if net positive
Plus PRs are a good reality check if your code makes sense, when another person reviews it. A final safeguard before maintainability miss, or a disaster waiting to be deployed.
Why not let AI write the code and then have it reviewed by humans? If you use AI to review my code, then you can't stop me from using another AI to refute it: this only foreshadows the beginning of internal friction.
Maybe I'm buying into the cool-aid, but I actually really liked the self-aware tone of this post.
> Based on our benchmarks, we are uniquely good at catching bugs. However, if all company blogs are to be trusted, this is something we have in common with every other AI code review product. One just has to try a few, and pick the one that feels the best.
I liked that the post is self-aware that it's promoting its own product. But the writing seemed more focus on the philosophy behind code reviews and the impact of AI, and less on the mechanics of how greptile differs from competitors. I was hoping to see more on the latter.
"While some other products have built out great UIs for humans to review code in an AI-assisted paradigm, we have chosen to build for what we consider to be an inevitable future - one where code validation requires vanishingly little human participation."
Ok good, now I know not to bother reading through any of their marketing literature, because while the product at first interested me, now I know it's exactly not what I want for my team.
The actual "bubble" we have right now is a situation where people can produce and publish code they don't understand, and where engineers working on a system no longer are forced to reckon with and learn the intricacies of their system, and even senior engineers don't gain literacy into the very thing they're working on, and so are somewhat powerless to assess quality and deal with crisis when it hits.
The agentic coding tools and review tools I want my team (and myself) to have access to are ones that ones that force an explicit knowledge interview & acquisition process during authoring and involve the engineer more intricately in the whole flow.
What we got instead with claude code & friends is a thing way too eager to take over the whole thing. And while it can produce some good results it doesn't produce understandable systems.
To be clear, it's been a long time since writing code has been the hard part of the job? in many many domains. The hard part is systems & architecture and while these tools can help with that, there's nothing more potentially terrifying pthan a team full of people who have agentically produced a codebase that they cannot holistically understand the nuances of.
So, yeah, I want review tools for that scenario. Since these people have marketed themselves off the table... what is out there?
Contrary to some of the other anecdotes in this thread, I've found automated code review to discover some tricky stuff that humans missed. We use https://www.cubic.dev/
Before I push any code, I always ask 2 different frontier LLMs to review the changes for any potential issues. Saved my ass a few times before pushing to production.
>A human rubber-stamping code being validated by a super intelligent machine is the equivalent of a human sitting silently in the driver's seat of a self-driving car, "supervising".
So, absolutely necessary and essential?
In order to get the machine out of trouble when the unavoidable strange situation happens that didn't appear during training, and requires some judgement based on ethics or logical reasoning. For that case, you need a human in charge.
It's not terribly hard to write a Copilot GHA that does this yourself for your specific teams needs. Not sure why you'd been to bring a vendor on for this....
What do the vendors provide?
I looked at a couple which were pretty snazzy at first glance, but now that I know more about how copilot agents work and such, I'm pretty sure in a few hours, I could have the foundation for my team to build on that would take care of a lot of our PR review needs....
> Only once would you have X write a PR, then have X approve and merge it to realize the absurdity of what you just did.
I get the idea. I'll still throw out that having a single X go through the full workflow could still be useful in that there's an audit log, undo features (reverting a PR), notifications what have you. It's not equivalent to "human writes ticket, code deployed live" for that reason
My experience with code review tools has been dreadful. In most cases I can remember the reviews are inaccurate, "you are absolutely right" sycophantic garbage, or missing the big picture. The worst feature of all is the "PR summary" which is usually pure slop lacking the context around why a PR was made. Thankfully that can be turned off.
I have to be fair and say that yes, occasionally, some bug slips past the humans and is caught by the robot. But these bugs are usually also caught by automated unit/integration tests or by linters. All in all, you have to balance the occasional bug with all the time lost "reviewing the code review" to make sure the robot didn't just hallucinate something.
1. I absolutely agree there's a bubble. Everybody is shipping a code review agent.
2. What on earth is this defense of their product? I could see so many arguments for why their code reviewer is the best, and this contains none of them.
More broadly, though, if you've gotten to the point where you're relying on AI code review to catch bugs, you've lost the plot.
The point of a PR is to share knowledge and to catch structural gaps. Bug-finding is a bonus. Catching bugs, automated self-review, structuring your code to be sensible: that's _your_ job. Write the code to be as sensible as possible, either by yourself or with an AI. Get the review because you work on a team, not in a vacuum.
2. There is plenty of evidence for this elsewhere on the site, and we do encourage people to try it because like with a lot of AI tools, YMMV.
You're totally right that PR reviews go a lot farther than catching issues and enforcing standard. Knowledge sharing is a very important part of it. However, there are processes you can create to enable better knowledge sharing and let AI handle the issue-catching (maybe not fully yet, but in time). Blocking code from merging because knowledge isn't shared yet seems unnecessary.
> 2. What on earth is this defense of their product?
i think the distribution channel is the only defensive moat in low-to-mid-complexity fast-to-implement features like code-review agents. So in case of linear and cursor-bugbot it make a lot of sense. I wonder when Github/Gitlab/Atlassian or Xcode will release their own review agent.
> More broadly, though, if you've gotten to the point where you're relying on AI code review to catch bugs, you've lost the plot.
> The point of a PR is to share knowledge and to catch structural gaps.
Well, it was to share knowledge and to catch structural gaps.
Now you have an idea, for better or for worse, that software needs to be developed AI-first. That's great for the creation of new code but as we all know, it's almost guaranteed that you'll get some bad output from the AI that you used to generate the code, and since it can generate code very fast, you have a lot of it to go through, especially if you're working on a monorepo that wasn't architected particularly well when it was written years ago.
PRs seem like an almost natural place to do this. The alternative is the industry finding a more appropriate place to do this sort of thing in the SDLC, which is gonna take time, seeing as how agentic loop software development is so new.
Reminder that this comes from from the founder that got rightly lambasted for his comments about work life balance and then doubled down when called out.
No shit. What is the point of using an llm model to review code produced by an llm model?
Code review pressupose a different perspective, which no platform can offer at the moment because they are just as sophisticated as the model they wrap.
Claude generated the code, and Claude was asked if the code was good enough, and now you want to be in the middle to ask Claude again but with more emphasis, I guess?
If I want more emphasis I can ask Claude myself. Or Qwen.
I can't even begin to understand this rationale.
None of these tools perform particularly well and all lack context to actually provide a meaningful review beyond what a linter would find, IMO. The SOTA isn't capable of using a code diff as a jumping off point.
Also the system prompts for some of them are kinda funny in a hopelessly naive aspirational way. We should all aspire to live and breathe the code review system prompt on a daily basis.
I agree that none perform _super_ well.
I would argue they go far beyond linters now, which was perhaps not true even nine months ago.
To the degree you consider this to be evidence, in the last 7 days, the authors of a PR has replied to a Greptile comment with "great catch", "good catch", etc. 9,078 times.
I fully agree. Claude’s review comments have been 50% useful, which is great. For comparison I have almost never found a useful TeamScale comment (classic static analyzer). Even more important, half of Claude’s good finds are orthogonal to those found by other human reviewers on our team. I.e. it points out things human reviewers miss consistently and v.v.
TBH that sounds like TeamScale just has too verbose default settings. On the other hand, people generally find almost all of the lints in Clippy's [1] default set useful, but if you enable "pedantic" lints, the signal-to-noise ratio starts getting worse – those generally require a more fine-grained setup, disabling and enabling individual lints to suit your needs.
[1] https://doc.rust-lang.org/stable/clippy/
People more often say that to save face by implying the issue you identified would be reasonable for the author to miss because it's subtle or tricky or whatever. It's often a proxy for embarrassment
> To the degree you consider this to be evidence, in the last 7 days, the authors of a PR has replied to a Greptile comment with "great catch", "good catch", etc. 9,078 times.
do you have a bot to do this too?
I mean how far Rusts own clippy lint went before any LLMs was actually insane.
Clippy + Rusts type system would basically ensure my software was working as close as possible to my spec before the first run. LLMs have greatly reduced the bar for bringing clippy quality linting to every language but at the cost of determinism.
That sounds more like confirmation that greptile is being included in a lot agentic coding loops than anything
Not trying to sidetrack, but a figure like that is data, not evidence. At the very minimum you need context which allows for interpretation; 9,078 positive author comments would be less impressive if Greptile made 1,000,000 comments in that time period, for example.
over 7 days does contextualize it some, though.
9,078 comments / 7 (days) / 8 (hours) = 162.107 though, so if human that person is making 162 comments an hour, 8 hours a day, 7 days a week?
In some code that I was working on, I had
The 'something' was a little bit more complex, but it was the same something with slightly different formatting.My linter didn't catch the repeat call. When asking the AI chat for a review of the code changes it did correctly flag that there was a repeat call.
It also caught a repeat call in
Where one of the filter calls is unnecessary... and it caught that across a call boundary.So, I'd say that AI code reviews are better than a linter. There's still things that it fusses about because it doesn't know the full context of the application and the tables that make certain guarantees about the data, or code conventions for the team (in particular the use of internal terms within naming conventions).
Why isn’t `obj` immutable?
I had a similar review by AI except my equivalent of setSomeData was stateful and needed to be there in both places, the AI just didn't understand any of it.
When this happens to me it makes me question my design.
If the AI doesn’t understand it, chances are it’s counter-intuitive. Of course not all LLM’s are equal, etc, etc.
Then again, I have a rough idea on how I could implement this check with some (language-dependent) accuracy in a linter. With LLM's I... just hope and pray?
I'd say I see one anecdote, nothing to draw conclusions from.
Unit tests catch that kind of stuff
youre verifying std lib function call counts in unit tests? lmao.
Opus 4.5 catches all sorts of things a linter would not, and with little manual prompting at that. Missing DB indexes, forgotten migration scenarios, inconsistencies with similar services, an overlooked edge case.
Now I'm getting a robot to review the branch at regular intervals and poking holes in my thinking. The trick is not to use an LLM as a confirmation machine.
It doesn't replace a human reviewer.
I don't see the point of paying for yet another CI integration doing LLM code review.
You’ve found the smoking gun!
AI code review to me is similar to AI code itself. It's good (and constantly getting better) at dealing with mundane things, like - is the list reversed correctly? Are you dealing with pointers correctly? Do you have off by 1 issues?
Where they suck is high level problems like - is the code actually solving the business problem? Is it using right dependencies? Does it fit into broader design?
Which is expected for me and great help. I'm more happy as a human to spend less time checking if you're managing lifecycle of the pointer correctly and focus on ensuring that code is there to do what it needs to do.
I don't know that I fully agree with that. I use Copilot for AI code review - just because it's built in to GitHub and it's easy - and I'd say results are variable, but overall decent.
Like anything else AI you need to understand what you're doing, so you need to understand your code and the structure of your application or service or whatever because there are times it will say something that's just completely wide of the mark, or even the polar opposite of what's actually the case. And so you just ignore the crap and close the conversation in those situations.
At the same time, it does catch a lot of bugs and problems that fall into classes where more traditional linters really miss the mark. It can help fill holes in automated testing, spot security issues, etc., and it'll raise PRs for fixes that are generally decent. Sometimes not but, again, in these cases you just close them and move on.
I'd certainly say that an AI code review is better than no code review at all, so it's good for a startup where you might be the only developer or where there are only one or two of you and you don't cross over that much.
But the point I actually wanted to get to is this: I use Copilot because it's available as part of my GitHub subscription. Is it the best? I don't know. Does it add value with zero integration cost to me? Yes. And that, I suspect, is going to make it the default AI code review option for many GitHub subscribers.
That does leave me wondering how much of a future there is for AI code review as a product or service outside of the hosting platforms like GitHub and Gitlab, and I have to imagine that an absolutely savage consolidation is coming.
I installed CodeRabbit for our reviews in GitLab and am pretty happy with the results, especially considering the low price ($15/user/mo I think).
It regularly finds problems, including subtle but important problems that human reviewers struggle to find. And it can make pretty good suggestions for fixes.
It also regularly complains about things that are possible in theory but impossible in practice, so we've gotten used to just resolving those comments without any action. Maybe if we used types more effectively it would do that less.
We pay a lot more attention to what CodeRabbit says than what DeepSource said when use used it.
GH Copilot is definitely far better than just a linter. I don't have examples to hand but one thing that's stood out to me is its use of context outside the changes in the diff. It'll pull in context that typically isn't visible in the PR itself, the sort of things that only someone experienced in the code base with good recall would connect the dots on (e.g. this doesn't conform to typical patterns, or a version of this is already encapsulated in reusable code, or there's an existing constant that could be used here instead of the hardcoded value you have).
> The SOTA isn't capable of using a code diff as a jumping off point.
Not a jumping off point, but I'm having pretty great results on a complicated fork on a big project with a `git diff main..fork > main.diff`, then load in the specs I keep, and tell it to review the diff in chunks while updating a ./review.md
It's solving a problem I created myself by not reviewing some commits well enough, but it's surprisingly effective at picking up interactions spread out over multiple commits that might have slipped through regardless.
Anecdotally, Claude Bug Bot has actually been super impressive in understanding non trivial changes. Like, today, it noted a race condition in a ~1000 line go change that go test -race didnt pick up. There are definitely issues though. For one, it's non deterministic, so you end up with half a dozen commits, with each run noting different issues. For a second, it tends to be quite in favour of premature optimisation. But over all, well worth it in my experience
Problem with Code Review is it is quite straightforward to just prompt it, and the frontier models, whether Opus or GPT5.2Codex do a great job at code-reviews. I don't need second subscription or API call when the first one i already have and focus on integration works well out of the box.
In our case, agentastic.dev, we just baked the code-review right into our IDE. It just packages the diff for the agent, with some prompt, and sends it out to different agent choice (whether claude, codex) in parallel. The reason our users like it so much is because they don't need to pay extra for code-review anymore. Hard to beat free add-on, and cherry on top is you don't need to read a freaking poems.
I've also noticed this explosion of code review tools and felt that there's some misplaced focus going on for companies.
Two that stood out to me are Sentry and Vercel. Both have released code review tools recently and both feel misplaced. I can definitely see why they thought they could expand with that type of product offering but I just don't see a benefit over their competition. We have GH copilot natively available on all our PRs, it does a great job, integrates very well with the PR comment system, and is cheap (free with our current usage patterns). GH and other source control services are well placed to have first-class code review functionality baked into their PR tooling.
It's not really clear to me what Sentry/Vercel are offering beyond what copilot does and in my brief testing of them didn't see noticeable difference in quality or DX. Feels like they're fighting an uphill battle from day one with the product choice and are ultimately limited on DX by how deeply GH and other source control service allow them to integrate.
What I would love to see from Vercel, which they feel very well placed to offer, is AI powered QA. They already control the preview environments being deployed to for each PR, they have a feedback system in place with their Vercel toolbar comments, so they "just" need to tie those together with an agentic QA system. A much loftier goal of course but a differentiator and something I'm sure a lot of teams would pay top dollar for if it works well.
Greptile is a great product and I hope you succeed.
However, I disagree that independence is a competitive advantage. If it’s true that having a “firewall” between the coding agent and review agent leads to better code, I don’t see why a company like Cursor can’t create full independence between their coding and review products but still bundle them together for distribution.
Furthermore, there might well be benefits to not being fully independent. Imagine if an external auditor was brought in to review every decision made inside your company. There would likely be many things they simply don’t understand. Many decisions in code might seem irrational to an external standalone entity but make sense in the broader context of the organization’s goals. In this sense, I’m concerned that fully independent code review might miss the forest for the trees relative to a bundled product.
Again, I’m rooting for you guys. But I think this is food for thought.
I don't really understand how this differentiates against the competition.
> Independence
Any "agent" running against code review instead of code generation is "independent"?
> Autonomy
Most other code review tools can also be automated and integrated.
> Loops
You can also ping other code review tools for more reviews...
I feel like this article actually works against you by presenting the problem and inadequately solving them.
> Independence
It is, but when a model/harness/tools/system prompts are the same/similar in the generator and reviewer fail in similar ways. Question: Would you trust a Cursor review of Claude-written code more, less, or the same as a Cursor review of Cursor-written code?
> Autonomy
Plenty of tools have invested heavily in AI-assisted review - creating great UIs to help human reviewers understand and check diffs. Our view is that code validation will be completely autonomous in the medium term, and so our system is designed to make all human intervention optional. This is possibly a unpopular opinion, and we respect the camp that might say people will always review AI-generated code. It's just not the future we want for this profession, nor the one we predict.
> Loops
You can invest in UX and tooling that makes this easier or harder. Our first step towards making this easier is a native Claude Code plugin in the `/plugins` command that let's Claude code do a plan, write, commit, get review comments, plan, write loop.
Independence is ridiculous - the underlying llm models are too similar on their training days and methodologies to be anything like independent. Trying different models may somewhat reduce the dependency, but all have read stack overflow, Reddit, and GitHub in their training.
It might be an interesting time to double down on automatically building and checking deterministic models of code which were previously too much of a pain to bother with. Eg, adding type checking to lazy python code. These types of checks really are model independent, and using agents to build and manage them might bring a lot of value.
> It is, but when a model/harness/tools/system prompts are the same/similar in the generator and reviewer fail in similar ways.
Is there empirical evidence for that? Where is it on an epistemic meter between (1) “it sounds good when I say it”, and (10) “someone ran evaluation and got significant support.”
“Vibes” (2/3 on scale) are ok, just honestly curious.
> Unfortunately, code review performance is ephemeral and subjective
> Today's agents are better than the median human code reviewer
Which is it? You cannot have it both ways.
> Today's agents are better than the median human code reviewer
"...at catching issues and enforcing standards, and they're only getting better".
I took this to mean what good code review is is subjective. But if you clearly define standards and patterns for your code, your linter/automated tools/ AI code reviewer will always catch more than humans.
We used Greptile where I work and it was so bad we decided to switch to Claude. And even Claude isn’t nearly as good at reviewing as an experienced programmer with domain knowledge.
My experience is that Claude or others are good at pointing out things I will want to look at and then I can go review more thoroughly. So it's helped to some degree.
But like everything else with it, it tries to do too much.
What I want is a review "wizard" agent -- something that identifies the pieces I should look at, and takes me through them diff by diff asking me to read them, while offering its commentary ("this appears to be XX....") and letting me make my own.
This article has a catchy headline, but there's really no content to it. This is content marketing without content. It seems like every week on Hacker News, there's a dozen of these. All seemingly code reviewers, too. Keep it to LinkedIn.
My company just finished a several week review period of Greptile. Devs were split over the usefulness of the tool (compared to our current solution, Cursor). While Greptile did occasionally offer better insights than Cursor, it also exhibited strange behavior such as entirely overwriting PR descriptions with its own text and occasionally arguing with itself in the comments. In the end we decided to NOT purchase Greptile as there were enough "not quite there" issues that made it more trouble than worthwhile. I am certain, though, that the Greptile team will resolve all those problems and I wish them the best of luck!
Good code reviews are part of team's culture and it's hard to just patch it with an agent. With millions of tools it will be arms race between which one is louder about as many things as possible because:
- it will have higher chance at convincing the author that the issue was important by throwing more darts - something that a human wouldn't do because it takes real mental effort to go through an authentic review,
- it will sometimes find real big issue which reinforces the bias that it's useful
- there will always be tendency towards more feedback (not higher quality) because if it's too silent, is it even doing anything?
So I believe it will just add more round of back and forth of prompting between more people, but not sure if net positive
Plus PRs are a good reality check if your code makes sense, when another person reviews it. A final safeguard before maintainability miss, or a disaster waiting to be deployed.
Why not let AI write the code and then have it reviewed by humans? If you use AI to review my code, then you can't stop me from using another AI to refute it: this only foreshadows the beginning of internal friction.
Maybe I'm buying into the cool-aid, but I actually really liked the self-aware tone of this post.
> Based on our benchmarks, we are uniquely good at catching bugs. However, if all company blogs are to be trusted, this is something we have in common with every other AI code review product. One just has to try a few, and pick the one that feels the best.
I liked that the post is self-aware that it's promoting its own product. But the writing seemed more focus on the philosophy behind code reviews and the impact of AI, and less on the mechanics of how greptile differs from competitors. I was hoping to see more on the latter.
Thanks! We go over that on many other pages. Here are some:
https://www.greptile.com/benchmarks https://www.greptile.com/greptile-vs-coderabbit https://www.greptile.com/greptile-vs-bugbot
"While some other products have built out great UIs for humans to review code in an AI-assisted paradigm, we have chosen to build for what we consider to be an inevitable future - one where code validation requires vanishingly little human participation."
Ok good, now I know not to bother reading through any of their marketing literature, because while the product at first interested me, now I know it's exactly not what I want for my team.
The actual "bubble" we have right now is a situation where people can produce and publish code they don't understand, and where engineers working on a system no longer are forced to reckon with and learn the intricacies of their system, and even senior engineers don't gain literacy into the very thing they're working on, and so are somewhat powerless to assess quality and deal with crisis when it hits.
The agentic coding tools and review tools I want my team (and myself) to have access to are ones that ones that force an explicit knowledge interview & acquisition process during authoring and involve the engineer more intricately in the whole flow.
What we got instead with claude code & friends is a thing way too eager to take over the whole thing. And while it can produce some good results it doesn't produce understandable systems.
To be clear, it's been a long time since writing code has been the hard part of the job? in many many domains. The hard part is systems & architecture and while these tools can help with that, there's nothing more potentially terrifying pthan a team full of people who have agentically produced a codebase that they cannot holistically understand the nuances of.
So, yeah, I want review tools for that scenario. Since these people have marketed themselves off the table... what is out there?
Contrary to some of the other anecdotes in this thread, I've found automated code review to discover some tricky stuff that humans missed. We use https://www.cubic.dev/
Before I push any code, I always ask 2 different frontier LLMs to review the changes for any potential issues. Saved my ass a few times before pushing to production.
Founder of cubic here, thanks for the shoutout!
>A human rubber-stamping code being validated by a super intelligent machine is the equivalent of a human sitting silently in the driver's seat of a self-driving car, "supervising".
So, absolutely necessary and essential?
In order to get the machine out of trouble when the unavoidable strange situation happens that didn't appear during training, and requires some judgement based on ethics or logical reasoning. For that case, you need a human in charge.
It's not terribly hard to write a Copilot GHA that does this yourself for your specific teams needs. Not sure why you'd been to bring a vendor on for this....
What do the vendors provide?
I looked at a couple which were pretty snazzy at first glance, but now that I know more about how copilot agents work and such, I'm pretty sure in a few hours, I could have the foundation for my team to build on that would take care of a lot of our PR review needs....
> Only once would you have X write a PR, then have X approve and merge it to realize the absurdity of what you just did.
I get the idea. I'll still throw out that having a single X go through the full workflow could still be useful in that there's an audit log, undo features (reverting a PR), notifications what have you. It's not equivalent to "human writes ticket, code deployed live" for that reason
My experience with code review tools has been dreadful. In most cases I can remember the reviews are inaccurate, "you are absolutely right" sycophantic garbage, or missing the big picture. The worst feature of all is the "PR summary" which is usually pure slop lacking the context around why a PR was made. Thankfully that can be turned off.
I have to be fair and say that yes, occasionally, some bug slips past the humans and is caught by the robot. But these bugs are usually also caught by automated unit/integration tests or by linters. All in all, you have to balance the occasional bug with all the time lost "reviewing the code review" to make sure the robot didn't just hallucinate something.
Claude code's code review is _sufficient_ imo.
still need HITL, but the human is shifted right and can do other things rather than grinding through fiddly details.
As Claude Code (and Opus) improves, Greptile is finding fewer issues in my code reviews.
So far I've been pretty happy with Greptile. Tried Copilot and Cubic.dev but landed on Greptile
1. I absolutely agree there's a bubble. Everybody is shipping a code review agent.
2. What on earth is this defense of their product? I could see so many arguments for why their code reviewer is the best, and this contains none of them.
More broadly, though, if you've gotten to the point where you're relying on AI code review to catch bugs, you've lost the plot.
The point of a PR is to share knowledge and to catch structural gaps. Bug-finding is a bonus. Catching bugs, automated self-review, structuring your code to be sensible: that's _your_ job. Write the code to be as sensible as possible, either by yourself or with an AI. Get the review because you work on a team, not in a vacuum.
2. There is plenty of evidence for this elsewhere on the site, and we do encourage people to try it because like with a lot of AI tools, YMMV.
You're totally right that PR reviews go a lot farther than catching issues and enforcing standard. Knowledge sharing is a very important part of it. However, there are processes you can create to enable better knowledge sharing and let AI handle the issue-catching (maybe not fully yet, but in time). Blocking code from merging because knowledge isn't shared yet seems unnecessary.
> 2. What on earth is this defense of their product?
i think the distribution channel is the only defensive moat in low-to-mid-complexity fast-to-implement features like code-review agents. So in case of linear and cursor-bugbot it make a lot of sense. I wonder when Github/Gitlab/Atlassian or Xcode will release their own review agent.
> More broadly, though, if you've gotten to the point where you're relying on AI code review to catch bugs, you've lost the plot.
> The point of a PR is to share knowledge and to catch structural gaps.
Well, it was to share knowledge and to catch structural gaps.
Now you have an idea, for better or for worse, that software needs to be developed AI-first. That's great for the creation of new code but as we all know, it's almost guaranteed that you'll get some bad output from the AI that you used to generate the code, and since it can generate code very fast, you have a lot of it to go through, especially if you're working on a monorepo that wasn't architected particularly well when it was written years ago.
PRs seem like an almost natural place to do this. The alternative is the industry finding a more appropriate place to do this sort of thing in the SDLC, which is gonna take time, seeing as how agentic loop software development is so new.
Reminder that this comes from from the founder that got rightly lambasted for his comments about work life balance and then doubled down when called out.
There is an AI bubble.
Can drop the extra words
one more ai code review please, I promise it will fix everything this time, please just one more
No shit. What is the point of using an llm model to review code produced by an llm model?
Code review pressupose a different perspective, which no platform can offer at the moment because they are just as sophisticated as the model they wrap. Claude generated the code, and Claude was asked if the code was good enough, and now you want to be in the middle to ask Claude again but with more emphasis, I guess? If I want more emphasis I can ask Claude myself. Or Qwen. I can't even begin to understand this rationale.