
CodeRabbit’s evaluation framework is multi-layered. First, offline metrics: “Is this model as good or directionally better than an existing model at finding issues from a recall precision perspective. Looking at the number of comments [the model] posted, how many were required before it found the same number of bugs?” Signal-to-noise ratio matters: “If it posted fewer comments but found the same number of issues, then we know that the signal to noise ratio has been improved.”
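The comment-count comparison Loker describes can be sketched in a few lines. This is a hypothetical illustration of the idea, not CodeRabbit's actual metric code; the numbers are made up.

```python
# Hypothetical sketch of an offline comparison: two models review the
# same PRs; compare comments posted against real bugs found.

def signal_to_noise(comments_posted: int, bugs_found: int) -> float:
    """Fraction of posted comments that flagged a real issue."""
    return bugs_found / comments_posted if comments_posted else 0.0

baseline = signal_to_noise(comments_posted=120, bugs_found=30)   # 0.25
candidate = signal_to_noise(comments_posted=80, bugs_found=30)   # 0.375

# Same number of bugs with fewer comments => improved signal-to-noise.
improved = candidate > baseline
```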
Next, qualitative review: “What do these comments look like, what’s the tone of them, how many of them are patches?” CodeRabbit even checks for hedging language. Loker notes that Sonnet 4.5, for example, will hedge and say something “might be an issue”; in that case, his team considers whether to address it.
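A hedging check like the one Loker describes can be as simple as a phrase scan. This sketch is illustrative only; the phrase list is an assumption, not CodeRabbit's actual heuristic.

```python
# Hypothetical sketch: flag hedged review comments for human triage.
# The HEDGES list is illustrative, not CodeRabbit's real phrase set.
HEDGES = ("might be", "could be", "may be", "possibly", "perhaps")

def is_hedged(comment: str) -> bool:
    text = comment.lower()
    return any(h in text for h in HEDGES)

is_hedged("This might be an issue with the lock ordering.")  # True
is_hedged("This dereferences a null pointer on line 42.")    # False
```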
Then staged rollout: “We’ll branch out again and start rolling it more slowly. How are people perceiving it? And we’ll be watching. We’re watermarking to understand: Does this model achieve higher acceptance rates? Are people essentially abandoning the whole system as a result of this model?”
The GPT-5 launch was a case study in why this matters. “The expectations were pretty high because the recall rates were really good, and the various other metrics were really good. But ultimately, the latency was like, you know, that particular metric was kind of crazy.” CodeRabbit also found that even though GPT-5’s per-million-token pricing was cheaper than Sonnet 4.5’s, it uses far more thinking tokens, so the actual cost per review runs higher. “So the cost-benefit doesn’t really come into play,” Loker says.
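The thinking-token effect is simple arithmetic. The prices and token counts below are invented for illustration, not actual GPT-5 or Sonnet 4.5 figures:

```python
# Illustrative arithmetic (made-up numbers): a cheaper per-token price
# can still cost more overall if the model emits many more thinking tokens.

def review_cost(price_per_mtok: float, output_toks: int, thinking_toks: int) -> float:
    """Cost of one review: price per million tokens times total billed tokens."""
    return price_per_mtok * (output_toks + thinking_toks) / 1_000_000

cheap_model = review_cost(price_per_mtok=10.0, output_toks=2_000, thinking_toks=20_000)
pricier_model = review_cost(price_per_mtok=15.0, output_toks=2_000, thinking_toks=4_000)

# cheap_model -> 0.22, pricier_model -> 0.09:
# the "cheaper" model ends up more than twice as expensive per review.
```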
The Outcome-Oriented Evaluation of AI Agents framework proposes measuring agents on 11 dimensions, including Goal Completion Rate, Autonomy Index, Multi-Step Task Resilience, and ROI, not just latency and throughput. The researchers’ finding: No single architecture dominates all dimensions. You must profile your use case and measure what matters for your domain.
If you go multi-agent, topology matters
CodeRabbit’s architecture is essentially a coordinated multi-agent system, with different models handling different review stages and a workflow orchestrating their interaction. “There’s right now two agentic loops,” Loker says. “One is before the big review with the heavier reasoning model, and then another one comes out afterward.”
When building multi-agent systems, the coordination topology measurably affects performance. Graph topology (agents communicate freely) outperforms tree, chain, and star (central coordinator) topologies for complex reasoning tasks. Adding an explicit “Plan how to collaborate” step before agents start working improves milestone achievement by +3% (MultiAgentBench). Default to graph for complex tasks. Star is simpler but weaker.
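The topology difference can be made concrete with a toy adjacency model. This is a minimal sketch of the star-vs-graph distinction, not code from MultiAgentBench; the agent names are hypothetical.

```python
# Minimal sketch of two coordination topologies: in a star, every
# message routes through a central coordinator; in a graph, agents
# can exchange messages directly.
from collections import defaultdict

def build_topology(agents: list[str], kind: str) -> dict[str, set[str]]:
    edges: dict[str, set[str]] = defaultdict(set)
    if kind == "star":                      # first agent is the hub
        hub, *spokes = agents
        for s in spokes:
            edges[hub].add(s)
            edges[s].add(hub)
    elif kind == "graph":                   # fully connected
        for a in agents:
            edges[a] = {b for b in agents if b != a}
    return edges

agents = ["planner", "coder", "tester", "reviewer"]
star = build_topology(agents, "star")
graph = build_topology(agents, "graph")

# In the star, "coder" can only reach "tester" via "planner";
# in the graph, it can message "tester" directly.
```

The extra edges are why graph topologies cost more tokens but handle complex reasoning better: no single hub bottlenecks (or garbles) the exchange.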
A checklist for agent builders
Building an agent? Here’s the order of operations, grounded in production experience and peer-reviewed research.
- Figure out what you can evaluate. At a high level, this is the business-value assessment; at a lower level, it’s how fast and how often the workflow solves the problem. It will never be 100%, so treat evaluation as a permanent investment, with continuous rollouts.
- Map the workflow. Identify deterministic steps vs. steps that need judgment. Don’t make the whole thing agentic.
- Engineer your context. Assemble the right information for each step, not everything but not too little. Use structured, itemized context, not narrative blobs. Filter aggressively: Irrelevant context actively degrades performance, and better retrievers surface more dangerous distractors.
- Curate procedural knowledge carefully. Human-written, focused skills help enormously. But don’t let agents write their own playbooks. Keep them to two or three focused modules, and remember that comprehensive documentation hurts more than it helps.
- Choose models per step. Different steps, different models. Smaller/faster where you can, heavier where you must.
- Build tools deliberately. Discovery, selection, invocation, integration. Each stage needs its own error handling.
- Build memory with curation. Don’t just log but augment, structure, and filter what gets stored and retrieved.
- Verify your own output. Separate generation from verification. Use a different model or approach to check the output.
- Design feedback loops. Environmental signals, user feedback, cross-model critique. Design them in from day one.
- If multi-agent, think topology. Graph beats tree beats chain beats star. Plan collaboration explicitly.
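Several of the checklist items can be tied together in one skeleton: deterministic steps stay plain functions, a model call handles the judgment step, and a different model verifies the output before anything ships. Everything here is a hypothetical sketch; `call_model` is a stand-in for whatever provider SDK you use, and the model names are placeholders.

```python
# Hypothetical workflow skeleton (not CodeRabbit's code): map the
# workflow, keep deterministic steps deterministic, and separate
# generation from verification with a different model.

def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError  # plug in your provider SDK here

def extract_changed_functions(diff: str) -> str:
    # Deterministic step: keep only added lines as minimal context,
    # rather than handing the model the whole repository.
    return "\n".join(l for l in diff.splitlines() if l.startswith("+"))

def review_workflow(diff: str) -> str:
    context = extract_changed_functions(diff)                        # deterministic
    draft = call_model("heavy-reasoner", f"Review:\n{context}")      # judgment
    verdict = call_model("cheap-checker",                            # verification
                         f"Is this comment grounded in the diff?\n{draft}")
    return draft if "yes" in verdict.lower() else ""                 # gate output
```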
The bottom line
The bottom line may be that we shouldn’t all be modeling Claude Code or OpenClaw and building a linear agent that manages its own context and does “whatever.” Instead, we should build curated workflows with very specific tools, and make sure the evaluations are well thought out for the overall workflow and handled up front. In a month or three, the model and everything else will change; the evaluations endure.