What if an AI could not only write code to fix bugs, but also tell you when it's confident the fix will actually work?
We explored this challenge with Codev – an AI system that achieved 49% success on SWE-bench Lite (a benchmark of real GitHub issues) while maintaining an 82% correlation between its confidence and actual correctness.
The Problem with AI Code Generation
Most AI coding tools suffer from a fundamental issue: they confidently generate code even when they're wrong. This makes them difficult to trust in production environments where a bad fix can break systems for thousands of users.
We wanted to build something different – an AI that knows when it knows, and more importantly, when it doesn't.
How Codev Works
Instead of just generating a single solution, Codev follows a systematic approach:
Research First: It starts by understanding the codebase. It builds a map of how code connects, forms hypotheses about what might be wrong, and searches through potentially millions of lines of code to find the relevant pieces.
Generate Multiple Solutions: Rather than betting everything on one approach, Codev creates several different fixes and tests each one thoroughly.
Pick the Best: Finally, it evaluates all the candidates and selects the one it's most confident will work.
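To make the flow concrete, here is a minimal, hypothetical sketch of that research-generate-select loop in Python. None of these function or parameter names come from Codev itself; the callables stand in for whatever retrieval, patch-generation, and test-running machinery the real system uses.

```python
# Hypothetical sketch of a research -> generate -> select pipeline.
# The helper callables are assumptions, not Codev's actual API.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Candidate:
    patch: str          # the proposed code change
    confidence: float   # estimated probability (0-1) that the patch is correct

def fix_issue(
    issue: str,
    locate_files: Callable[[str], List[str]],        # research step
    propose_patch: Callable[[str, List[str]], str],  # generation step
    score_patch: Callable[[str], float],             # evaluation step
    n_candidates: int = 5,
) -> Candidate:
    # 1. Research: find the files most likely related to the issue.
    relevant_files = locate_files(issue)

    # 2. Generate several independent candidate fixes and score each one.
    candidates = []
    for _ in range(n_candidates):
        patch = propose_patch(issue, relevant_files)
        candidates.append(Candidate(patch=patch, confidence=score_patch(patch)))

    # 3. Pick the candidate the system is most confident in.
    return max(candidates, key=lambda c: c.confidence)
```

The key design choice this sketch highlights is that selection happens over multiple candidates rather than trusting a single generation.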
The Results That Matter
The numbers tell an interesting story:
- 93% of the time, Codev finds the right file to edit
- 49% of the time, it successfully fixes the actual problem
- When Codev is highly confident, it's right 85-100% of the time
- When it's uncertain, success drops to 14-35%
This confidence calibration is particularly valuable. It means you can trust high-confidence fixes while knowing to review the uncertain ones more carefully.
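For readers who want to see what "calibration" means in practice, here is an illustrative sketch of how it can be measured: bucket predictions by stated confidence, then compare each bucket's average confidence to its observed success rate. The sample numbers at the bottom are made up purely to show the output shape; they are not Codev's results.

```python
# Illustrative calibration check: does stated confidence track actual success?
from collections import defaultdict

def calibration_table(results, n_buckets=4):
    """results: list of (confidence, succeeded) pairs, confidence in [0, 1]."""
    buckets = defaultdict(list)
    for confidence, succeeded in results:
        # Map confidence to a bucket index 0 .. n_buckets-1.
        idx = min(int(confidence * n_buckets), n_buckets - 1)
        buckets[idx].append((confidence, succeeded))
    table = []
    for idx in sorted(buckets):
        entries = buckets[idx]
        avg_conf = sum(c for c, _ in entries) / len(entries)
        success_rate = sum(1 for _, s in entries if s) / len(entries)
        table.append((avg_conf, success_rate, len(entries)))
    return table

# Made-up example data, only to demonstrate the function:
sample = [(0.95, True), (0.90, True), (0.85, False),
          (0.30, False), (0.20, True), (0.15, False)]
for avg_conf, success_rate, n in calibration_table(sample):
    print(f"avg confidence {avg_conf:.2f} -> success rate {success_rate:.0%} (n={n})")
```

A well-calibrated system is one where those two columns roughly agree, which is what the 85-100% and 14-35% figures above suggest.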
What This Means for Developers
Beyond the 49% success rate, what matters most is the reliability of the confidence scores. Calibrated confidence opens up practical applications (sketched in code after this list):
- Automated fixes for high-confidence issues
- Focused code review on uncertain solutions
- Smart escalation when the AI knows it's stuck
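Here is a hypothetical sketch of what such a triage policy could look like once a calibrated confidence score exists. The thresholds and action names are illustrative assumptions, not values taken from Codev.

```python
# Hypothetical triage policy built on top of a calibrated confidence score.
# Thresholds are illustrative, not Codev's actual settings.
from enum import Enum

class Action(Enum):
    AUTO_APPLY = "apply the fix automatically"
    HUMAN_REVIEW = "open a pull request for careful review"
    ESCALATE = "hand the issue to a human engineer"

def triage(confidence: float,
           auto_threshold: float = 0.9,
           review_threshold: float = 0.5) -> Action:
    # High confidence: calibration data suggests these fixes are usually right.
    if confidence >= auto_threshold:
        return Action.AUTO_APPLY
    # Middling confidence: worth proposing, but a human should look closely.
    if confidence >= review_threshold:
        return Action.HUMAN_REVIEW
    # Low confidence: the system "knows it's stuck" and escalates.
    return Action.ESCALATE

print(triage(0.95))  # Action.AUTO_APPLY
print(triage(0.30))  # Action.ESCALATE
```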
Imagine getting a pull request that says "I'm 95% confident this fixes the login bug" versus "I'm 30% confident this might help with the payment issue." You'd review them very differently.
The Bigger Picture
The main challenge we're tackling is making AI coding tools genuinely useful for development teams. As noted above, the biggest obstacle is trust: a tool that is confidently wrong is hard to rely on in production.
By building AI that can assess its own confidence, we're trying to solve this trust problem. When the AI says it's confident, you can probably trust it. When it's uncertain, you know to look more carefully.
This could let teams use AI more strategically – auto-merge the high-confidence fixes, review the uncertain ones, and escalate the tricky cases. It's about making AI coding assistance practical rather than just impressive.
Read the full article at co-dev.ai