

KODAH - SWE-POLYBENCH VERIFIED (PYTHON)
Silas Liu - May 6, 2026
Automated Code Fixing, Large Language Models
Kodah ranks 4th in Python on SWE-PolyBench Verified.
113 issues. $4.60 total.
SWE-PolyBench Verified, created by Amazon Science, evaluates autonomous code repair across five production-grade Python repositories, from huggingface/transformers to langchain-ai/langchain. Kodah resolves 32.74% of 113 issues using GPT-5-mini, placing it among systems running on Claude Opus and GPT-5, while remaining the only entry in the ranking with a verified, publicly documented cost per issue.
Kodah at SWE-PolyBench: Elite Performance with Unprecedented Efficiency
Previously, I shared Kodah's results on SWE-bench Lite, where we demonstrated a solid ability to resolve real-world issues at scale. However, professional software development doesn't thrive on isolated cases; it demands consistency across massive, complex repositories and operational economics that make production use viable.
Today, we are raising the bar. I submitted Kodah to SWE-PolyBench Verified, a rigorous benchmark created by Amazon Science. It evaluates code repair capabilities in industrial-scale repositories such as Transformers (HuggingFace), Keras, and LangChain, using instances verified by human experts.
4th Place Globally in Python
Kodah is officially listed on the SWE-PolyBench Leaderboard.
View the full leaderboard: https://amazon-science.github.io/SWE-PolyBench/
Results are independently evaluated and publicly available.
The results confirm that Kodah is not just a promise but a heavyweight competitor at the state of the art. In the Python ranking, Kodah secured 4th position globally, competing directly with solutions from the world's largest Big Tech labs and research institutions.
| System | Python Resolve Rate |
|---|---|
| Atlassian Rovo Dev | 54.87% |
| PrometheusV1.2 + GPT-5 | 36.28% |
| Amazon Q Developer Agent | 35.40% |
| Kodah | 32.74% |
We are operating in territory that only highly specialized systems can navigate, surpassing the vast majority of approaches available on the market.
Smarter Beats Brute Force: The Efficiency Frontier
What makes this result exceptional is not just the resolution rate, but how we achieved it. While the ranking leaders utilize "High Reasoning" models (such as the full GPT-5 and Claude Opus) that demand massive computational and financial overhead, Kodah was evaluated using GPT-5-mini.
The philosophy behind Kodah is clear: intelligent design beats brute force. We delivered flagship-level results using a significantly lighter engine.
| System | Resolve Rate | Estimated Cost | Justification |
|---|---|---|---|
| Atlassian Rovo Dev | 54.87% | ~$1.75 | Public price Opus+GPT-5.2 × 400K tokens (literature) |
| PrometheusV1.2 + GPT-5 | 36.28% | ~$0.75 | Public GPT-5 pricing × 300K tokens + multi-agent overhead |
| Amazon Q Developer | 35.40% | ~$0.60 | $19/1,000 requests × 30 requests/issue (literature) |
| Kodah | 32.74% | $0.045 | Actual recorded value |

*Competitor costs are estimates based on public model pricing and industry-measured token consumption in agentic coding sessions. Actual costs may vary.
Kodah is redefining the economics of autonomous software engineering. While industry-standard high-reasoning agents typically incur costs ranging from $0.50 to $2.00 per issue, driven by multi-pass executions and heavy reasoning tokens, Kodah achieves elite performance at an actual average cost of just $0.045 per task.
This represents roughly an 11x to 44x reduction in operational overhead. By maintaining a lean operational footprint without sacrificing state-of-the-art accuracy, Kodah makes large-scale autonomous code repair not just technically viable, but also economically sustainable.
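The reduction claim is straightforward arithmetic on the per-issue figures quoted above (the $0.50 to $2.00 industry range against Kodah's recorded $0.045); note the upper bound works out closer to 44x than to a round 40x. A quick sketch:

```python
# Per-issue cost figures quoted in the text above (USD).
industry_low, industry_high = 0.50, 2.00  # typical high-reasoning agent cost per issue
kodah_cost = 0.045                        # Kodah's recorded average cost per issue

# Reduction factors relative to the low and high ends of the industry range.
low_factor = industry_low / kodah_cost
high_factor = industry_high / kodah_cost

print(f"{low_factor:.1f}x to {high_factor:.1f}x cheaper")  # 11.1x to 44.4x cheaper
```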
To put this into perspective, we resolved 113 complex instances with a total investment of just $4.60. This level of efficiency is what allows the system to move beyond prototype status into a production-ready solution for large-scale CI/CD pipelines.
Consistency across Industrial Repositories
The complexity of SWE-PolyBench Verified is a direct reflection of professional software engineering. Unlike benchmarks limited to isolated function repairs, these tasks involve multi-file dependencies and interconnected code modules within deeply nested directory structures. Resolving an issue in a repository like HuggingFace Transformers or Keras is a challenge of scale; a single fix often requires addressing logic that spans multiple files and classes.
Kodah's 32.7% resolution rate demonstrates its reliability within these high-entropy environments. Securing the 4th position globally confirms that the system can handle the structural overhead of production-grade code, where the primary hurdle is the sheer breadth and interconnectedness of the codebase.
| Repository | Resolved | Rate | Avg Cost | File F1 |
|---|---|---|---|---|
| langchain-ai/langchain | 5/13 | 38.5% | $0.027 | 0.82 |
| keras-team/keras | 8/22 | 36.4% | $0.035 | 0.67 |
| huggingface/transformers | 23/72 | 31.9% | $0.052 | 0.71 |
| yt-dlp/yt-dlp | 1/5 | 20.0% | $0.048 | 0.41 |
| Significant-Gravitas/AutoGPT | 0/1 | 0.0% | $0.021 | 0.00 |
The data from huggingface/transformers is particularly telling: with 72 issues representing 64% of the benchmark, it is a massive testing ground for any autonomous solution. Kodah maintained a 31.9% resolve rate at this scale.
Across the benchmark's repositories, costs remained remarkably consistent and controlled, with per-repository averages between $0.027 and $0.052 per issue. In langchain-ai/langchain, the system reached its highest fidelity: a 38.5% resolve rate backed by an exceptional file-localization F1 of 0.82.
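The headline figures can be cross-checked against the per-repository table: summing resolved and attempted issues recovers the overall rate, and weighting each repository's average cost by its issue count recovers the overall per-issue cost. A minimal sketch, with all numbers taken from the table above:

```python
# (resolved, attempted, avg cost per issue in USD) per repository, from the table above.
repos = {
    "langchain-ai/langchain":       (5, 13, 0.027),
    "keras-team/keras":             (8, 22, 0.035),
    "huggingface/transformers":     (23, 72, 0.052),
    "yt-dlp/yt-dlp":                (1, 5, 0.048),
    "Significant-Gravitas/AutoGPT": (0, 1, 0.021),
}

resolved = sum(r for r, _, _ in repos.values())
attempted = sum(n for _, n, _ in repos.values())
# Weight each repo's average cost by its issue count to recover the overall average.
avg_cost = sum(n * c for _, n, c in repos.values()) / attempted

print(f"{resolved}/{attempted} resolved = {100 * resolved / attempted:.2f}%")  # 37/113 resolved = 32.74%
print(f"weighted average cost = ${avg_cost:.3f}/issue")                        # $0.045/issue
```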
Precision as a Diagnostic Catalyst
Beyond end-to-end resolution, Kodah operates as a high-fidelity diagnostic layer. The file-level retrieval metrics reveal surgical precision in bug localization.
| Metric | Value |
|---|---|
| Recall | 0.67 |
| Precision | 0.77 |
| F1 Score | 0.69 |
Recent analyses of code agent trajectories show a consistent pattern: high recall but extremely low precision during repository navigation, often below 0.15 in early file discovery stages, leading agents to inspect 8 to 12x more files than necessary[¹].
Most agents search wide. Kodah searches right.
With Precision (0.77) exceeding Recall (0.67), the system prioritizes correctness over coverage. This reduces noise and concentrates attention on the most relevant parts of the codebase.
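For reference, F1 is the harmonic mean of precision and recall. Applying the standard formula to the aggregate values above gives roughly 0.72, slightly above the reported 0.69; the gap is consistent with the table's F1 being averaged per instance rather than computed from the aggregates (my assumption, not stated in the benchmark report):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 formula)."""
    return 2 * precision * recall / (precision + recall)

# Aggregate file-retrieval metrics from the table above.
print(round(f1(0.77, 0.67), 2))  # 0.72
```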
In practice, this changes the workflow. Even when a full fix is not produced, Kodah reliably surfaces the right files with high confidence. Instead of spending time searching, engineers move directly to validation, reducing mean time to resolution (MTTR).
The future of code automation is not just in larger models. It is in systems that do more with less.
[¹] TRAJEVAL: Decomposing Code Agent Trajectories for Fine-Grained Diagnosis, March 2026. Search precision: 0.08-0.12; read precision: 0.04-0.06.