

KODAH
Silas Liu - April 29, 2026
Automated Code Fixing, Large Language Models
Kodah achieves 81% of the resolve rate of Claude Opus 4.6, the current SWE-bench Lite leader, while operating at 1/38th of the cost.
Kodah is an autonomous code repair system that challenges the dominant assumption in software engineering automation: that performance scales with compute. Using GPT-5-mini, a model that costs fractions of a cent per call, Kodah resolves 51% of SWE-bench Lite issues at an average cost of $0.045, delivering near-frontier results at a fraction of what today's leading agents spend.
Beating the Compute Curve: 51% on SWE-bench Lite at $0.045 per Issue
The dominant assumption in automated software engineering is that resolve rate scales with compute: that higher performance is a direct function of how much you spend at runtime. Today's top-tier autonomous agents and frontier models routinely cost between $0.85 and $1.70 per code repair, relying on massive context windows and brute-force reasoning to navigate complexity.
Kodah challenges this paradigm. I built a system that achieves elite performance using GPT-5-mini, a model that costs fractions of a cent per call.
Kodah resolves 51% of SWE-bench Lite issues at an average cost of $0.045 per issue.
| System | Resolve Rate | Cost per Issue |
|---|---|---|
| Claude Opus 4.6 (Thinking) | 62.7% | ~$1.70 |
| GPT-5 | 54.3% | ~$1.25 |
| Kodah | 51.0% | $0.045 |
| Devin | 13.86% | $2.25/ACU |
Kodah reaches 81% of the resolve rate of the current frontier leader, Claude Opus 4.6, while operating at 1/38th of the cost. Among the systems in this comparison, Kodah is the only one where performance and cost move in opposite directions: near-frontier results at a sub-five-cent operational cost.
The total API cost to evaluate all 300 issues in the benchmark was just $13.59.
| Metric | Value |
|---|---|
| Benchmark | SWE-bench Lite (300 issues, 12 Python repos) |
| Resolve Rate | 51.0% (153/300) |
| Average cost per issue | $0.0453 |
| Total cost (all 300) | $13.59 |
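The headline ratios follow directly from these figures. A quick arithmetic sanity check in Python, using only the numbers reported above:

```python
# Figures taken from the comparison and summary tables above.
opus_rate, opus_cost = 62.7, 1.70     # Claude Opus 4.6 (Thinking)
kodah_rate, kodah_cost = 51.0, 0.045  # Kodah

relative_rate = kodah_rate / opus_rate  # fraction of the frontier resolve rate
cost_ratio = opus_cost / kodah_cost     # how many times cheaper per issue
total_cost = 300 * 0.0453               # issues x average cost per issue

print(f"{relative_rate:.0%} of frontier resolve rate")  # 81%
print(f"1/{cost_ratio:.0f}th of the cost")              # 1/38th
print(f"total benchmark cost: ${total_cost:.2f}")       # $13.59
```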
Robustness Across Architectures
The system maintains consistent performance across diverse codebases, indicating that the architecture's efficiency holds as repository size and complexity increase.
| Repository | Resolved | Rate | Avg Cost |
|---|---|---|---|
| psf/requests | 6/6 | 100.0% | $0.049 |
| django/django | 67/114 | 58.8% | $0.038 |
| matplotlib/matplotlib | 13/23 | 56.5% | $0.038 |
| scikit-learn/scikit-learn | 12/23 | 52.2% | $0.029 |
| astropy/astropy | 3/6 | 50.0% | $0.064 |
| mwaskom/seaborn | 2/4 | 50.0% | $0.084 |
| sympy/sympy | 36/77 | 46.8% | $0.061 |
| pydata/xarray | 2/5 | 40.0% | $0.075 |
| pytest-dev/pytest | 6/17 | 35.3% | $0.040 |
| pylint-dev/pylint | 2/6 | 33.3% | $0.048 |
| sphinx-doc/sphinx | 4/16 | 25.0% | $0.039 |
| pallets/flask | 0/3 | 0.0% | $0.028 |
Underlying model: GPT-5-mini, declared for benchmark transparency. Benchmark evaluation was conducted against the official SWE-bench Lite test harness. Results are verifiable against the public leaderboard.
Performance varies across repositories. requests (6/6) and django (67/114) show strong results on well-structured, widely used codebases. At the other end, flask (0/3) and sphinx (4/16) highlight the difficulty of codebases with strong runtime coupling, while sympy (46.8%, at an above-average $0.061 per issue) shows reduced efficiency on issues requiring domain-specific mathematical reasoning.
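As a sanity check, the per-repository rows above can be aggregated to recover the overall resolve rate. The `(resolved, attempted)` pairs below are copied directly from the table:

```python
# Per-repository results from the table above: (resolved, attempted).
results = {
    "psf/requests": (6, 6),
    "django/django": (67, 114),
    "matplotlib/matplotlib": (13, 23),
    "scikit-learn/scikit-learn": (12, 23),
    "astropy/astropy": (3, 6),
    "mwaskom/seaborn": (2, 4),
    "sympy/sympy": (36, 77),
    "pydata/xarray": (2, 5),
    "pytest-dev/pytest": (6, 17),
    "pylint-dev/pylint": (2, 6),
    "sphinx-doc/sphinx": (4, 16),
    "pallets/flask": (0, 3),
}

resolved = sum(r for r, _ in results.values())
attempted = sum(t for _, t in results.values())
print(f"{resolved}/{attempted} = {resolved / attempted:.1%}")  # 153/300 = 51.0%
```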
Cost Distribution: Scaling Without Complexity
The cost curve for Kodah is heavily right-skewed. While the industry standard assumes that complex issues require significantly more compute, the data shows that strong results do not require consistently high operational costs.
- 75% of issues cost less than $0.05.
- 90% of issues cost less than $0.10.
- Fewer than 2% of issues exceeded $0.20.
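A percentile summary like the bullets above could be produced from a per-issue cost log. A minimal sketch using the standard library; since the real per-issue costs are not published here, a synthetic right-skewed (lognormal) sample stands in for illustration:

```python
import random
import statistics

# Hypothetical per-issue cost log; a right-skewed lognormal sample
# stands in for the real (unpublished) per-issue data.
random.seed(0)
costs = [random.lognormvariate(-3.3, 0.8) for _ in range(300)]

# Percentile cut points: q[k-1] is the k-th percentile.
q = statistics.quantiles(costs, n=100)
print(f"p75 = ${q[74]:.3f}, p90 = ${q[89]:.3f}")
print(f"share of issues over $0.20: {sum(c > 0.20 for c in costs) / len(costs):.1%}")
```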
Redefining the Economics of Engineering
The market today operates on two assumptions: that autonomous results require flagship-tier compute, or that lower costs require a human in the loop. Kodah operates outside both constraints, delivering autonomous, high-tier results at a sub-five-cent operational cost.
51% is an early milestone, not an endpoint. Significant headroom remains while maintaining the same cost discipline.