

KODAH
Silas Liu - April 29, 2026
Automated Code Fixing, Large Language Models
Kodah achieves 81% of the resolve rate of Claude Opus 4.6, the current SWE-bench Lite leader, while operating at 1/38th of the cost.
Kodah is an autonomous code repair system that challenges the dominant assumption in software engineering automation: that performance scales with compute. Using GPT-5-mini, a model that costs fractions of a cent per call, Kodah resolves 51% of SWE-bench Lite issues at an average cost of $0.045, delivering near-frontier results at a fraction of what today's leading agents spend.
Beating the Compute Curve: 51% on SWE-bench Lite at $0.045 per Issue
The dominant assumption in automated software engineering is that resolve rate scales with compute: that higher performance is a direct function of how much you spend at runtime. Today's top-tier autonomous agents and frontier models routinely cost between $0.85 and $1.70 per code repair, relying on massive context windows and brute-force reasoning to navigate complexity.
Kodah challenges this paradigm. I built a system that achieves elite performance using GPT-5-mini, a model that costs fractions of a cent per call.
Kodah resolves 51% of SWE-bench Lite issues at an average cost of $0.045 per issue.
| System | Resolve Rate | Cost per Issue |
|---|---|---|
| Claude Opus 4.6 (Thinking) | 62.7% | ~$1.70 |
| GPT-5 | 54.3% | ~$1.25 |
| Kodah | 51.0% | $0.045 |
| Devin | 13.86% | $2.25/ACU |
Kodah reaches 81% of the resolve rate of the current frontier leader, Claude Opus 4.6, while operating at 1/38th of the cost. Among the systems in this comparison, Kodah is the only one where performance and cost move in opposite directions: near-frontier results at a sub-five-cent operational cost.
The total API cost to evaluate all 300 issues in the benchmark was just $13.59.
| Metric | Value |
|---|---|
| Benchmark | SWE-bench Lite (300 issues, 12 Python repos) |
| Resolve Rate | 51.0% (153/300) |
| Average cost per issue | $0.0453 |
| Total cost (all 300) | $13.59 |
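The headline ratios follow directly from these figures. A quick arithmetic sanity check in Python, using only the numbers reported above:

```python
# Figures taken from the comparison and summary tables above.
opus_rate, opus_cost = 62.7, 1.70     # Claude Opus 4.6 (Thinking)
kodah_rate, kodah_cost = 51.0, 0.045  # Kodah

relative_rate = kodah_rate / opus_rate  # fraction of the frontier resolve rate
cost_ratio = opus_cost / kodah_cost     # how many times cheaper per issue
total_cost = 300 * 0.0453               # issues x average cost per issue

print(f"{relative_rate:.0%} of frontier resolve rate")  # 81%
print(f"1/{cost_ratio:.0f}th of the cost")              # 1/38th
print(f"total benchmark cost: ${total_cost:.2f}")       # $13.59
```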
Robustness Across Architectures
The system maintains consistent performance across diverse codebases, indicating that the architecture's efficiency holds as repository size and complexity increase.
| Repository | Resolved | Rate | Avg Cost |
|---|---|---|---|
| psf/requests | 6/6 | 100.0% | $0.049 |
| django/django | 67/114 | 58.8% | $0.038 |
| matplotlib/matplotlib | 13/23 | 56.5% | $0.038 |
| scikit-learn/scikit-learn | 12/23 | 52.2% | $0.029 |
| astropy/astropy | 3/6 | 50.0% | $0.064 |
| mwaskom/seaborn | 2/4 | 50.0% | $0.084 |
| sympy/sympy | 36/77 | 46.8% | $0.061 |
| pydata/xarray | 2/5 | 40.0% | $0.075 |
| pytest-dev/pytest | 6/17 | 35.3% | $0.040 |
| pylint-dev/pylint | 2/6 | 33.3% | $0.048 |
| sphinx-doc/sphinx | 4/16 | 25.0% | $0.039 |
| pallets/flask | 0/3 | 0.0% | $0.028 |
Underlying model: GPT-5-mini, declared for benchmark transparency. Benchmark evaluation was conducted against the official SWE-bench Lite test harness. Results are verifiable against the public leaderboard.
Performance varies across repositories. requests (6/6) and django (67/114) show strong results on well-structured, widely used codebases. At the other end, flask (0/3) and sphinx (4/16) highlight the difficulty of codebases with strong runtime coupling, while sympy (46.8%, at an above-average $0.061 per issue) shows reduced efficiency on issues requiring domain-specific mathematical reasoning.
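As a sanity check, the per-repository rows above can be aggregated to recover the overall resolve rate. The `(resolved, attempted)` pairs below are copied directly from the table:

```python
# Per-repository results from the table above: (resolved, attempted).
results = {
    "psf/requests": (6, 6),
    "django/django": (67, 114),
    "matplotlib/matplotlib": (13, 23),
    "scikit-learn/scikit-learn": (12, 23),
    "astropy/astropy": (3, 6),
    "mwaskom/seaborn": (2, 4),
    "sympy/sympy": (36, 77),
    "pydata/xarray": (2, 5),
    "pytest-dev/pytest": (6, 17),
    "pylint-dev/pylint": (2, 6),
    "sphinx-doc/sphinx": (4, 16),
    "pallets/flask": (0, 3),
}

resolved = sum(r for r, _ in results.values())
attempted = sum(t for _, t in results.values())
print(f"{resolved}/{attempted} = {resolved / attempted:.1%}")  # 153/300 = 51.0%
```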
Cost Distribution: Scaling Without Complexity
The cost curve for Kodah is heavily right-skewed. While the industry standard assumes that complex issues require significantly more compute, the data shows that strong results do not require consistently high operational costs.
- 75% of issues cost less than $0.05.
- 90% of issues cost less than $0.10.
- Fewer than 2% of issues exceeded $0.20.
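A percentile summary like the bullets above could be produced from a per-issue cost log. A minimal sketch using the standard library; since the real per-issue costs are not published here, a synthetic right-skewed (lognormal) sample stands in for illustration:

```python
import random
import statistics

# Hypothetical per-issue cost log; a right-skewed lognormal sample
# stands in for the real (unpublished) per-issue data.
random.seed(0)
costs = [random.lognormvariate(-3.3, 0.8) for _ in range(300)]

# Percentile cut points: q[k-1] is the k-th percentile.
q = statistics.quantiles(costs, n=100)
print(f"p75 = ${q[74]:.3f}, p90 = ${q[89]:.3f}")
print(f"share of issues over $0.20: {sum(c > 0.20 for c in costs) / len(costs):.1%}")
```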
Redefining the Economics of Engineering
The market today operates on two assumptions: that autonomous results require flagship-tier compute, or that lower costs require a human in the loop. Kodah operates outside both constraints, delivering autonomous, high-tier results at a sub-five-cent operational cost.
51% is an early milestone, not an endpoint. Significant headroom remains while maintaining the same cost discipline.