
KODAH - SWE-POLYBENCH VERIFIED (JS/TS/JAVA)

Silas Liu - May 20, 2026

Automated Code Fixing, Large Language Models

Kodah ranks 4th globally across four languages. 3rd in TypeScript and JavaScript. 2nd in class-scope bugs.

SWE-PolyBench Verified, created by Amazon Science, evaluates autonomous code repair across four languages and 21 industrial-scale repositories. Kodah resolves 28.27% of 382 instances using GPT-5-mini, placing 4th globally and ahead of Amazon Q in both TypeScript and JavaScript, at one-tenth the operational cost of the field.

SWE-PolyBench Verified:
Why Multilingual Matters

Production software engineering rarely happens in a single language. TypeScript services call Java backends; JavaScript frontends interact with Python APIs; mobile clients and server infrastructure share business logic across ecosystems. Most autonomous code repair benchmarks evaluate against a single language, reflecting the tooling choices of their respective research communities rather than the composition of real codebases.

SWE-PolyBench Verified, a rigorous benchmark created by Amazon Science, is structured differently: four languages in one benchmark, evaluated against the same global field of competitors. Composed of human-verified instances, it uses industrial-scale repositories such as Microsoft VS Code, Svelte and Google Gson. The instances involve multi-file dependencies and nested directory structures that reflect the actual conditions of professional software development at scale.

This report covers Kodah's complete submission across all four languages. Results are independently verified and publicly available on the official leaderboard:

https://amazon-science.github.io/SWE-PolyBench/

4th Place Globally in the Multilingual Domain

| System | Overall | Python | TypeScript | JavaScript | Java |
| --- | --- | --- | --- | --- | --- |
| Atlassian Rovo Dev | 48.95% | 54.87% | 43.0% | 50.0% | 46.38% |
| PrometheusV1.2 + GPT-5 | 33.77% | 36.28% | 35.0% | 30.0% | 33.33% |
| Amazon Q Developer Agent | 28.80% | 35.40% | 24.0% | 20.0% | 37.68% |
| Kodah (GPT-5-mini) | 28.27% | 32.74% | 26.0% | 25.0% | 28.99% |

The top of the SWE-PolyBench Verified leaderboard is occupied by enterprise-backed systems: Atlassian Rovo Dev, PrometheusV1.2 running on GPT-5, and Amazon Q Developer Agent, each supported by dedicated research teams and flagship-tier compute.

Kodah places 4th overall at 28.27%, a margin of 0.53 percentage points behind Amazon Q Developer Agent's 3rd place result, and advances to 3rd in TypeScript and JavaScript, surpassing Amazon Q in both categories. In Java, Kodah holds 5th place at 28.99%, a result whose per-repository breakdown points directly to the next development priority.

3rd Place in TypeScript and JavaScript:

Ahead of Amazon Q

TypeScript and JavaScript are the most demanding categories on the leaderboard by result: each of the four systems above posts its lowest resolve rate in one of these two languages, reflecting the challenges of runtime symbol resolution, implicit imports and a packaging ecosystem that varies significantly across frameworks and project configurations.

Kodah achieves 3rd place in both TypeScript and JavaScript, placing ahead of Amazon Q, an enterprise-grade system backed by Amazon's research infrastructure, in both. Holding this ranking across both members of the JavaScript family, rather than in a single language, indicates that the result is stable rather than an artifact of instance selection or a favorable draw.

| Repository | Resolved | Rate | Avg Cost | File F1 |
| --- | --- | --- | --- | --- |
| mui/material-ui | 22/70 | 31.4% | $0.049 | 0.53 |
| microsoft/vscode | 4/23 | 17.4% | $0.111 | 0.42 |
| coder/code-server | 0/3 | 0.0% | $0.077 | 0.61 |
| tailwindlabs/tailwindcss | 0/3 | 0.0% | $0.014 | 0.69 |
| angular/angular | 0/1 | 0.0% | $0.338 | 0.00 |
| TOTAL | 26/100 | 26.0% | $0.065 | 0.51 |

Kodah places 3rd in TypeScript at 26.0%, ahead of Amazon Q. The result is built on a submission where 93 of the 100 instances are concentrated in two repositories with very different profiles. mui/material-ui is the most widely adopted React component library in production: 70 instances at 31.4% resolve rate. microsoft/vscode is one of the largest and most actively maintained open-source codebases in existence, with thousands of interdependent modules and a surface area that few systems navigate cleanly: 23 instances at 17.4% resolve rate. The remaining three repositories contribute a combined 7 instances, a sample size too small to be statistically meaningful. The 3rd place ranking holds across both ends of this spectrum.
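As a quick arithmetic check on the TypeScript figures, the sketch below recomputes the per-repository rates, the overall rate and the share of instances held by the two largest repositories. The counts are copied from the table above; the variable and function names are illustrative and not part of any Kodah or SWE-PolyBench tooling.

```python
# (resolved, total) per TypeScript repository, copied from the table above.
ts_results = {
    "mui/material-ui": (22, 70),
    "microsoft/vscode": (4, 23),
    "coder/code-server": (0, 3),
    "tailwindlabs/tailwindcss": (0, 3),
    "angular/angular": (0, 1),
}


def resolve_rate(resolved: int, total: int) -> float:
    """Resolve rate as a percentage of attempted instances."""
    return 100.0 * resolved / total


ts_resolved = sum(r for r, _ in ts_results.values())
ts_instances = sum(t for _, t in ts_results.values())
ts_overall = resolve_rate(ts_resolved, ts_instances)

# Share of all TypeScript instances that sit in the two largest repositories.
top_two = ["mui/material-ui", "microsoft/vscode"]
top_two_share = 100.0 * sum(ts_results[name][1] for name in top_two) / ts_instances

for name, (r, t) in ts_results.items():
    print(f"{name}: {r}/{t} = {resolve_rate(r, t):.1f}%")
print(f"overall: {ts_resolved}/{ts_instances} = {ts_overall:.1f}%")
print(f"top-two share of instances: {top_two_share:.0f}%")
```

Because the overall rate is a ratio of pooled counts, it is dominated by the two large repositories; the three small ones can move it by at most a few points in either direction.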

| Repository | Resolved | Rate | Avg Cost | File F1 |
| --- | --- | --- | --- | --- |
| mrdoob/three.js | 3/4 | 75.0% | $0.031 | 0.65 |
| prettier/prettier | 5/17 | 29.4% | $0.095 | 0.60 |
| serverless/serverless | 8/33 | 24.2% | $0.046 | 0.55 |
| sveltejs/svelte | 9/46 | 19.6% | $0.056 | 0.47 |
| TOTAL | 25/100 | 25.0% | $0.059 | 0.53 |

Kodah places 3rd in JavaScript at 25.0%, ahead of Amazon Q: the same positioning as in TypeScript, across a completely different set of repositories. The volume here sits in sveltejs/svelte (46 instances) and serverless/serverless (33 instances), which together account for 79% of the submission. svelte is a compiler and component framework, a different class of codebase from a utility library, with bugs that frequently involve the transformation pipeline itself rather than isolated modules. Kodah resolves 9 of those 46 instances (19.6%), while holding 24.2% across serverless and 29.4% across prettier. mrdoob/three.js contributes only 4 instances and is not representative.

Java: A Competitive 5th Place Globally

| Repository | Resolved | Rate | Avg Cost | File F1 |
| --- | --- | --- | --- | --- |
| apache/dubbo | 6/15 | 40.0% | $0.115 | 0.66 |
| google/gson | 6/19 | 31.6% | $0.151 | 0.56 |
| apache/rocketmq | 4/18 | 22.2% | $0.153 | 0.65 |
| trinodb/trino | 2/11 | 18.2% | $0.082 | 0.38 |
| apolloconfig/apollo | 0/4 | 0.0% | $0.068 | 0.29 |
| google/guava | 2/2 | 100.0% | $0.064 | 0.67 |
| TOTAL | 20/69 | 28.99% | $0.125 | 0.56 |

Kodah holds 5th place in Java at 28.99%, with the clearest per-repository pattern of the three languages. The repositories that resolve most frequently are the ones with well-bounded, self-contained fault domains: apache/dubbo at 40.0% across 15 instances, google/gson at 31.6% across 19 instances, and google/guava at 100% across 2 instances. The repositories that resolve least are distributed infrastructure systems where issues span multiple abstraction layers across a wide codebase: apache/rocketmq at 22.2% and trinodb/trino at 18.2%. The pattern is consistent enough to be actionable: deeply layered distributed systems are the most clearly identified target for improvement in the next development cycle.
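The TOTAL row's average cost can be read as an instance-weighted mean of the per-repository averages. The sketch below checks that reading against the Java table. It assumes the per-repository costs are themselves per-instance averages (the report does not state how the TOTAL was computed), and because the inputs are rounded to three decimals the result is approximate.

```python
# (repository, instances, average cost in USD), copied from the Java table above.
java_rows = [
    ("apache/dubbo", 15, 0.115),
    ("google/gson", 19, 0.151),
    ("apache/rocketmq", 18, 0.153),
    ("trinodb/trino", 11, 0.082),
    ("apolloconfig/apollo", 4, 0.068),
    ("google/guava", 2, 0.064),
]

java_instances = sum(n for _, n, _ in java_rows)

# Instance-weighted mean: each repository counts in proportion to its size.
java_total_cost = sum(n * cost for _, n, cost in java_rows)
java_weighted_avg = java_total_cost / java_instances

# Unweighted mean of the six repository averages, shown for contrast.
java_simple_avg = sum(cost for _, _, cost in java_rows) / len(java_rows)

print(f"instances: {java_instances}")
print(f"weighted average cost: ${java_weighted_avg:.3f}")
print(f"simple average cost:   ${java_simple_avg:.3f}")
```

The weighted figure matches the reported TOTAL, while the simple mean comes out lower because the two smallest repositories are also among the cheapest.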

2nd Place in Class-Scope Bugs:

The Production-Critical Category

Beyond aggregate resolve rates, the complexity breakdown reveals where performance concentrates across the difficulty spectrum. In the Classes - Only and Classes - Single categories, bugs where both the fault and the required fix are confined to a single class, Kodah ranks 2nd globally in both.

| Complexity Category | Atlassian Rovo Dev | PrometheusV1.2 + GPT-5 | Amazon Q | Kodah | Rank |
| --- | --- | --- | --- | --- | --- |
| Functions - Single | 56.05% | 42.04% | 31.85% | 40.76% | 3rd |
| Functions - Only | 50.20% | 36.95% | 29.32% | 30.92% | 3rd |
| Classes - Single | 100.00% | 70.00% | 50.00% | 70.00% | 2nd |
| Classes - Only | 90.91% | 63.64% | 54.55% | 72.73% | 2nd |
| Classes - None | 59.38% | 46.88% | 40.62% | 34.38% | 4th |
| Classes - Mixed | 36.67% | 16.67% | 20.00% | 13.33% | 4th |

In Classes - Single, Kodah reaches 70.0%, tying PrometheusV1.2 for 2nd place behind Atlassian Rovo Dev. In Classes - Only, the advantage is more pronounced: 72.73% against PrometheusV1.2's 63.64%, a clear 2nd place margin that also surpasses Amazon Q by 18 percentage points. These are the categories where Kodah's structural advantage over enterprise-grade systems is most legible in the data.
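The Rank column can be reproduced from the four scores in each row. The sketch below uses standard competition ranking, where tied systems share the better rank, which is how the 70.00% tie in Classes - Single yields 2nd place. It ranks only the four systems shown in the table, so it is a consistency check on the table rather than a reconstruction of the full leaderboard.

```python
# Resolve rates (%) per complexity category, copied from the table above.
scores = {
    "Functions - Single": {"Atlassian Rovo Dev": 56.05, "PrometheusV1.2 + GPT-5": 42.04, "Amazon Q": 31.85, "Kodah": 40.76},
    "Functions - Only": {"Atlassian Rovo Dev": 50.20, "PrometheusV1.2 + GPT-5": 36.95, "Amazon Q": 29.32, "Kodah": 30.92},
    "Classes - Single": {"Atlassian Rovo Dev": 100.00, "PrometheusV1.2 + GPT-5": 70.00, "Amazon Q": 50.00, "Kodah": 70.00},
    "Classes - Only": {"Atlassian Rovo Dev": 90.91, "PrometheusV1.2 + GPT-5": 63.64, "Amazon Q": 54.55, "Kodah": 72.73},
    "Classes - None": {"Atlassian Rovo Dev": 59.38, "PrometheusV1.2 + GPT-5": 46.88, "Amazon Q": 40.62, "Kodah": 34.38},
    "Classes - Mixed": {"Atlassian Rovo Dev": 36.67, "PrometheusV1.2 + GPT-5": 16.67, "Amazon Q": 20.00, "Kodah": 13.33},
}


def competition_rank(row: dict, system: str) -> int:
    """Rank = 1 + number of systems with a strictly higher score (ties share a rank)."""
    return 1 + sum(1 for value in row.values() if value > row[system])


kodah_ranks = {category: competition_rank(row, "Kodah") for category, row in scores.items()}

for category, rank in kodah_ranks.items():
    print(f"{category}: rank {rank}")

# Gap to Amazon Q in Classes - Only, in percentage points.
classes_only_gap = scores["Classes - Only"]["Kodah"] - scores["Classes - Only"]["Amazon Q"]
print(f"Classes - Only gap vs Amazon Q: {classes_only_gap:.2f} points")
```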

The class-scope stratification matters beyond benchmark rankings. Large-scale empirical analysis of real-world open-source projects consistently identifies semantic faults (logic errors encapsulated within class or module boundaries) as the dominant root cause of software failures, ahead of memory, configuration, or integration errors [¹]. The 2nd place result in both Classes - Only and Classes - Single reflects direct performance on the failure mode most frequently encountered in production OOP codebases.

What the Data Says

The multilingual submission averaged approximately $0.10 per issue across four languages, against a field where leading systems operate at estimated costs between $0.80 and $1.50 per issue, driven by frontier reasoning models and multi-pass execution pipelines [²]. Across four languages and six complexity categories, Kodah ranks 4th globally, competing directly against systems from Amazon, Atlassian, and frontier research institutions, and does so at one-tenth of their operational cost. That combination of ranking and cost is not a coincidence of the test set. It is a structural property.
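The "one-tenth" figure follows from the per-issue costs quoted above. The short sketch below makes the arithmetic explicit, using only the approximate $0.10 average and the estimated $0.80 to $1.50 range for leading systems; both inputs are estimates, so the ratios are indicative rather than exact.

```python
# Per-issue costs in USD, taken from the paragraph above (both are estimates).
kodah_cost = 0.10
field_cost_low, field_cost_high = 0.80, 1.50

# Kodah's cost as a fraction of the field's, at each end of the estimated range.
ratio_vs_cheapest = kodah_cost / field_cost_low
ratio_vs_priciest = kodah_cost / field_cost_high

print(f"vs ${field_cost_low:.2f} per issue: {ratio_vs_cheapest:.3f} of the cost")
print(f"vs ${field_cost_high:.2f} per issue: {ratio_vs_priciest:.3f} of the cost")
```

One-tenth sits inside the resulting interval, between roughly one-eighth at the cheap end of the field and one-fifteenth at the expensive end.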

Top 4 globally, across four languages, at one-tenth the cost of the field. This is the proof of architecture.

kodah.io

[¹] Tan, L., Liu, C., Li, Z., Wang, X., Zhou, Y., & Zhai, C. (2014). Bug characteristics in open source software. Empirical Software Engineering, 19(6), 1665–1705. https://doi.org/10.1007/s10664-013-9258-8

[²] Cost estimates based on public API pricing and reported execution patterns, detailed in the prior submission.
