Trae Achieves #1 on SWE-bench Verified with Claude 3.7

Trae achieves 70.6% accuracy on SWE-bench Verified with Claude 3.7, pushing the boundaries of automated issue resolution. Not just fast. Not just big. But deeply aligned with the nature of engineering.

We are thrilled to announce that Trae, the real AI engineer, has achieved #1 on SWE-bench Verified with a score of 70.6% when evaluated with Claude 3.7, and ranks second only to Tools when the leaderboard includes results from Claude 4.

1. Single Attempt Generation

We provided the agent with the following four tools:

  • str_replace_editor: Enables the Agent to browse files, edit code, etc.

  • Bash: Allows the Agent to execute any command.

  • ckg_tools: Builds a Code Knowledge Graph (CKG) for the code repository, enabling the Agent to efficiently perform search_class and search_function operations.

  • sequential_thinking_tool: Facilitates step-by-step reasoning for the Agent.
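
For concreteness, the snippet below sketches how such a tool set might be declared for a tool-calling LLM, using an Anthropic-style JSON schema. Trae's actual tool definitions are not public, so every parameter name here is illustrative only.

```python
# Hedged sketch: illustrative declarations for the four tools in an
# Anthropic-style tool-use schema. Names and parameters are assumptions,
# not Trae's actual definitions.
TOOLS = [
    {
        "name": "str_replace_editor",
        "description": "Browse files and edit code via string replacement.",
        "input_schema": {
            "type": "object",
            "properties": {
                "command": {
                    "type": "string",
                    "enum": ["view", "create", "str_replace", "insert"],
                },
                "path": {"type": "string", "description": "Absolute file path."},
                "old_str": {"type": "string"},
                "new_str": {"type": "string"},
            },
            "required": ["command", "path"],
        },
    },
    {
        "name": "bash",
        "description": "Execute an arbitrary shell command in the repository.",
        "input_schema": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
    {
        "name": "ckg_tools",
        "description": "Query the Code Knowledge Graph built over the repository.",
        "input_schema": {
            "type": "object",
            "properties": {
                "command": {
                    "type": "string",
                    "enum": ["search_class", "search_function"],
                },
                "identifier": {"type": "string"},
            },
            "required": ["command", "identifier"],
        },
    },
    {
        "name": "sequential_thinking_tool",
        "description": "Record one step of explicit step-by-step reasoning.",
        "input_schema": {
            "type": "object",
            "properties": {
                "thought": {"type": "string"},
                "next_thought_needed": {"type": "boolean"},
            },
            "required": ["thought", "next_thought_needed"],
        },
    },
]
```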

We compared the performance of different models on end-to-end generation for a single attempt. The experimental results are shown in Figure 1.

  • The resolve rate of Claude-3.7-Sonnet ranges from 60.6% to 62.6%

  • The resolve rate of Gemini-2.5-Pro-0506 ranges from 52.4% to 55.0%

  • The resolve rate of OpenAI o4-mini ranges from 54.4% to 55.8%

In the end-to-end experiment with one attempt, the model performance ranking is: Claude-3.7-Sonnet > OpenAI o4-mini > Gemini-2.5-Pro-0506.

Figure 1: Resolve Rate Comparison by Model

2. Selection from Multiple Patches

2.1 LLM-as-a-Selector

We first tried a simple selection method. Initially, we integrated the Agentless regression-testing module: candidate patches that failed the regression tests were excluded from subsequent selection. We then used the selection approach from the open-source Augment SWE-bench Agent project, employing OpenAI o1 as the selector. We refer to this method as LLM-as-a-Selector.
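
The sketch below illustrates the idea under stated assumptions: the OpenAI Python SDK is available, and the caller supplies a `passes_regression` callback standing in for the Agentless-style regression check. The prompt is illustrative, not the exact one used.

```python
# Hedged sketch of LLM-as-a-Selector: regression-filter the candidates,
# then ask a strong reasoning model to pick one. Assumes the OpenAI SDK;
# `passes_regression` is a caller-supplied stand-in for the real check.
from typing import Callable

from openai import OpenAI

client = OpenAI()

def select_patch(
    issue: str,
    candidates: list[str],
    passes_regression: Callable[[str], bool],
) -> str:
    # 1) Regression filter: drop candidates that break existing tests;
    #    if nothing survives, conservatively keep them all.
    survivors = [p for p in candidates if passes_regression(p)] or candidates
    # 2) Ask a reasoning model to pick among the survivors.
    listing = "\n\n".join(f"### Patch {i}\n{p}" for i, p in enumerate(survivors))
    reply = client.chat.completions.create(
        model="o1",
        messages=[{
            "role": "user",
            "content": (
                f"Issue:\n{issue}\n\nCandidate patches:\n{listing}\n\n"
                "Answer with only the number of the patch that best "
                "resolves the issue."
            ),
        }],
    )
    return survivors[int(reply.choices[0].message.content.strip())]
```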

Figure 2 shows how the resolve rate of LLM-as-a-Selector and the Optimal Rate (the rate achieved if the best candidate were always chosen, i.e., the union of candidates) change as the sampling space (number of candidate patches) increases. We can make the following observations:

  • LLM-as-a-Selector does improve performance over a single attempt, which validates the test-time scaling law.

  • Although the Optimal Rate steadily increases, the performance of LLM-as-a-Selector peaks at a sampling size of 5 or 6 and then starts to decline. At sampling sizes of 8 and 9, its performance is even worse than at 3 or 4.

Figure 2: Resolve Rate vs. Sample Space Size

Based on these findings, our goal is to preserve the benefits of test-time scaling while continuing to grow the sampling space. We therefore devised the following selection method.

2.2 Selector Agent

Figure 3: Selector Agent Overview

As shown in Figure 3, our approach consists of three main stages:

  1. Generation Stage: In this stage, we leverage a Coder agent to generate candidate patches based on the issue description. To ensure diversity among the generated candidates, we employ multiple popular LLMs (Claude 3.7 Sonnet, Gemini 2.5 Pro, and o4-mini) as the Coder.

  2. Filtering Stage: Inspired by Agentless, we design a similar component called the Tester agent, which filters out incorrect patches via regression tests. Specifically, the Tester agent automatically retrieves a subset of regression tests from the original project codebase that are relevant to the given issue description. Note that we strictly adhere to the submission constraints of SWE-bench and do not use any hidden test knowledge (e.g., PASS_TO_PASS or FAIL_TO_PASS). The Tester agent first runs the retrieved regression tests on the original codebase to confirm that they pass there, then runs them against each candidate patch and eliminates those that fail. The remaining patches, which pass all selected regression tests, proceed to the next stage. Passing all regression tests does not guarantee a patch's correctness, as some regression tests may themselves require modification; likewise, a correct patch may fail certain regression tests. Nonetheless, our experimental results show that patches passing all regression tests are more likely to be correct, which justifies this filtering strategy. If no candidate patch passes the regression tests, we conservatively retain all candidates for the next stage (a minimal sketch of this filtering step appears after Figure 4).

  3. Voting Stage: Finally, the filtered candidate patches are passed to the Selector agent, which determines the final patch. To better illustrate the internal workings of the Selector agent, we present its detailed workflow in Figure 4.

    1. Syntax-Based Voting: The Selector agent first performs syntax-based voting by clustering the candidate patches by syntactic equivalence and selecting the most frequent cluster as the potential solution. The rationale is that if multiple Coder agents independently generate strictly syntactically equivalent patches, there is strong consensus, suggesting that these highly consistent patches are more likely to be correct. To determine syntactic equivalence, each patch is applied to the original codebase to produce an updated version; we then use Tree-sitter to parse each updated version into an AST, stripping out all comments (a minimal sketch of this clustering step follows Figure 4). The voting result reflects the degree of consensus among the Coder agents and is treated as a strong correctness signal. To further validate the selected patch, we employ a dual-verification mechanism within the Selector agent: leveraging both contextual information (i.e., the issue description and project codebase) and the four tools (i.e., str_replace_editor, Bash, ckg_tools, and sequential_thinking_tool), the Selector agent verifies whether the syntax-voted patch behaves as expected. If it passes this dual-verification, it is returned as the final patch. If the Selector agent remains uncertain about the patch's correctness, the syntax-based voting result is discarded and the process proceeds to the next phase: multi-selection voting.
    2. Multi-Selection Voting: To further reduce the search space of candidate patches, we first deduplicate them using the AST representations obtained in the previous step, retaining only syntactically distinct patches. This deduplication mitigates two issues: (i) repeated patches may bias the selection process by overwhelming the input distribution, and (ii) excessive repetition unnecessarily increases the input token length. Each Selector agent is then tasked with selecting the most likely correct patch from the deduplicated set. The final patch is chosen by aggregating the votes from multiple Selector agents, with the patch receiving the highest number of votes returned as the final output. If the votes remain evenly distributed across all candidate patches, indicating a lack of consensus, the voting is considered unsuccessful; in that case, we iteratively increase the number of voting rounds until one patch emerges with the most votes, which is then selected as the final output (a sketch of this voting loop also follows Figure 4).
Figure 4: Details of the Selector Agent
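
To make the Filtering Stage concrete, here is a minimal sketch of the Tester agent's regression filter. The helpers `retrieve_relevant_tests`, `tests_pass`, and `apply_patch` are hypothetical stand-ins for work the real agent performs through its tool calls.

```python
# Hedged sketch of the Tester agent's regression filter. The three helper
# functions below are hypothetical stand-ins, not real Trae APIs.
def regression_filter(issue: str, repo: str, candidates: list[str]) -> list[str]:
    # Retrieve project tests related to the issue (no hidden test knowledge).
    tests = retrieve_relevant_tests(issue, repo)           # hypothetical
    # Keep only tests that already pass on the unpatched codebase.
    tests = [t for t in tests if tests_pass(repo, [t])]    # hypothetical
    # A candidate survives only if it passes every retained test.
    survivors = [
        p for p in candidates
        if tests_pass(apply_patch(repo, p), tests)         # hypothetical
    ]
    # If no candidate survives, conservatively keep them all.
    return survivors or candidates
```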
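
The syntax-based voting step can be sketched as follows, assuming Python target files and recent tree-sitter / tree-sitter-python bindings; mapping each candidate patch to its patched source file is plumbing omitted here.

```python
# Hedged sketch of syntax-based voting: parse each candidate's patched
# file with Tree-sitter, strip comments, and cluster patches by AST
# fingerprint; the largest cluster wins. Assumes Python targets and
# recent py-tree-sitter / tree-sitter-python bindings.
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

parser = Parser(Language(tspython.language()))

def ast_fingerprint(source: bytes) -> tuple:
    """Flatten the AST into a hashable token stream, skipping comments."""
    tokens: list[tuple] = []

    def walk(node) -> None:
        if node.type == "comment":
            return                                  # strip comments
        if node.child_count == 0:
            tokens.append((node.type, node.text))   # leaf keeps its text
        else:
            tokens.append((node.type,))
            for child in node.children:
                walk(child)

    walk(parser.parse(source).root_node)
    return tuple(tokens)

def syntax_vote(patched_sources: dict[str, bytes]) -> list[str]:
    """Return the patch ids in the largest syntactic-equivalence cluster."""
    clusters: dict[tuple, list[str]] = {}
    for patch_id, source in patched_sources.items():
        clusters.setdefault(ast_fingerprint(source), []).append(patch_id)
    return max(clusters.values(), key=len)
```

Comparing full-file ASTs rather than raw diff text makes the equivalence check robust to cosmetic differences such as comments or formatting.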
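
Finally, a sketch of the multi-selection voting loop. It reuses `ast_fingerprint` from the previous sketch for deduplication; `ask_selector` is a hypothetical stand-in for one run of the Selector agent.

```python
# Hedged sketch of multi-selection voting over AST-deduplicated patches.
# `ask_selector` is hypothetical; it returns the id of one chosen patch.
from collections import Counter

def multi_selection_vote(
    issue: str, patched_sources: dict[str, bytes], rounds: int = 3
) -> str:
    # Deduplicate: keep one representative per syntactic equivalence class.
    unique: dict[tuple, str] = {}
    for patch_id, source in patched_sources.items():
        unique.setdefault(ast_fingerprint(source), patch_id)
    survivors = list(unique.values())
    while True:
        votes = Counter(
            ask_selector(issue, survivors)         # hypothetical agent call
            for _ in range(rounds)
        )
        (winner, top), *rest = votes.most_common()
        if not rest or top > rest[0][1]:           # a unique winner emerged
            return winner
        rounds += 2                                # no consensus: vote again
```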

Final Results: Our approach raises the overall success rate on SWE-bench Verified to 70.6% 🎉. Encouragingly, the approach shows substantial headroom: as the solving capabilities of individual models continue to improve, it is expected to achieve even higher resolve rates in the future.

3. Future Work

Our future work will focus on:

  • Improving single-run success rates: Exploring strategies to enhance the Agent's performance in a single solving attempt.

  • Scaling the sampling space: Investigating whether a larger sampling space enables the model to find more correct solutions.

  • Strengthening the Selector agent: As the sampling space grows, optimizing the Selector agent's ability to identify the best patch.

Meet Us at FSE and ACL

We are attending FSE 2025, presenting our paper AEGIS: An Agent-based Framework for General Bug Reproduction from Issue Descriptions, and organizing an AI-IDE workshop. You can find us at our booth near the registration area and at the workshop on June 27.

Our paper SoRFT: Issue Resolving with Subtask-oriented Reinforced Fine-Tuning has been accepted to the Main Track of ACL 2025. We will present this work at ACL in Vienna; please talk to us if you are attending and are interested in reinforcement learning for software engineering.

About Trae

Trae (/treɪ/) IDE is your helpful coding partner. It offers features like AI Q&A, code auto-completion, and agent-based AI programming capabilities. When developing projects with Trae, you can collaborate with AI to enhance your development efficiency.
