Introducing SWE-bench Verified: Setting a New Standard in Autonomous Software Engineering

Jul 8, 2024

By ModelBox Team

Introduction

OpenAI has introduced SWE-bench Verified, a refined version of the SWE-bench evaluation suite for assessing AI models' software engineering capabilities. The new benchmark consists of 500 human-validated samples and addresses problems found in the original dataset, such as overly specific unit tests and underspecified problem descriptions. To build it, 93 experienced developers screened 1,699 samples, and 68.3% of the original data was removed due to various issues. Scores rise accordingly: GPT-4o resolves 33.2% of SWE-bench Verified samples, more than double its 16% score on the original benchmark. The effort underscores the importance of continually refining AI evaluations and of accounting for ecosystem improvements, such as better agent scaffolding, when assessing model capabilities and potential risks.

Background:


SWE-bench is a popular evaluation suite for assessing large language models' (LLMs) capabilities on software engineering tasks. It challenges AI agents to resolve real software issues sourced from GitHub by generating appropriate code patches. While the benchmark has shown promising results, with top-scoring agents achieving 20% on SWE-bench and 43% on SWE-bench Lite, OpenAI's internal testing revealed limitations that could lead to underestimating models' true capabilities.

Key Issues Addressed:

  1. Overly specific or unrelated unit tests

  2. Underspecified issue descriptions

  3. Difficulties in setting up reliable development environments

SWE-bench Verified: A Collaborative Effort


In collaboration with the original SWE-bench authors, OpenAI developed SWE-bench Verified to address these concerns. The refined dataset consists of 500 samples that have been carefully screened by professional software developers. The new benchmark offers several improvements:

  1. Better-specified tasks and issue descriptions

  2. More appropriate unit tests for solution evaluation

  3. A new Docker-based evaluation harness for easier and more reliable testing
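For context, the dockerized harness is driven from the command line against a file of model-generated predictions. The sketch below shows a typical invocation via Python's subprocess module; the module path and flag names are assumptions based on the public SWE-bench repository, so consult its README for the exact interface and Docker requirements.

```python
# Rough sketch: run the Docker-based SWE-bench harness on a predictions file.
# Module path and flags are assumptions drawn from the public SWE-bench repo;
# check its README for the authoritative interface.
import subprocess

subprocess.run(
    [
        "python", "-m", "swebench.harness.run_evaluation",
        "--dataset_name", "princeton-nlp/SWE-bench_Verified",
        "--predictions_path", "predictions.jsonl",  # patches produced by your agent
        "--max_workers", "4",                       # parallel evaluation containers
        "--run_id", "verified-demo",
    ],
    check=True,
)
```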

Methodology:


OpenAI worked with 93 experienced Python developers to manually review 1,699 random samples from the original SWE-bench test set. Each sample was annotated by three separate developers to ensure quality and consistency. The annotation process focused on two main criteria:

  1. The clarity and specificity of the issue description

  2. The validity of the FAIL_TO_PASS unit tests

Samples were rated on a 0-3 scale for each criterion, with a rating of 2 or 3 indicating a severe issue that warranted removal from the dataset.
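To make the filtering rule concrete, the sketch below applies it to a couple of toy annotation records. The field names are hypothetical (the released annotation files may use different keys); only the keep/discard logic mirrors the rule described above.

```python
# Minimal sketch of the filtering rule: a sample is kept only if neither
# criterion received a severe rating (2 or 3). Field names are hypothetical.

def keep_sample(annotation: dict) -> bool:
    """Return True if the sample passes both screening criteria."""
    issue_severity = annotation["issue_description_severity"]  # 0-3 scale
    test_severity = annotation["fail_to_pass_severity"]        # 0-3 scale
    return issue_severity < 2 and test_severity < 2

annotations = [
    {"instance_id": "example__repo-101",
     "issue_description_severity": 1, "fail_to_pass_severity": 0},
    {"instance_id": "example__repo-202",
     "issue_description_severity": 3, "fail_to_pass_severity": 1},  # underspecified issue
]

verified = [a for a in annotations if keep_sample(a)]
print([a["instance_id"] for a in verified])  # ['example__repo-101']
```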

Results and Impact:


The annotation process revealed that 38.3% of samples had underspecified problem statements, and 61.1% had unit tests that could unfairly mark valid solutions as incorrect. Overall, 68.3% of the original SWE-bench samples were filtered out due to various issues.

Performance on SWE-bench Verified:


Initial testing with GPT-4o using various open-source scaffolds showed significant improvements in performance:

  • GPT-4o's performance reached 33.2% on SWE-bench Verified, more than doubling its previous score of 16% on the original SWE-bench.

  • Performance improvements were observed across different difficulty levels, indicating that the new benchmark better represents model capabilities rather than simply shifting towards easier tasks.

Implications and Future Directions:


The development of SWE-bench Verified highlights several important considerations for AI evaluation:

  1. The need for in-depth understanding and continuous refinement of benchmarks

  2. The importance of accounting for ecosystem progress, including advancements in model scaffolding

  3. Awareness of inherent limitations in static dataset-based evaluations

Conclusion:


SWE-bench Verified represents a significant step forward in accurately assessing AI models' software engineering capabilities. By addressing key limitations of the original benchmark, it provides a more reliable tool for tracking progress in this critical area of AI development. As we continue to advance towards more capable AI systems, the need for robust, well-calibrated evaluations becomes increasingly important.

The SWE-bench Verified dataset, along with the full set of annotations and the annotation rubric, is now available for download and use by the AI research community.
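For readers who want to explore the data, the snippet below sketches how to load the release from Hugging Face with the datasets library. The dataset identifier and column names reflect our understanding of the public release and should be checked against the official dataset card.

```python
# Sketch: load SWE-bench Verified from Hugging Face and inspect one sample.
# Requires the `datasets` library; verify column names against the dataset card.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(ds))  # expected: 500 human-validated samples

sample = ds[0]
print(sample["instance_id"])              # unique task identifier
print(sample["problem_statement"][:300])  # the GitHub issue text
print(sample["FAIL_TO_PASS"])             # tests a correct patch must make pass
```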

Learn more about ModelBox

Official Website: https://www.model.box/

Models: https://app.model.box/models

Medium: https://medium.com/@modelbox

Discord: discord.gg/HCKfwFyF

Ship with ModelBox

Build, analyze, and optimize your LLM workflow with the magic power of ModelBox