DeNovoSWE: Scaling Long-Horizon Environments for Generating Entire Repositories from Scratch

Jiale Zhao, Guoxin Chen, Fanzhe Meng, Wayne Xin Zhao, Ruihua Song, Ji-Rong Wen, Kai Jia

DeNovoSWE provides a large-scale, verifiable dataset for training agents to generate entire software repositories from documentation.

How can we scale the generation of entire software repositories from high-level documentation by using a multi-agent, difficulty-aware data construction pipeline?

Current code agents excel at localized bug fixing but struggle to architect complete software repositories, largely because existing datasets lack the long-horizon, verifiable data required for such complex tasks. The authors introduce DeNovoSWE, a framework that uses a sandboxed, multi-agent "divide-and-conquer" pipeline to automatically generate 4,818 high-quality document-to-repository instances. To ensure data quality, they employ an iterative critic-repair mechanism and a difficulty-aware filtering strategy that dynamically adjusts quality thresholds based on task complexity. Fine-tuning on this dataset significantly boosts long-horizon performance, raising the score of a 30B-parameter model from 5.8% to 47.2% on the BeyondSWE-Doc2Repo benchmark.

Paper Primer

The core challenge in whole-repository generation is creating documentation that is both comprehensive and executable. DeNovoSWE addresses this by decomposing repositories into functional capabilities, then using a critic-repair loop to iteratively refine the documentation until it provides sufficient behavioral constraints for an agent to reproduce the original functionality.

The difficulty-aware filtering mechanism is the system's primary quality control: it maps each task to a difficulty score based on structural complexity and LLM-based judgments, then relaxes the required unit-test pass threshold for harder tasks. This prevents the pipeline from discarding valuable, partially successful trajectories that are common in complex, long-horizon coding.

Training on DeNovoSWE substantially improves long-horizon repository generation performance.

Fine-tuning Qwen3-30B-A3B on DeNovoSWE increased performance on the BeyondSWE-Doc2Repo benchmark from 5.8% to 47.2%.

Difficulty-aware filtering outperforms static, global thresholds.

Ablation studies show that adapting thresholds to instance difficulty yields higher downstream performance than applying a single, fixed pass-rate threshold across all tasks: the BeyondSWE-Doc2Repo score rises from 0.488 with the best fixed-threshold baseline to 0.500 with difficulty-aware filtering.

Why is this dataset necessary if we already have benchmarks like SWE-bench?

SWE-bench focuses on issue-level bug fixing, which does not stress the long-horizon planning and interdependent coding required to build an entire repository from scratch.

How does the framework prevent agents from simply "cheating" by accessing the original source code?

The pipeline uses a rigorous pre-cleanup phase that strips existing code, purges site-package traces and hidden caches, and re-initializes the Git directory, combined with runtime command restrictions to block network-based source recovery.

DeNovoSWE demonstrates that high-quality, automatically curated long-horizon data can bridge the performance gap between open-weight models and proprietary frontier agents in complex software engineering tasks.

The Shift to Repository Generation

LLM agents must move from fixing isolated bugs to building full software repositories.

LLM code agents have shown strong bug‑fixing ability, yet real‑world software demands the creation of entire repositories from high‑level documentation. Existing training sets remain centered on single‑issue fixes, leaving agents without the long‑horizon planning experience needed for repository‑scale tasks.

**Figure 1.** Overview of DeNovoSWE and its role in scaling long-horizon software engineering tasks. Left: DeNovoSWE extends prior SWE datasets along both task scope and task difficulty, moving from localized issue fixing in existing codebases to whole-repository generation from scratch, thereby requiring agents to shift from maintainer-like localized editing toward architect-level repository construction. Right: DeNovoSWE provides substantially larger-scale repository-generation supervision, containing 4,818 tasks, about 46× larger than NL2Repo.

Training agents to build whole repositories requires a dataset that forces long-horizon planning, so we create DeNovoSWE, a large collection of document-to-repository tasks.

Our contributions are threefold: (i) an automated sandboxed pipeline that yields the DeNovoSWE dataset, (ii) a difficulty-aware trajectory filtering mechanism that balances data quality with task diversity, and (iii) a DeNovoSWE-trained agent (Qwen3-30B-A3B) that lifts the BeyondSWE-Doc2Repo benchmark score from 5.8% to 47.2%.

The transition from localized bug fixing to whole‑repository architecture hinges on having verifiable, large‑scale training data, which DeNovoSWE supplies.

The Divide-and-Conquer Framework

DeNovoSWE builds verifiable documentation by recursively splitting a repository into capabilities and iteratively refining each piece.

The core obstacle is that a single pass cannot reliably produce documentation that both describes a whole repository and satisfies its unit‑test suite.

The repository is first split into independent capabilities; each capability gets its own focused documentation, which is later stitched back together.

How does this Divide‑and‑Conquer differ from a naïve hierarchical decomposition that simply nests modules?

The key difference is that our framework explicitly profiles execution traces to separate direct, core‑indirect, and non‑core components, then discards the non‑core parts. A naïve hierarchy would keep every nested module, inflating the documentation and leaking unnecessary implementation details.

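Divide: the overview writer creates a summary, “This repo provides file I/O and string utilities,” and the repository is partitioned into two capabilities, $a_1$ (file I/O) and $a_2$ (string utilities).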

Profile: unit tests invoke $f_{\text{read}}$ and $f_{\text{write}}$, marking them direct; the trace from $f_{\text{read}}$ calls $f_{\text{upper}}$, marking it core‑indirect.

Map: the classifier assigns $f_{\text{read}}$, $f_{\text{write}}$ to $a_1$ and $f_{\text{upper}}$ to $a_2$.

Draft (t=0): $D_1^{(0)}$ = “The file I/O capability provides read/write functions with signatures …”. $D_2^{(0)}$ = “The string utility offers an upper‑case function …”.

Critic: detects that $D_2^{(0)}$ omits the import path for $f_{\text{upper}}$.

Repair: produces $D_2^{(1)}$ by adding “Import path: utils.string.upper” to $D_2^{(0)}$.

The loop adds only the missing pieces identified by the critic, keeping each capability document concise yet complete.
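The profiling step in the example can be sketched as a reachability walk over a traced call graph. This is a minimal illustration, not the authors' implementation: the function name `profile_components`, the dictionary-based call graph, and the extra symbol `f_debug` are hypothetical, and treating every symbol transitively reachable from a test-invoked symbol as core-indirect is an assumption.

```python
from collections import deque


def profile_components(call_graph, test_entry_points, all_symbols):
    """Label each symbol as 'direct', 'core-indirect', or 'non-core'."""
    direct = set(test_entry_points)

    # Breadth-first walk over traced calls, starting from test-invoked symbols.
    reachable = set(direct)
    queue = deque(direct)
    while queue:
        caller = queue.popleft()
        for callee in call_graph.get(caller, ()):
            if callee not in reachable:
                reachable.add(callee)
                queue.append(callee)

    labels = {}
    for symbol in all_symbols:
        if symbol in direct:
            labels[symbol] = "direct"          # invoked by a unit test
        elif symbol in reachable:
            labels[symbol] = "core-indirect"   # reached only through the trace
        else:
            labels[symbol] = "non-core"        # never exercised, so discarded
    return labels


# Toy repository from the worked example, plus a hypothetical unused helper.
call_graph = {"f_read": ["f_upper"], "f_write": []}
labels = profile_components(
    call_graph,
    test_entry_points=["f_read", "f_write"],
    all_symbols=["f_read", "f_write", "f_upper", "f_debug"],
)
print(labels)
```

In this sketch only symbols labeled direct or core-indirect would be carried forward into capability documentation.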

Initialize $t \leftarrow 0$ and generate $D_i^{(0)}$ for each capability $a_i$ using the draft agent.

Run the critic agent to produce $C_i^{(t)}$, highlighting missing direct or core‑indirect components.

If $C_i^{(t)}$ is non‑empty, invoke the repair agent to obtain $D_i^{(t+1)}$ and increment $t$.

Repeat the critic‑repair cycle until $C_i^{(t)}$ is empty or the iteration budget is exhausted.

Merge all finalized $D_i$ with the repository overview to form the complete task documentation.
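The steps above can be sketched as a short loop. The agents here are stand-in functions (hypothetical stubs, not LLM calls), and the iteration budget of 3 is an assumed value, since the source does not state one.

```python
def critic_repair(capability, draft, critic, repair, max_iters=3):
    """Refine one capability document until the critic reports no gaps."""
    doc = draft(capability)                  # D_i^(0)
    for _ in range(max_iters):               # iteration budget (assumed value)
        issues = critic(capability, doc)     # C_i^(t)
        if not issues:                       # empty critique: document is final
            break
        doc = repair(doc, issues)            # D_i^(t+1)
    return doc


def merge(overview, capability_docs):
    """Join the repository overview with all finalized capability documents."""
    return "\n\n".join([overview, *capability_docs])


# Hypothetical stub agents: the critic lists required facts missing from the doc.
def draft(cap):
    return f"The {cap['name']} capability."


def critic(cap, doc):
    return [fact for fact in cap["required"] if fact not in doc]


def repair(doc, issues):
    return doc + "\n" + "\n".join(issues)


capability = {
    "name": "string utility",
    "required": ["Import path: utils.string.upper"],
}
final_doc = critic_repair(capability, draft, critic, repair)
task_doc = merge("This repo provides file I/O and string utilities.", [final_doc])
print(task_doc)
```

Because the repair step only appends what the critic flagged, a document that already satisfies the critic passes through unchanged.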

**Figure 2.** Overview of the DeNovoSWE framework based on a Divide-and-Conquer design. In the Divide Phase (Top), the repository is decoupled via concurrent tracks: Repository Ability Partitioning for capability extraction and Repository Profiling for code-dependency tracing. These aspects are consolidated through an LLM-as-a-Judge to map high-level abilities onto specific code structures. F. & C. denotes functions and classes. In the Conquer Phase (Bottom), an iterative multi-agent pipeline (Draft-Critic-Repair) is executed for Ability-Level Document Generation. The resulting merged documentation is fed into a sandboxed Golden Environment, where the software agent's performance is rigorously benchmarked under strict network and package deployment constraints to determine the final evaluation score.

Evaluation begins with a rigorous pre‑cleanup that strips source code, erases caches, and sanitizes the Git history, guaranteeing a closed‑book task.
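A rough sketch of what such a pre-cleanup could look like on a local checkout follows. It is an illustration under stated assumptions: the function `pre_cleanup`, the list of cache directory names, and the caller-supplied `source_paths` are all hypothetical, the site-packages purge is omitted, and a real run would follow the Git removal with a fresh `git init`.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Iterable

# Assumed cache directory names; the source only mentions "hidden caches".
CACHE_DIR_NAMES = {"__pycache__", ".pytest_cache", ".mypy_cache"}


def pre_cleanup(repo_root: Path, source_paths: Iterable[str]) -> None:
    """Remove original code, cache directories, and Git history."""
    # 1. Strip the original implementation named by the caller.
    for rel in source_paths:
        target = repo_root / rel
        if target.is_dir():
            shutil.rmtree(target)
        elif target.exists():
            target.unlink()

    # 2. Purge cache directories that could leak compiled source.
    #    The listing is materialized first so deletion does not disturb it.
    for path in list(repo_root.rglob("*")):
        if path.is_dir() and path.name in CACHE_DIR_NAMES:
            shutil.rmtree(path, ignore_errors=True)

    # 3. Drop the old Git history (a fresh `git init` would follow).
    shutil.rmtree(repo_root / ".git", ignore_errors=True)


# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "pkg" / "__pycache__").mkdir(parents=True)
    (root / "pkg" / "core.py").write_text("def f(): ...\n")
    (root / "tests" / "__pycache__").mkdir(parents=True)
    (root / ".git").mkdir()
    (root / "README.md").write_text("# demo\n")

    pre_cleanup(root, ["pkg"])
    remaining = sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))

print(remaining)
```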

Runtime cheating is mitigated by a command‑restriction policy that blocks network fetches and by an LLM‑as‑judge audit that flags suspicious shell activity.
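One simple way to picture a command-restriction policy is a deny-list checked before each shell command runs. The sketch below is hypothetical: the blocked programs and subcommands are assumptions chosen for illustration, not the policy used by the authors, and the check covers only a single plain command rather than pipelines or shell wrappers.

```python
import shlex

# Assumed deny-lists; the source only says network fetches are blocked.
BLOCKED_PROGRAMS = {"curl", "wget"}
BLOCKED_SUBCOMMANDS = {
    ("git", "clone"),
    ("git", "fetch"),
    ("git", "pull"),
    ("pip", "download"),
    ("pip", "install"),
}


def is_allowed(command: str) -> bool:
    """Return False for commands that match the deny-lists."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return False  # unparsable input is rejected conservatively
    if not tokens:
        return True

    program = tokens[0].rsplit("/", 1)[-1]  # strip any leading path
    if program in BLOCKED_PROGRAMS:
        return False
    if len(tokens) > 1 and (program, tokens[1]) in BLOCKED_SUBCOMMANDS:
        return False
    return True


for cmd in ["pytest -q", "git status", "curl https://example.com", "git clone https://example.com/x.git"]:
    print(cmd, "->", "allowed" if is_allowed(cmd) else "blocked")
```

A static check like this is easy to evade, which is consistent with the source pairing the restriction policy with an LLM-as-judge audit of shell activity.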

Repositories are retained only if their original unit‑test suite passes ≥ 90 % and test coverage exceeds 50 %, ensuring both executability and sufficient behavioral constraints.
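The retention rule can be written as a small predicate. The function name and argument layout are hypothetical; the two cutoffs (pass rate of at least 90% and coverage above 50%) come from the text above.

```python
def keep_repository(passed_tests: int, total_tests: int, coverage: float) -> bool:
    """Apply the two retention criteria to one candidate repository."""
    if total_tests == 0:
        return False  # no tests means no behavioral constraints
    # Integer form of passed / total >= 0.90, avoiding float rounding.
    passes_enough = 10 * passed_tests >= 9 * total_tests
    covered_enough = coverage > 0.50  # strict inequality, as stated
    return passes_enough and covered_enough


print(keep_repository(9, 10, 0.60))   # meets both criteria
print(keep_repository(9, 10, 0.50))   # coverage does not exceed 50%
print(keep_repository(8, 10, 0.90))   # pass rate below 90%
```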

Table 1 shows that DeNovoSWE contains 4,818 instances, with median coverage 89.6 % and median unit‑test count 9, far surpassing prior benchmarks.

Difficulty-Aware Trajectory Filtering

Adaptive filtering keeps hard trajectories while enforcing quality on easy ones.

Instead of a single static cutoff, the filter sets a per‑instance score threshold that follows the estimated difficulty of the repository, preserving informative hard trajectories while tightening quality on easy ones.

Why not simply use a fixed score threshold for all repositories?

Because repository difficulty varies widely: a hard repository may never reach a 0.95 pass‑ratio even when the agent solves most of the code, while an easy repository should be held to a stricter standard. A static cutoff would discard valuable hard examples and admit low‑quality rollouts on easy tasks.
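A minimal sketch of a per-instance threshold makes the contrast with a static cutoff concrete. The bin edges and threshold values below are hypothetical placeholders, not the values reported in the paper, and the dictionary layout of a trajectory is assumed for illustration.

```python
from bisect import bisect_right

# Hypothetical difficulty bins and per-bin score thresholds.
BIN_EDGES = [0.33, 0.66]            # splits difficulty into easy / medium / hard
THRESHOLDS = [0.95, 0.80, 0.60]     # stricter for easy, relaxed for hard


def threshold_for(difficulty: float) -> float:
    """Look up the score threshold for an instance's difficulty bin."""
    return THRESHOLDS[bisect_right(BIN_EDGES, difficulty)]


def filter_trajectories(trajectories):
    """Keep trajectories whose score meets their difficulty-specific threshold."""
    return [t for t in trajectories if t["score"] >= threshold_for(t["difficulty"])]


trajectories = [
    {"id": "easy-low", "difficulty": 0.20, "score": 0.90},
    {"id": "easy-high", "difficulty": 0.10, "score": 0.97},
    {"id": "hard-partial", "difficulty": 0.80, "score": 0.70},
]

adaptive = [t["id"] for t in filter_trajectories(trajectories)]
static = [t["id"] for t in trajectories if t["score"] >= 0.95]
print(adaptive)
print(static)
```

With these placeholder values, the adaptive filter keeps the partially successful hard trajectory that a fixed 0.95 cutoff would discard, while still rejecting the weaker rollout on the easy task.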

**Figure 3.** Left: trajectory scores decrease as the estimated difficulty score increases, showing that fixed score thresholds would disproportionately discard trajectories from harder repository-generation instances. The red curve reports the mean trajectory score within each difficulty bin. Right: DeNovoSWE covers a broad range of task difficulties, with instances distributed across the full difficulty spectrum. These observations motivate our difficulty-aware trajectory filtering strategy, which adapts the filtering threshold according to instance difficulty.

Experimental Evaluation

DeNovoSWE agents achieve the strongest open-weight results on repository-generation benchmarks.

DeNovoSWE-Agent-35A3B reaches 50.0% on the BeyondSWE-Doc2Repo benchmark, narrowing the gap to proprietary models.

Table 3 reports 0.500 for DeNovoSWE-Agent-35A3B versus 0.617 for the strongest proprietary system, an 11.7-percentage-point difference.

Across both Qwen backbones, DeNovoSWE-trained agents consistently outperform prior open-weight baselines, with the 30B model improving from 5.8% to 47.2% on Doc2Repo and the 35B model adding 6.3 pp to reach 50.0%. The ablation in Table 4 confirms that difficulty-aware filtering further lifts performance, especially on harder instances. These results suggest that high-quality long-horizon trajectories are a primary driver of the gains.

OpenHands is an event‑driven platform that lets LLM agents edit files and run shell commands inside isolated containers, providing a reproducible sandbox for software‑engineering tasks.

How does OpenHands differ from generic tool‑use wrappers?

OpenHands records a full event log and enforces container isolation, whereas generic wrappers typically invoke a single command without tracking intermediate state. This makes OpenHands well suited to generating the detailed trajectories required by DeNovoSWE.

**Table 3.** Performance comparison of various models on Doc2Repo and NL2Repo benchmarks.

**Table 4.** Evaluation results of downstream tasks across different trajectory filtering threshold strategies. The intervals represent the difficulty score ranges, while the corresponding inner values denote the filtering thresholds yielded by the model on the DeNovoSWE dataset. All reported metrics are averaged across three independent execution trials to ensure statistical stability.

DeNovoSWE‑trained agents substantially outperform prior open‑weight baselines on repository‑generation benchmarks.

Conclusion and Dataset Structure

We recap DeNovoSWE’s contributions and its impact on long‑horizon repository generation.

Generating whole repositories from documentation demands long‑horizon planning beyond issue‑level benchmarks. DeNovoSWE addresses this gap by providing a large‑scale, verifiable dataset that supports such planning.

The dataset was built via a divide‑and‑conquer pipeline with an iterative critic‑repair loop, yielding 4,818 high‑quality instances. We also produced trajectories with DeepSeek‑V4 and applied difficulty‑aware filtering to balance execution quality and task diversity.

Empirically, fine‑tuning Qwen3‑30B‑A3B‑Instruct on DeNovoSWE raises BeyondSWE‑Doc2Repo accuracy from 5.8% to 47.2%. The larger Qwen3.5‑35B‑A3B backbone shows consistent gains, reaching 50.0% on BeyondSWE‑Doc2Repo and 27.1% on NL2RepoBench.

**Table 5.** DeNovoSWE data structure specification.

Training Hyperparameters

This appendix lists the SFT and evaluation hyperparameters and describes the license filtering applied to the dataset.

The supervised fine‑tuning (SFT) stage uses a small set of core hyperparameters that control batch processing, context length, and learning‑rate scheduling.

This table lists various training hyperparameters and their corresponding values.

For evaluation on the NL2Repo‑Bench and BeyondSWE‑Doc2Repo benchmarks, a separate hyperparameter set governs generation behavior and token limits.

The table lists various configuration parameters and their corresponding values.

Before constructing the training corpus, DeNovoSWE filters repositories by license, keeping only those with permissive terms such as MIT, Apache‑2.0, BSD, ISC, CC0‑1.0, and similar open‑source licenses.
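A license filter of this kind can be sketched as an allow-list lookup on SPDX-style identifiers. The exact identifier set is an assumption: the source names license families (for example "BSD") rather than specific variants, and the repository records below are made up for illustration.

```python
# Assumed allow-list of SPDX identifiers; the BSD variants are a guess
# at what the "BSD" family covers.
PERMISSIVE_LICENSES = {
    "MIT",
    "Apache-2.0",
    "BSD-2-Clause",
    "BSD-3-Clause",
    "ISC",
    "CC0-1.0",
}


def is_permissive(license_id):
    """Return True when the license identifier is on the allow-list."""
    return license_id is not None and license_id.strip() in PERMISSIVE_LICENSES


# Hypothetical repository metadata records.
repos = [
    {"name": "alpha", "license": "MIT"},
    {"name": "beta", "license": "GPL-3.0-only"},
    {"name": "gamma", "license": None},
    {"name": "delta", "license": "Apache-2.0"},
]

kept = [r["name"] for r in repos if is_permissive(r["license"])]
print(kept)
```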

Repository Overview Prompt

Defines the prompt used to generate repository documentation overviews.

This appendix specifies the exact prompt that drives the DeNovoSWE data‑construction pipeline. The prompt instructs the model to synthesize a concise, natural‑language overview of a software repository, grounding the description in the repository’s README, file tree, and capability outline.

Repository overview prompt template

Repository Ability Prompt

