SWE-Explore: Benchmarking How Coding Agents Explore Repositories

Shaoqiu Zhang, Yuhang Wang, Jialiang Liang, Yuling Shi, Wenhao Zeng, Maoquan Wang, Shilin He, Ningyuan Xu, Siyu Ye, Kai Cai, Xiaodong Gu

SWE-Explore isolates repository exploration from patch generation, benchmarking agents on line-level context selection.

How can we isolate and measure the effectiveness of a coding agent's repository exploration strategy, independent of its ability to write code?

Coding agents are typically evaluated on a binary pass/fail basis, which obscures whether a failure stems from poor repository exploration or an inability to synthesize a correct patch. SWE-Explore formalizes exploration as a standalone task: given an issue, an agent must return a ranked list of code regions, which are then scored against ground-truth lines derived from successful agent trajectories. Across 848 issues, the benchmark reveals that while modern agents excel at file-level localization, they remain recall-limited at the line level, with exploration quality strongly predicting downstream repair success.

Paper Primer

The benchmark addresses the "black box" nature of end-to-end coding benchmarks by isolating the exploration phase. It treats exploration as a ranked, line-level context-selection task, allowing for a direct comparison between lexical retrievers, dense retrievers, and interactive agents without requiring them to generate a final patch.

The core mechanism relies on trajectory-grounded supervision: the authors extract read actions from multiple successful agent runs, intersect them to identify core evidence, and refine these regions via LLM-assisted filtering and manual audit. This creates a precise, line-level target that represents the actual code spans required to resolve an issue.

Exploration metrics are highly predictive of downstream repair success.

Controlled experiments using a restricted-context repair bridge show that Context Efficiency correlates strongly with resolve rate (r=0.950).

Current coding agents are file-level proficient but line-level recall-limited.

While agents achieve high HitFile and nDCG@500 scores, their line-level recall (Recℓ) typically remains between 0.14 and 0.19.

Why is a new benchmark needed if we already have SWE-bench?

SWE-bench measures only the final outcome, making it impossible to distinguish between an agent that failed to find the right code and one that found the code but failed to write the correct patch. SWE-Explore isolates the exploration step to diagnose these specific failure modes.

How does this benchmark handle the fact that different agents might use different code to solve the same issue?

The benchmark uses a cross-trajectory intersection method to identify "core" evidence consulted by multiple successful agents, supplemented by an LLM-assisted refinement step to include load-bearing optional reads, all of which are then manually audited.

Researchers should shift focus from file-level retrieval to line-level recall, as current agents frequently reach the correct file but fail to surface the specific spans necessary for a successful fix.

The Exploration Bottleneck

We expose the hidden exploration gap in repository‑level coding benchmarks and introduce SWE‑Explore to measure it.

Repository‑level coding benchmarks such as SWE‑bench evaluate agents with a single pass/fail score, which masks the underlying steps needed to solve a bug. This binary view hides two distinct failure modes: (1) the agent never discovers the relevant code, and (2) it discovers the code but cannot synthesize a correct patch. Because the former is invisible to existing metrics, the true capability of “exploration” remains under‑measured.

Effective coding agents must first locate the exact lines that contain the bug or the needed API before they can generate a fix; without reliable exploration the downstream synthesis step is doomed to fail.

SWE‑Explore formalizes the exploration task: given an issue and a repository, an explorer must output a ranked list of code regions within a fixed line budget. The benchmark covers 848 issues across 10 programming languages and 203 open‑source repositories, and derives line‑level ground truth from independent agent trajectories that successfully solved each issue. We evaluate explorers on coverage, ranking quality, and context‑efficiency, and show that these metrics strongly predict downstream repair performance.

**Figure 1.** Motivation of SWE-Explore. A holistic metric of resolution rate conflates exploration, localization, and patch synthesis. SWE-Explore isolates repository exploration as a line-level evaluation target.

Current benchmarks conflate navigation with synthesis, preventing precise measurement of an agent’s ability to locate relevant code.

Benchmark Landscape

Survey of existing coding benchmarks and explorer methods, highlighting SWE‑Explore’s unique coverage.

Repository‑level coding benchmarks have focused either on end‑to‑end issue resolution (e.g., SWE‑bench, Verified, Live) or on intermediate behaviors such as context retrieval (ContextBench, SWE‑Pruner, SWE‑ContextBench). None of them jointly evaluate line‑level, trajectory‑grounded exploration and its downstream impact.

Explorer methods range from classic IR baselines (TF‑IDF, BM25) to dense retrievers and LLM‑driven agents (AutoCodeRover, LocAgent, OrcaLoca, CoSIL, CodeScout). Their evaluations typically reward early retrieval (nDCG) but stop short of measuring line‑level region quality.

**Table 1.** Comparison of SWE-Explore with existing repository-level coding and exploration benchmarks across six design dimensions covering ground-truth granularity, evaluation protocol, and ranked-region assessment.

Defining Exploration Quality

We annotate ground‑truth core context from multiple successful trajectories and define line‑level exploration metrics.

We treat code spans that repeatedly appear in independent successful runs as the “core” of the solution, then optionally add model‑specific reads that truly matter.

Normalize reads: A → {(utils.py,10,30), (utils.py,50,70)}; B → {(utils.py,20,40), (utils.py,50,70)}.

Intersect line‑wise: the first reads overlap on lines 20‑30, and the second reads coincide exactly at (utils.py,50,70). $R_{\text{int}}$ = {(utils.py,20,30), (utils.py,50,70)}.

Assume Agent A also reads (utils.py,80,90) which is not in $R_{\text{int}}$; this becomes an optional read $R(\text{A})$.

LLM refinement promotes (utils.py,80,90) because it contains a helper function used by the patch; manual audit confirms it.

Final $R_{\text{core}}$ = $R_{\text{int}}$ ∪ {(utils.py,80,90)}.

The intersection captures code that is essential across independent solutions, while optional reads let the benchmark reflect useful but model‑specific context.
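The worked example above can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' pipeline: the helper names `to_lines` and `to_regions` are hypothetical, and the LLM refinement plus manual audit is replaced by a hard-coded promotion of (utils.py,80,90).

```python
from collections import defaultdict

def to_lines(regions):
    """Expand (path, start, end) regions into a set of (path, line) identifiers."""
    return {(p, l) for p, s, e in regions for l in range(s, e + 1)}

def to_regions(lines):
    """Collapse (path, line) identifiers into maximal contiguous regions."""
    by_file = defaultdict(list)
    for p, l in lines:
        by_file[p].append(l)
    out = []
    for p in sorted(by_file):
        nums = sorted(by_file[p])
        start = prev = nums[0]
        for n in nums[1:]:
            if n != prev + 1:          # gap found: close the current region
                out.append((p, start, prev))
                start = n
            prev = n
        out.append((p, start, prev))
    return out

# Reads from the example; Agent A additionally reads lines 80-90.
reads_a = [("utils.py", 10, 30), ("utils.py", 50, 70), ("utils.py", 80, 90)]
reads_b = [("utils.py", 20, 40), ("utils.py", 50, 70)]

r_int = to_lines(reads_a) & to_lines(reads_b)      # line-wise intersection
optional_a = to_lines(reads_a) - r_int             # A's reads outside the intersection
promoted = to_lines([("utils.py", 80, 90)])        # stand-in for refinement + audit
r_core = r_int | promoted

print(to_regions(r_int))   # [('utils.py', 20, 30), ('utils.py', 50, 70)]
print(to_regions(r_core))  # [('utils.py', 20, 30), ('utils.py', 50, 70), ('utils.py', 80, 90)]
```

Working on sets of (path, line) identifiers keeps the intersection exact at line granularity. Note that A's optional reads also contain lines 10-19, which no refinement step promotes in this example.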

Filter raw issue‑resolution instances to retain only those with ≥2 successful trajectories from strong LLMs.

For each retained instance, collect all read actions from each trajectory and map them to $(p, s, e)$ regions.

Compute $R_{\text{int}} = \bigcap_{\tau \in T} R(\tau)$ by intersecting regions file‑wise at line granularity.

For each model family $m$, compute optional reads $R(m) = \bigcup_{\tau \in T_m} R(\tau) \setminus R_{\text{int}}$.

Run an LLM‑based refinement that promotes a small subset of $R(m)$ when the reads are load‑bearing for the issue.

Manually audit the refined set to produce the final $R_{\text{core}}$ used for evaluation.

Metrics evaluate how well an explorer’s ranked list of code regions covers the audited core context, both at fine line granularity and at coarser file/region levels.

How does this differ from a simple majority‑vote over read regions?

Majority voting would keep any region seen by more than half the agents, ignoring the line‑wise overlap that guarantees the region is truly common. Our intersection operates at line granularity, so only the exact overlapping lines survive, producing a stricter and more reproducible core.

**Figure 2.** Overview of SWE-Explore. From solution-verified trajectories, SWE-Explore extracts read actions, aggregates them into core and optional context, and evaluates explorers using both upstream exploration metrics and downstream restricted-context validation.

**Figure 3.** Language distribution of the 848 retained SWE-Explore instances across 10 different coding languages.

**Table 2.** Per-instance averages of the ground-truth core context $|R_{core}|$ at the file, region, and line granularity.

Experimental Setup

We detail the explorer families, baselines, metrics, and evaluation protocol used on SWE-Explore.

The experimental setup defines which explorers are compared, the metrics reported, and the protocol for evaluating them on the SWE‑Explore benchmark. Every explorer outputs its top‑K predicted code regions with K=5, which are then scored against the ground‑truth core context.

Explorer families group the retrieval methods we compare, ranging from simple baselines to full coding agents.

How does a dense retriever like the RAG‑Potion pipeline differ from sparse baselines such as BM25?

Dense retrievers embed code tokens and perform nearest‑neighbor search in a continuous vector space, whereas sparse methods compute term‑frequency weighted scores on exact lexical matches.

Restricted‑Context Validation measures whether the regions an explorer selects suffice for a fixed patcher to resolve the issue when the rest of the repository is hidden.

Why evaluate with a restricted view instead of the full repository?

Limiting the patcher’s view to the explorer’s selected regions ties downstream success to exploration quality: the patcher, prompt, and budget are held fixed, so it cannot compensate by reading code the explorer missed.

Oracle returns the ground‑truth core region directly, establishing an upper bound on retrieval performance.

What would happen if the Oracle returned an approximate region instead of the exact core?

The reported upper bound would drop, shrinking the performance gap between learned explorers and the Oracle and obscuring the true difficulty of the task.

BM25 is a classic sparse lexical retriever that scores documents by term frequency and inverse document frequency.

Why might BM25 struggle with code compared to natural language?

Code identifiers often appear infrequently and share many symbols, reducing the term‑frequency discrimination that BM25 relies on.
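To make the scoring concrete, here is a minimal sketch of Okapi BM25 over pre-tokenized chunks. It is an illustration only: the parameter values `k1=1.5` and `b=0.75`, the whitespace tokenization, the toy chunks, and the non-negative IDF variant are assumptions, not the configuration used in the benchmark.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against a tokenized query with Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n            # average document length
    df = Counter(t for d in docs for t in set(d))    # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in set(query):
            if t not in tf:
                continue
            # Non-negative IDF variant: rare terms get larger weights.
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical code chunks, tokenized on whitespace.
docs = [
    "def parse config path".split(),
    "def load config file config".split(),
    "class user model".split(),
]
scores = bm25_scores("config file".split(), docs)
ranking = sorted(range(len(docs)), key=lambda i: -scores[i])
print(ranking)
```

The second chunk ranks first because it matches both query terms, and the third scores zero because it shares no token with the query. This is the behavior that hurts on code: an issue that never names the relevant identifier gives BM25 nothing to match.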

Select an explorer from the families defined above.

Fix the retrieval budget K=5 for every run.

Run the explorer on each issue in the SWE‑Explore benchmark, providing the repository snapshot.

Collect the top‑5 predicted file regions for each instance.

Score the predictions using Precision, nDCG@500, HitFile, and Context Efficiency (plus auxiliary metrics).

Aggregate the scores across all instances to obtain the final numbers reported in the paper.

Consider a lexical retriever that returns five files, with the ground‑truth core region in file C (the only file relevant to the fix) at rank 3.

Precision@5 = 1/5 = 0.20 because only one of the five returned files is the core.

nDCG@5 = (1 / log₂(3)) ≈ 0.63 because the core appears at rank 3.

HitFile = 1 (the core file appears in the top‑5).

Context Efficiency = (size of core region) / (size of union of returned regions) ≈ 0.45.

Even a simple lexical retriever can achieve a non‑zero HitFile, but its low Precision and Context Efficiency reveal limited relevance to the true core.
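The file-level numbers in this example can be checked with a short script. The file names A through E and their order are assumptions; the example only fixes the core file C at rank 3. Two rank discounts are shown because conventions differ: the quoted 0.63 corresponds to a discount of $1/\log_2 i$, whereas the widely used $1/\log_2(i+1)$ form would give 0.50 for the same ranking. With a single relevant item the ideal DCG is 1 under both conventions, so the discounted gain equals nDCG.

```python
import math

ranked = ["A", "B", "C", "D", "E"]   # hypothetical top-5 output, best first
core = {"C"}                          # the single ground-truth core file

hits = [f in core for f in ranked]
precision_at_5 = sum(hits) / len(ranked)
hit_file = float(any(hits))
rank = hits.index(True) + 1           # 1-indexed rank of the core file

# Discount 1/log2(i) for i >= 2: reproduces the 0.63 quoted above.
discount_log_rank = 1 / math.log2(rank)
# Common alternative 1/log2(i + 1): gives 0.50 for rank 3.
discount_log_rank_plus_1 = 1 / math.log2(rank + 1)

print(precision_at_5, hit_file, rank)
print(round(discount_log_rank, 2), discount_log_rank_plus_1)
```

Context Efficiency is not recomputed here because the example does not state the sizes of the returned regions.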

**Figure 4.** Example of a SWE-Explore instance. **Left:** an issue plus a repo snapshot with the highlighted core span. **Right:** trajectory-derived core regions $C_{core}$ (scoring target) and optional regions $C_{opt}$, an explorer’s ranked prediction scored against $C_{core}$.

Results: Exploration Quality

Agentic explorers markedly improve resolve rates and correlate with downstream success.

Agentic explorers reach resolve rates up to 59 % while lexical retrievers stay below 27 %.

Table 3 shows Oracle 59.7 % and CoSIL 59.3 % versus TF‑IDF 26.0 % and BM25 12.7 %.

**Table 3.** Downstream resolve rate under the restricted-context validation environment (GPT-5.4 with Mini-SWE-Agent, $K=5$).

**Table 4.** Explorer-level correlation between each up-stream exploration metric and downstream resolve rate, computed across all explorers in our pool. $\downarrow$ marks lower-is-better.

**Table 5.** Exploration quality at $K=5$ across different LLMs powering the same Mini-SWE-Agent scaffold. Bold marks the best result per column; underline marks the second best.

Robustness and Context Degradation

Dataset details: instance count, language spread, schema, and repository snapshot assumptions.

Recall that coding agents often fail because they cannot efficiently locate the relevant code in a large repository; SWE‑Explore isolates this exploration phase to benchmark agents against ground‑truth trajectories.

When the visible core context shrinks, a patcher’s ability to produce a correct fix drops sharply once essential pieces disappear, even if extra irrelevant code is added.

Why does missing core context hurt more than adding redundant, irrelevant code?

Patchers need a set of essential code fragments to trigger the correct reasoning path; without them the model cannot form the necessary causal links, whereas extra irrelevant fragments are simply ignored once the essential pieces are present.

**Figure 5.** Resolve rate as the visible context degrades from the Oracle's full core set $R_{core}$ to either $\alpha\%$ of $R_{core}$ alone (GT scaling, solid) or $\alpha\%$ of $R_{core}$ padded back to full size with random non-core regions (noise injection, dashed).

The dataset schema records each instance’s identifier, repository, source benchmark, problem statement, core and optional region annotations, and provenance of successful trajectories, all anchored to a fixed repository snapshot.

Instance Case Study

General agents achieve full file hits while lexical baselines miss entirely, mirroring the global ordering.

The paper’s central premise is that coding agents often fail to locate the relevant code in large repositories; SWE‑Explore isolates this “exploration” phase by benchmarking agents against ground‑truth trajectories derived from successful resolutions.

General‑purpose agents achieve a perfect file‑hit rate (1.00) whereas lexical baselines remain at zero, reproducing the global ordering reported in §4.3.

Table 8 shows the Oracle and all general agents (Claude Code, Mini‑SWE‑Agent, AweAgent, Codex, OpenHands) with HitFile = 1.00, while Random, TF‑IDF and Potion all have HitFile = 0.00.

What is held constant across all explorers is the ground‑truth definition: two core files spanning only 26 lines (supervised.py 850–870 and test_supervised.py 245–249) and the evaluation metrics (HitFile, Noise, Recℓ, F1, Cov) computed on the same top‑5 output budget.

Three patterns emerge. First, the simple baselines (Random, TF‑IDF, Potion) never hit a target file; the bug description lacks distinctive identifiers for a retriever to match. Second, academic localizers pinpoint the implementation file but miss the test file, capping their HitFile at 0.50. Third, general‑purpose agents retrieve both files, differing only in how much surrounding context they include.

**Table 8.** Top-5 outputs and metrics on scikit-learn/scikit-learn#10844. HF = HitFile, Noise = NoiseFile, Rec$_\ell$ = line recall, F1 = line F1, Cov = weighted core coverage.

Per-explorer outputs. Table 8 reports every explorer's top-5 output and the resulting metrics. Regions are abridged to fit the column; entries in [ brackets ] mark a region whose file overlaps a ground-truth file.

Ground-Truth Construction Details

How line‑level targets are extracted, normalized, and refined for evaluation.

This appendix details the pipeline that turns raw read traces into line‑level ground‑truth targets used for scoring.

We first extract observable file‑reading behavior and map it to repository‑relative line regions. Three signal types are accepted: editor view calls that include an explicit file path and line range, command‑line reads such as cat, head, tail, or sed -n when the target file can be resolved, and grep‑style search hits that report line numbers. Signals that cannot be uniquely tied to a file‑interval pair are discarded.

Path normalization ensures that only reads inside the repository are kept. Absolute paths are accepted only if they resolve within the checkout; relative paths are cleaned by removing redundant ./ components and resolving .. where possible. Reads that map to multiple candidate files or to locations outside the repository are dropped.
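A minimal sketch of this normalization step, assuming POSIX-style paths and a known checkout root. The function name `normalize_read_path` and the root `/repo` are hypothetical. The sketch works purely on path strings, so it does not resolve symlinks and does not handle reads that map to multiple candidate files.

```python
import posixpath

def normalize_read_path(raw, repo_root):
    """Return a repository-relative path, or None if the read is outside the checkout."""
    root = posixpath.normpath(repo_root)
    # Relative paths are interpreted against the checkout root.
    joined = raw if posixpath.isabs(raw) else posixpath.join(root, raw)
    # normpath removes redundant "./" components and resolves ".." segments.
    full = posixpath.normpath(joined)
    if not full.startswith(root + "/"):
        return None                      # outside the repository (or the root itself)
    return posixpath.relpath(full, root)

print(normalize_read_path("./src/../utils.py", "/repo"))   # utils.py
print(normalize_read_path("/repo/pkg/a.py", "/repo"))      # pkg/a.py
print(normalize_read_path("../other/x.py", "/repo"))       # None
```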

Each remaining read is converted into a tuple $(p,s,e)$, where $p$ is the repository‑relative file path and $[s,e]$ is a 1‑indexed closed line interval. Whole‑file reads are expanded to the file’s line count, out‑of‑range intervals are clipped to valid boundaries, empty intervals are removed, and overlapping or adjacent intervals from the same trajectory and file are merged.
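The clipping and merging rules can be sketched for a single file as follows. The function name `normalize_intervals` is hypothetical, and a whole-file read is assumed to have been expanded by the caller to `(1, file_len)`.

```python
def normalize_intervals(intervals, file_len):
    """Clip 1-indexed closed intervals to [1, file_len], drop empty ones,
    and merge intervals that overlap or are adjacent."""
    clipped = []
    for s, e in intervals:
        s, e = max(s, 1), min(e, file_len)
        if s <= e:                        # empty intervals are removed
            clipped.append((s, e))
    merged = []
    for s, e in sorted(clipped):
        if merged and s <= merged[-1][1] + 1:   # overlapping or adjacent
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

reads = [(10, 30), (31, 40), (95, 120), (0, 3), (50, 45)]
print(normalize_intervals(reads, file_len=100))
# [(1, 3), (10, 40), (95, 100)]
```

Here (10,30) and (31,40) merge because they are adjacent, (95,120) is clipped to the 100-line file, (0,3) is clipped to start at line 1, and the inverted interval (50,45) is dropped as empty.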

Let $T$ be the set of successful trajectories for an instance and $R(\tau)$ the merged line regions read by trajectory $\tau$. The **core context** is the file‑wise intersection across all successful trajectories: $R_{\text{raw}}^{\text{core}} = \bigcap_{\tau\in T} R(\tau)$. The **optional context** is everything read by any successful trajectory that lies outside this intersection: $R_{\text{raw}}^{\text{opt}} = \bigcup_{\tau\in T} R(\tau) \setminus R_{\text{raw}}^{\text{core}}$.

SWE‑Explore adopts the refined core as the primary scoring target because it is anchored in the lines that every successful solution consulted, extended only by audited promotions from the optional context. Optional context is retained for diagnostics and for computing a context‑efficiency metric, but it may include exploratory detours and redundant file openings.

Pure intersection can miss load‑bearing evidence when different agents use distinct but equivalent code. To recover such evidence, we run an LLM‑assisted refinement step: optional regions that are repeatedly visited, adjacent to core evidence, or close to modified code are presented to a refinement model together with the issue statement and a compact summary of successful trajectories. The model outputs a binary decision, a short rationale, and a precise line interval to promote into the refined core; candidates without a promotable interval are rejected.

Every promoted region is then manually audited. Auditors check that the region exists in the evaluated checkout, that it is relevant to the issue (rather than merely adjacent or frequently opened), and that its inclusion plausibly reflects evidence a successful solution relied on. Regions failing any check are removed, keeping the target conservative while recovering missed load‑bearing context.

For analysis we keep three target variants: (1) the pure‑intersection target (maximally conservative), (2) the refined‑core target (the default used in experiments), and (3) the full‑union target (high‑recall but noisy). These variants let us study the trade‑off between recall and precision in the evaluation of exploration agents.

Metric Definitions

Defines the line‑level metrics and ranking calculations used to evaluate explorers.

Metrics operate on repository‑relative line identifiers $(p,\ell)$, where $p$ is a normalized file path and $\ell$ is a 1‑indexed line number. For a region $r$ we write $L(r)$ for the set of its line identifiers; the union over all predicted regions is $L(P)$. The core target set is $Y = L(R_{\text{core}})$.
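With this notation, line-level coverage reduces to set operations on $L(P)$ and $Y$. The sketch below assumes the standard definitions of recall, precision, and F1 over line identifiers; the exact formulas are not restated here, so treat them as an assumption. The core target is the 26-line ground truth from the scikit-learn case study, and the prediction is hypothetical.

```python
def line_ids(regions):
    """L(.): expand (path, start, end) regions into (path, line) identifiers."""
    return {(p, l) for p, s, e in regions for l in range(s, e + 1)}

def line_metrics(predicted, core):
    """Line-level recall, precision, and F1 of predicted regions against the core."""
    pred, y = line_ids(predicted), line_ids(core)
    overlap = len(pred & y)
    recall = overlap / len(y) if y else 0.0
    precision = overlap / len(pred) if pred else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}

core = [("supervised.py", 850, 870), ("test_supervised.py", 245, 249)]
predicted = [("supervised.py", 840, 860)]   # hypothetical explorer output

metrics = line_metrics(predicted, core)
print(len(line_ids(core)))                  # 26
print({k: round(v, 3) for k, v in metrics.items()})
```

The prediction hits the right file yet covers only 11 of the 26 core lines, which illustrates how an explorer can score a file hit while remaining recall-limited at the line level.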

For a line budget $B$, $P_{\le B}$ is the longest prefix of the prediction whose cumulative visible lines do not exceed $B$. This penalises early large regions that would exhaust the budget before more useful evidence appears.
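A minimal sketch of the budget prefix, assuming that "cumulative visible lines" counts unique line identifiers, so a line repeated across regions is counted once. The function name and the example regions are hypothetical.

```python
def budget_prefix(ranked_regions, budget):
    """Longest prefix of the ranked prediction whose unique lines fit within the budget."""
    seen, prefix = set(), []
    for path, start, end in ranked_regions:
        candidate = seen | {(path, l) for l in range(start, end + 1)}
        if len(candidate) > budget:
            break                 # prefix semantics: stop at the first overflow
        seen = candidate
        prefix.append((path, start, end))
    return prefix

ranked = [("a.py", 1, 400), ("b.py", 1, 150), ("c.py", 1, 20)]
print(budget_prefix(ranked, budget=500))
# [('a.py', 1, 400)]
```

The 20-line region in c.py would fit on its own, but it never becomes visible because the prefix ends as soon as b.py overflows the budget. This is the sense in which a large early region is penalised.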

The ideal DCG is built greedily: at each step the evaluator picks the remaining ground‑truth region that adds the most uncovered core lines while respecting the remaining line budget. Ties are broken by shorter region length and then by repository path, yielding a budget‑matched upper bound for each instance.
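The greedy construction can be sketched as below. This covers only the ordering step, not the DCG computation itself, and it assumes a region consumes its full length from the budget. The function name and the example regions are hypothetical.

```python
def ideal_order(core_regions, budget):
    """Greedy budget-matched ordering of ground-truth regions."""
    remaining = list(core_regions)
    covered, order, used = set(), [], 0
    while remaining:
        candidates = []
        for region in remaining:
            path, start, end = region
            length = end - start + 1
            if used + length > budget:
                continue          # does not fit in the remaining budget
            gain = len({(path, l) for l in range(start, end + 1)} - covered)
            if gain > 0:
                # Sort key: most new core lines, then shorter region, then path.
                candidates.append((-gain, length, path, start, region))
        if not candidates:
            break
        best = min(candidates)[-1]
        order.append(best)
        remaining.remove(best)
        path, start, end = best
        covered |= {(path, l) for l in range(start, end + 1)}
        used += end - start + 1
    return order

core = [("b.py", 1, 30), ("a.py", 5, 34), ("c.py", 1, 10)]
print(ideal_order(core, budget=45))
# [('a.py', 5, 34), ('c.py', 1, 10)]
```

In this example a.py and b.py tie on gain and length, so the path breaks the tie. After a.py is chosen, b.py no longer fits in the remaining 15 lines, so the smaller c.py region is taken instead.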

Noise rate is the fraction of predicted regions that overlap neither core nor optional context, i.e. regions that provide no useful evidence.
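As a sketch, the noise rate is a region-level overlap test against both annotation sets. The optional region and the predictions below are hypothetical, and returning 0.0 for an empty prediction is an assumption made for this example.

```python
def overlaps(region, others):
    """True if the region shares at least one line with any region in `others`."""
    path, start, end = region
    return any(path == p and start <= e and s <= end for p, s, e in others)

def noise_rate(predicted, core, optional):
    """Fraction of predicted regions overlapping neither core nor optional context."""
    if not predicted:
        return 0.0                # assumption: empty predictions contribute no noise
    noisy = [r for r in predicted
             if not overlaps(r, core) and not overlaps(r, optional)]
    return len(noisy) / len(predicted)

core = [("supervised.py", 850, 870), ("test_supervised.py", 245, 249)]
optional = [("utils.py", 40, 60)]                 # hypothetical optional context
predicted = [
    ("supervised.py", 840, 860),                  # overlaps core
    ("utils.py", 1, 50),                          # overlaps optional
    ("README.md", 1, 20),                         # overlaps nothing: noise
    ("test_supervised.py", 240, 250),             # overlaps core
]
print(noise_rate(predicted, core, optional))      # 0.25
```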

Metrics are computed per instance and then averaged across instances; empty predictions receive zero for all coverage, ranking, first‑hit, and efficiency metrics.

The restricted‑context protocol tests whether the selected regions suffice for actual patch generation under a fixed patching scaffold.

For each explorer output we normalise paths, clip intervals to file boundaries, and replace lines outside the selected intervals with blanks, preserving repository structure for downstream debugging.
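The masking step for a single file might look like the following. It is a sketch under the assumption that "blanks" means empty lines; the function name `mask_file` is hypothetical.

```python
def mask_file(text, intervals):
    """Blank out every line outside the selected 1-indexed closed intervals,
    keeping the line count (and therefore line numbers) unchanged."""
    lines = text.split("\n")
    keep = set()
    for start, end in intervals:
        # Clip each interval to the file boundaries before marking lines as visible.
        keep.update(range(max(start, 1), min(end, len(lines)) + 1))
    return "\n".join(
        line if i in keep else "" for i, line in enumerate(lines, start=1)
    )

source = "import os\ndef f():\n    return 1\nprint(f())"
masked = mask_file(source, [(2, 3)])
print(repr(masked))
# '\ndef f():\n    return 1\n'
```

Because hidden lines are replaced rather than deleted, line numbers in the masked file still match the original checkout, which keeps tracebacks and generated diffs aligned.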

All restricted‑context runs share the same patcher, prompt template, tool set, and interaction budget; only the visible context varies.

After a patch is generated it is applied to the original checkout and evaluated with the benchmark’s executable harness; only patches that compile and pass the tests count as resolved.

Unresolved runs are logged with coarse failure reasons such as no diff, invalid diff, patch apply failure, out‑of‑context patch, test failure, timeout, or infrastructure error.

Explorer outputs are normalised to an ordered list of at most $K=5$ repository‑relative line regions, each defined by a path and a closed line interval.

BM25 and TF‑IDF rank repository chunks by lexical similarity to the issue statement; the top chunks are converted into line regions, while Potion uses a lightweight dense retriever with the same interface.

Agentic explorers run under their original search scaffolds, then their file, function, or region outputs are mapped to the most specific line‑level spans supported by the method.

Before scoring, each prediction is checked for path validity, interval validity, and repository membership; invalid or empty predictions are discarded.
