GrepSeek: Training Search Agents for Direct Corpus Interaction

GrepSeek trains compact LLMs to perform direct, shell-based corpus search, outperforming index-based RAG systems.

Can LLMs effectively perform knowledge-intensive search by interacting directly with raw text corpora via shell commands instead of using pre-computed retrieval indices?

Traditional retrieval systems rely on pre-computed indices and semantic embeddings, which often suffer from semantic conflation and struggle with precise, multi-hop reasoning tasks. GrepSeek replaces these black-box retrievers with a compact agent trained to interact directly with raw text corpora using executable Unix shell commands like `grep`. This approach achieves state-of-the-art performance on multi-hop benchmarks while eliminating the need for expensive offline indexing and large-scale vector databases.

Paper Primer

The core mechanism is a two-stage training pipeline that transforms a compact language model into a surgical search agent. First, the model undergoes supervised fine-tuning on a cold-start dataset of verified, causally grounded search trajectories generated by an answer-aware Tutor. Second, the policy is refined using Group Relative Policy Optimization (GRPO) to improve task-oriented search behavior through direct interaction with the corpus.

To ensure practical latency, the system employs a semantics-preserving sharded-parallel execution engine. This engine acts like a distributed mail sorter: it splits the corpus into shards, executes shell pipelines in parallel across them, and merges the results, maintaining byte-exact equivalence with sequential execution while accelerating retrieval by up to 7.6×.

GrepSeek achieves superior performance on multi-hop reasoning tasks compared to index-based RAG and dense retrieval baselines.

The agent achieved the best token-level F1 scores on four out of seven benchmarks, including HotpotQA, 2WikiMultihopQA, and MuSiQue. Statistically significant improvements were observed on three of these benchmarks, with the agent effectively isolating symbolic patterns and entity names that confound embedding-based models.

Why use direct shell-based interaction instead of standard dense retrieval?

Direct interaction allows for exact string matching and fine-grained lexical control, which are critical for multi-hop reasoning where dense retrievers often fail due to semantic ambiguity or entity collisions.

What is the primary limitation of this approach?

The agent relies on exact lexical matching, making it brittle to surface-form variations like diacritics or spelling differences, where dense retrievers typically remain more robust.

GrepSeek demonstrates that compact models can learn to treat a raw corpus as an executable environment, providing a high-precision, low-overhead alternative to traditional index-based retrieval for complex reasoning tasks.

Introduction to GrepSeek

Frames the gap between index‑based retrieval and Direct Corpus Interaction and introduces GrepSeek.

Current LLM search agents depend on a pre‑computed index of document embeddings to retrieve evidence. This design limits granularity, forces static chunking, and incurs costly index construction for ever‑growing corpora. GrepSeek discards the index entirely, letting a compact agent issue Unix‑style shell commands that operate directly on the raw text.

RAG augments a language model with a separate retriever that returns relevant documents, which the model then consumes to produce an answer.

Number of lines N ≈ 1 000 000.

Full similarity matrix size = N × N ≈ 10¹² entries.

Assuming 8 bytes per entry, memory required ≈ 8 TB, far exceeding typical hardware.

Even modest‑size corpora quickly exceed feasible memory when treated as a dense retrieval problem, motivating a lightweight, command‑driven alternative.

**Figure 1.** Comparison of retrieval-augmented agentic search and direct corpus interaction. **Left:** retrieval-augmented agentic search relies on pre-computed indices where the agent queries a retriever that returns top documents. **Right:** DCI enables direct corpus access via shell commands, executed by a parallel engine that runs pipelines on shards and aggregates results without requiring an index.

GrepSeek learns a compact search policy that (1) generates shell commands, (2) filters intermediate results, and (3) composes evidence across multiple steps. Training proceeds in two stages: a Tutor builds verified backward chains of commands from ground‑truth answers, a Planner converts them into forward, causally valid trajectories, and finally GRPO refines the policy through reinforcement learning. A sharded‑parallel engine speeds up command execution by up to 7.6× while preserving byte‑exact results.

The core shift is from static, index‑based retrieval to dynamic, raw‑corpus interaction via executable commands.

The GrepSeek Mechanism

We train compact agents to execute shell commands efficiently for Direct Corpus Interaction.

Existing DCI systems rely on massive language models to compose shell commands, making a single query take minutes. Our goal is to teach a small, trainable agent to issue precise commands and manage its own context.

Instead of building a static index, the agent runs Unix‑style commands (e.g., rg -F "term") directly on the raw text files, treating the corpus as a searchable shell.

How does DCI differ from Retrieval‑Augmented Generation (RAG) that also queries external text?

RAG first retrieves a fixed set of passages using a pre‑computed index and then feeds them to a language model. DCI skips the retrieval index entirely; the agent itself issues shell commands that directly scan the raw files, so the search is dynamic and index‑free.

Before the agent can be trained, we synthesize complete search trajectories by decomposing a target answer into sub‑queries and verifying them backwards, ensuring every step is grounded in observable shell output.

Why not simply generate the forward trajectory directly from the question?

Generating forward without verification often yields commands that retrieve irrelevant text, because the model has no guarantee that each step actually brings the answer closer. The backward phase forces each command to be provably useful before it is used for training.

Decompose the query $q$ and gold answer $y$ into ordered sub‑queries $(q_1,\dots,q_N)$ using the tutor model.

Initialize empty sets for aliases $F$, evidence $D$, and a trajectory buffer $S$.

Iterate $i$ from $N$ down to 1: call DISCOVER to propose a command $c_i$ that retrieves evidence $d_i$ for the target entity $a$.

If discovery fails, abort the trajectory (FAIL).

Update $S$ with $(i,q_i,a,c_i,d_i)$ and augment $D$ with $d_i$.

If $i>1$, call GETBRIDGE to adjust the target entity and alias set for the next higher‑level sub‑query.

After the backward pass, reverse $S$ to obtain forward order and initialise a history buffer $H$.

For each entry in $S$, let the planner draft a reasoning step $\theta_d$ and a candidate command $c_d$; align it with the tutor’s target $(c_i,d_i)$.

Append the aligned reasoning, command, and evidence to $H$.

Finally, let the planner produce an answer $\hat{y}$ from $H$, format the training trajectory, and verify it against the gold answer.

DISCOVER proposes the command

EXEC runs the command on $C$, returning observations $d_1$ = {line 1, line 3}.

CHECK verifies that $d_1$ contains the target entity “Alice”, so discovery succeeds.

GETBRIDGE is skipped because there is only one sub‑query.

Forward assembly drafts a reasoning step “Alice appears in two lines” and aligns it with the tutor’s target.

The planner finally answers “Alice runs and reads”.

Even a single‑line command can produce multiple evidence snippets; the backward verification ensures every snippet is relevant to the target entity.

**Figure 2.** Workflow of GrepSeek: iterative interaction with corpus with shell commands.

**Algorithm 1** Cold-start Trajectory Generation for GrepSeek **Input:** query $q$, gold answer $y$; corpus $C$; max refinement iterations $M$; tutor $M_T$, planner $M_P$ **Output:** Verified training trajectory $T_{train}$, or FAIL // Phase A: Goal-Aware Decomposition and Backward Verification 1: $(q_1, \dots, q_N) \leftarrow M_T.\text{DECOMPOSE}(q, y)$ $\triangleright$ Generate ordered sub-queries; FAIL if unparsable 2: $a \leftarrow y; F \leftarrow \emptyset; D \leftarrow \emptyset; S \leftarrow \langle \rangle$ $\triangleright$ $a$: target entity; $F$: aliases; $D$: evidence context 3: **for** $i = N$ down to 1 **do** 4: $(success, c_i, d_i) \leftarrow \text{DISCOVER}(q_i, a, F, D)$; **if** $\neg success$ **then return** FAIL 5: $S \leftarrow S \oplus \langle i, q_i, a, c_i, d_i \rangle; D \leftarrow D \cup \{d_i\}$ 6: **if** $i > 1$ **then** 7: $(a, F) \leftarrow \text{GETBRIDGE}(q, q_{i-1}, q_i, a, d_i)$; **if** $a = \emptyset$ **then return** FAIL 8: **end if** 9: **end for** // Phase B: Forward Assembly (Answer-Blind Planner, Tutor-Guided) 10: $S \leftarrow \text{REVERSE}(S); H \leftarrow \langle \rangle$ $\triangleright$ Restore forward order; init state history 11: **for each** $\langle i, q_i, a, c_i, d_i \rangle \in S$ **do** 12: $(\theta_d, c_d) \leftarrow M_P.\text{DRAFT}(q, H)$ $\triangleright$ Answer-blind prediction of reasoning and action 13: $\theta \leftarrow M_T.\text{ALIGN}(q, H, \theta_d, c_d, c_i, d_i)$; **if** $\theta = \emptyset$ **then return** FAIL 14: $H \leftarrow H \oplus (\theta, c_i, d_i)$ $\triangleright$ Ensure reasoning is strictly conditioned on causal history 15: **end for** // Phase C: Answer Formulation and Quality Assurance 16: $\hat{y} \leftarrow M_P.\text{ANSWER}(q, H)$ 17: $T_{train} \leftarrow \text{FORMAT}(q, H, \hat{y})$ 18: **if** $\hat{y} = \emptyset \lor F_1(\hat{y}, y) = 0 \lor \text{JUDGE}(q, T_{train}) \neq \text{Pass}$ **then return** FAIL 19: **return** $T_{train}$ 20: **function** $\text{DISCOVER}(q', a, F, D)$ $\triangleright$ Identify a target command that retrieves evidence for $a$ 21: **for** $t = 1, \dots, M$ **do** 22: $c \leftarrow M_T.\text{PROPOSE}(q', a, \{a\} \cup F, D)$ $\triangleright$ Strictly exclude target & aliases 23: $d \leftarrow \text{EXEC}(c, C)$; **if** $M_T.\text{CHECK}(q', a, d, F) = \text{True}$ **then return** (True, $c, d$) 24: **end for** 25: **return** (False, $\emptyset, \emptyset$) 26: **end function**

Performance and Efficiency

GrepSeek delivers dense‑retrieval quality with far lower memory and preprocessing costs.

GrepSeek achieves a micro‑average F1 of $0.5691$, surpassing the best dense baseline ($0.4948$).

See Table 1; the improvement over the strongest dense retriever is statistically significant (p < 0.05).

The daemon keeps the entire corpus loaded in RAM and reuses long‑lived search workers, so each retrieval call avoids costly process startup and disk loading.

How does the Persistent Search Daemon differ from a typical cached web search?

Typical caches store previously fetched pages but still invoke a full search engine for each query. The daemon, by contrast, holds the whole raw text in RAM and reuses the same search processes, eliminating both network latency and repeated index loading.

The corpus is split into independent shards that are searched in parallel, reducing per‑shard workload and allowing latency to drop as more shards are added.

Why isn’t sharding just a form of data parallelism that any retrieval system can use?

Because GrepSeek’s search commands are lightweight shell pipelines; sharding them incurs negligible coordination cost. Dense retrievers need to merge large embedding indexes, which adds significant overhead and defeats the purpose of parallelism.

**Figure 3.** Efficiency and cost analysis of GrepSeek compared to dense retrieval baselines (E5 and Qwen3-4B). (a) Inference latency per query, broken down into LLM generation and tool execution time. (b) Memory footprint (RAM) required for the retrieval index. (c) Offline indexing cost measured in A100-hours. (d) Search tool latency of GrepSeek scaling with the number of shards.

**Figure 4.** Effect of the number of SFT trajectories on $F_1$ scores (EM is in Figure 17 in Appendix C) after RL training.

**Figure 5.** Training dynamics over 200 steps comparing GrepSeek with retrieval baselines (E5, BM25, and Qwen3-Emb-4B). (a) Mean reward score during training. (b) Average response length measured in tokens. (c) Average number of search queries generated per example.

GrepSeek matches dense‑retrieval quality while cutting memory use to a fraction of the baseline.

Related Work

GrepSeek swaps pre‑computed indices for a Direct Corpus Interaction agent that runs Unix shell commands on raw text.

Retrieval‑Augmented Generation (RAG) introduced the idea of augmenting language models with an external index so they can draw on up‑to‑date facts rather than relying solely on parametric memory.

Beyond a single retrieval step, recent retrieval‑augmented reasoning methods make search a multi‑turn process: the system first extracts an entity, formulates a follow‑up query, retrieves additional evidence, and finally composes an answer.

Key examples include STARQA for structured corpora, IRCoT which interleaves chain‑of‑thought reasoning with retrieval, and the search‑agent families Search‑R1 and Search‑O1 that respectively train the model to issue better queries or orchestrate a fixed retriever via prompting.

CoSearch takes a third route by jointly optimizing the reasoning policy and the underlying ranking component, aiming to improve both query formulation and evidence selection in a unified training loop.

Evaluation of these agentic systems typically uses multi‑hop question answering benchmarks, where a model must retrieve a sequence of supporting passages; we follow this tradition and also acknowledge broader deep‑search suites such as BrowseComp and Total Recall QA for future study.

Direct Corpus Interaction (DCI) offers a different angle: instead of querying a pre‑built index, an agent issues explicit Unix shell commands (grep, awk, etc.) over the raw corpus, controlling matching, filtering, and composition directly.

Earlier systems such as the code‑search tools by Di Grazia & Pradel and recent DCI‑style agents (Sen et al., Li et al., Subramanian et al.) demonstrate the feasibility of deterministic string and regex search at scale, while efficiency‑focused engines like Succinct and Swift provide the low‑level substrate for fast execution.

Supplementary Results

Supplementary Exact‑Match results confirm GrepSeek’s gains and reveal trade‑offs.

GrepSeek replaces pre‑computed retrieval indices with a Direct Corpus Interaction agent that runs Unix shell commands on raw text, enabling dynamic, index‑free search.

GrepSeek achieves the best overall Exact Match performance, surpassing dense retrieval baselines with a clear statistical margin.

Table 8 shows GrepSeek’s micro‑average EM of 0.4948 versus the best dense model’s lower score; McNemar’s test confirms significance.

At the dataset level GrepSeek leads on four of seven benchmarks (NQ, HotpotQA, 2Wiki, MuSiQue) with statistically significant gains (↑) on NQ, HotpotQA, and 2Wiki. It drops on PopQA (↓) and trails the strongest Search‑R1 configuration on TriviaQA and Bamboogle, highlighting sensitivity to surface‑form variation.

**Figure 17.** Effect of the number of Supervised Fine-Tuning (SFT) trajectories on the Exact Match (EM) score after RL training.

**Table 9.** Ablation study of GrepSeek across single-hop and multi-hop benchmark datasets (EM scores). Superscript $\uparrow$ indicates a statistically significant improvement over both ablated variants. We use McNemar's test because significance is computed over paired binary exact-match outcomes for the same set of questions, and apply Bonferroni correction ($p < 0.05$).

Qualitative Case Studies

Case studies show the Direct Corpus Interaction agent outperforms dense retrieval across diverse queries.

We evaluate the Direct Corpus Interaction agent on five representative queries that expose common failure modes of dense‑retrieval models. Each query is crafted to test a distinct capability: exact lexical matching, hierarchical entity resolution, name‑collision handling, multi‑step reasoning, and temporal freshness.

The DCI agent succeeds on all five case‑study queries, while the dense retriever fails on each.

Successes are documented in Examples 1–5, where the agent retrieves the correct answer and the dense baseline returns an incorrect or stale result.

By contrast, the dense retriever repeatedly misfires: it confuses subsidiary and parent brands, selects a similarly named school, returns stale record‑holder documents, and collapses when exact surface forms (e.g., diacritics) are required. These failures illustrate the trade‑off between lexical control and semantic robustness.

**Figure 14.** User prompt for the final answer formulation step.

Failure Analysis

Limits of Grep‑based retrieval and synthetic trajectory generation

GrepSeek swaps static retrieval indices for a Direct Corpus Interaction (DCI) agent that issues Unix shell commands directly on the raw corpus.

In Example 6 the agent searches for “Citibank” and “founded”, finds the 1863 charter, and then looks up the 1863 U.S. president, returning Abraham Lincoln instead of the correct James Madison because Grep returns matches in file order with no relevance ranking.

Example 7 shows a success case: an exact full‑name query (“Walter W. Arndt”) pins the right Wikipedia page, while a dense‑retrieval model confuses two similarly named individuals and fails to answer.

Example 8 illustrates Grep’s brittleness to diacritics; searching “Édouard Vaillant” yields no hits, yet a dense retriever finds the correct passage and returns the answer “Vierzon”.

The appendix supplies synthetic multi‑turn trajectories that teach the DCI agent how to chain shell commands across hops while preserving coherent reasoning.

SFT Example 1 demonstrates a three‑hop intersection: the agent finds “Fetteresso Castle” in Scotland, then “Cowie Castle” in Scotland, and finally confirms both lie in the same country, answering “Scotland”.

SFT Example 2 performs a two‑hop composition: it first resolves the 34th U.S. president (Dwight D. Eisenhower) and then discovers that surgeon R Adams Cowley invented the prototype pacemaker used by that president.

SFT Example 3 bridges two roles across three hops: the agent retrieves that Tuppence Middleton played Iris Carr in *The Lady Vanishes* and also portrayed Riley Blue in *Sense8*, answering with her name.

**Backward command – first attempt** `CODEBLOCK_0`

Conclusion

We recap GrepSeek’s gains, outline its limits, and chart future research and dataset plans.

GrepSeek eliminates the need for pre‑computed retrieval indices by executing Unix shell commands directly on raw corpora, achieving strong performance on multi‑hop reasoning benchmarks.

Its lexical precision enables strict entity‑level constraints, allowing it to succeed where dense embedding models suffer from semantic conflation.

The optimized sharded‑parallel execution engine dramatically reduces runtime memory and removes the costly offline indexing stage required by dense retrieval systems.

Nevertheless, a purely lexical approach is sensitive to surface‑form variations such as diacritics and provides no semantic relevance ranking.

Future directions include hybrid architectures that blend Direct Corpus Interaction with traditional index‑based retrieval, richer shell primitives (fuzzy matching, advanced regex), and techniques to cut decoding overhead for faster inference.

We also intend to broaden our evaluation to long‑form question answering, ad‑hoc document retrieval, and retrieval from previously unseen corpora.

Our experimental suite comprises seven knowledge‑intensive benchmarks, split into single‑hop (Natural Questions, TriviaQA, PopQA) and multi‑hop (HotpotQA, 2WikiMultihopQA, MuSiQue) datasets.

These datasets, sourced from the FlashRAG repository, provide a standardized protocol for assessing both fact retrieval and complex reasoning capabilities.

We acknowledge support from the Center for Intelligent Information Retrieval, the Office of Naval Research, the National Science Foundation, and Google.org.

All cited works are listed in the References section.

Implementation Details

Implementation specifics of GrepSeek, covering datasets, prompts, and corpus processing.

We train the GrepSeek agent on a combined set of seven open‑domain QA benchmarks (NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, MuSiQue, Bamboogle), totaling 169 615 training examples. The split is strict: all training data come from the in‑distribution benchmarks, while evaluation uses the official test splits of all seven datasets, amounting to 51 713 queries.

Performance is measured with Exact Match (EM) and token‑level F1. EM counts predictions that exactly equal the gold answer after lowercasing and punctuation removal; F1 computes the harmonic mean of precision and recall over tokens, taking the maximum over multiple references when available.

**DCI Agent System Prompt** You are a research agent that searches a Wikipedia corpus to answer multi-hop questions by issuing shell commands. Corpus file: corpus.jsonl Treat the corpus as a large text file with one Wikipedia passage per line. The file has 21 million lines, so always cap output (use `| head -n 3` for narrow searches, up to `| head -n 8` when you need to scan more chunks of the same article). Useful command patterns: - Substring search: `rg -F "distinctive phrase" corpus.jsonl | head -n 3` - AND-narrow with a second grep: `rg -F "phrase1" corpus.jsonl | rg -i -F "phrase2" | head -n 3` - Count first, then narrow if too many results: `rg -F "pattern" corpus.jsonl | wc -l` - Useful flags: -F (fixed string, no regex), -i (case-insensitive), -w (whole word). Allowed shell tools: rg, grep, find, sed, awk, head, tail, cat, ls, wc, sort, cut, uniq, tr. You MAY pipe with |. You MAY NOT use redirection (>, <), command chaining (; && ||), or command substitution ($$... $, `...`). For every step, write a short paragraph of reasoning in plain prose (2-5 sentences) -- what you have learned from prior tool outputs, what's still missing, what you'll search for next. Then output exactly one of: &lt;`tool_call`&gt; {"name": "shell", "arguments": {"command": "your single-pipeline shell command, no newlines"}} &lt;/`tool_call`&gt; OR (only when you have enough information to answer): &lt;answer&gt; the final answer (concise -- typically a short noun phrase, name, or date) &lt;/answer&gt; Always reason first, then exactly one action block. Do not skip the reasoning.

Gre​pSeek’s prompting pipeline consists of three phases. Phase A uses an Answer‑Aware Tutor to decompose a multi‑hop question into ordered sub‑queries and iteratively refines a backward retrieval trajectory; Phase B switches to an Answer‑Blind Planner that proposes forward‑acting commands conditioned on the interaction history; Phase C finally generates the answer from the verified trajectory and passes it through a Trajectory Coherence Judge.

**Figure 7.** System prompt describing the corpus and allowed shell tools.

**Figure 11.** Prompt for extracting the bridging entity required for backward chaining.

To avoid the sequential bottleneck of standard Unix tools on a 14 GB JSONL corpus, GrepSeek first shards the file along line boundaries (e.g., via split -d -n l/S), guaranteeing that each JSON record stays intact within a shard.

At inference time a thread pool launches independent subprocesses for each shard, executing the same shell pipeline in parallel. The engine then merges the per‑shard results according to a deterministic classification: CONCAT for fully stateless pipelines, HEAD for pipelines ending with head -n K, COUNT for pipelines ending with wc -l, and SORTHEAD for pipelines that sort then truncate.

**Algorithm 2** Sharded-Parallel Corpus Search **Input:** command containing pipe (|) $c$; corpus $\mathcal{C}$ split into $S$ contiguous shards $C_i$ s.t. $\uplus_i C_i = \mathcal{C}$ **Output:** Byte-exact output identical to sequential execution of $c$ on $\mathcal{C}$ 1: $(s_1, \dots, s_m) \leftarrow \text{DECOMPOSE}(c)$ $\quad \triangleright$ Split pipeline into $m$ distinct stages based on pipe operator(|) 2: $(\tau, N) \leftarrow \text{CLASSIFY}(s_1, \dots, s_m)$ $\quad \triangleright$ Classify the type of reduction 3: **if** $\tau = \text{SEQUENTIAL}$ **then** return $\text{EXEC}(c, \mathcal{C})$ $\quad \triangleright$ Run on full corpus if it can only be sequentially 4: **parallel for** $C_i \in \mathcal{C}$ **do** $R_i \leftarrow \text{EXEC}(c, C_i)$ $\quad \triangleright$ Perform the operation on each shard in parallel 5: **if** $\tau = \text{CONCAT}$ **then** return $\uplus_i R_i$ $\quad \triangleright$ Concat result of operations 6: **else if** $\tau = \text{HEAD}$ **then** return $\text{TOP}(\uplus_i R_i, N)$ $\quad \triangleright$ Concat result of operations and select top $N$ 7: **else if** $\tau = \text{COUNT}$ **then** return $\sum_i \text{Int}(R_i)$ $\quad \triangleright$ Add results of operations 8: **else if** $\tau = \text{SORTHEAD}$ **then** return $\text{TOP}(\text{Merge}(R_1, \dots, R_S), N)$ $\quad \triangleright$ Merge results based on value and return top $N$ 9: **function** $\text{CLASSIFY}(s_1, \dots, s_m)$ $\quad \triangleright$ $m$ is the final stage of the pipeline 10: **if** $s_1 \notin \{rg, \text{grep}\} \lor \exists s_j \text{ s.t. } \text{UNSAFE}(s_j)$ **then** 11: return $(\text{SEQUENTIAL}, \emptyset)$ $\quad \triangleright$ Reject non-search or cross-line context 12: **end if** 13: $\mathcal{S} \leftarrow \{s_i \in (s_1, \dots, s_m) \mid \text{STATELESS}(s_i)\}$ $\quad \triangleright$ e.g., cut, tr, line-wise sed 14: **if** $s_m = \text{head -n } N \land \forall j < m, s_j \in \mathcal{S}$ **then** 15: return $(\text{HEAD}, N)$ $\quad \triangleright$ Pattern: Stateless maps $\rightarrow$ early termination 16: **else if** $s_m = \text{wc -l} \land \forall j < m, s_j \in \mathcal{S}$ **then** 17: return $(\text{COUNT}, \emptyset)$ $\quad \triangleright$ Pattern: Stateless maps $\rightarrow$ global line count 18: **else if** $s_{m-1} \in \{\text{sort, sort|uniq}\} \land s_m = \text{head -n } N$ **then** 19: return $(\text{SORTHEAD}, N)$ $\quad \triangleright$ Pattern: Stateless maps $\rightarrow$ Top-$K$ filter 20: **else if** $\forall j \leq m, s_j \in \mathcal{S}$ **then** 21: return $(\text{CONCAT}, \emptyset)$ $\quad \triangleright$ Pattern: Entire pipeline is purely stateless 22: **else** 23: return $(\text{SEQUENTIAL}, \emptyset)$ $\quad \triangleright$ Revert unsupported complex pipelines 24: **end if** 25: **end function**

Read the original paper

Open the simplified reader on Paperglide