OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources

OmniRetrieval unifies access to heterogeneous knowledge sources by dispatching native queries instead of flattening data into a shared embedding space.

How can a single LLM-based system effectively retrieve information across heterogeneous knowledge sources that require different query languages (SQL, SPARQL, Cypher, Search)?

Real-world questions often require data from diverse sources like relational databases, knowledge graphs, and unstructured text, but existing retrieval systems are siloed into single-modality interfaces. OmniRetrieval solves this by acting as an overarching layer that identifies relevant sources and dispatches queries in each source's native language—SQL, SPARQL, Cypher, or free-form text—preserving the structural affordances of the original data. Across a benchmark of 309 distinct knowledge bases, this framework consistently outperforms single-source baselines while maintaining high accuracy in source selection and query generation.

Paper Primer

The core move is to avoid "homogenization"—the common practice of projecting all data into a single dense embedding space, which erases structural nuances like schemas and joins. Instead, OmniRetrieval treats each source as a black-box engine: it selects the right source, generates a native query conditioned on that source's specific structural context, and consolidates the heterogeneous results.

OmniRetrieval consistently exceeds single-source baselines across all four backend types.

Macro-averaged performance across 13 datasets and 309 knowledge bases, comparing the framework against paradigms constrained to a single modality. The framework successfully coordinates access to diverse backends (SQL, SPARQL, Cypher, and unstructured text) without requiring a shared encoder or retraining for new sources.

Why is a unified embedding space insufficient for this problem?

Projecting diverse sources into a shared space creates a "modality gap" where embeddings cluster by source type rather than semantic content, and it discards the native query operations (like joins or path traversals) that give structured sources their expressive power.

How does the framework handle the addition of new knowledge sources?

Adding a new source is a matter of registration alone; because the framework interacts with sources via their native interfaces, it requires no shared encoder to retrain and no embedding space to redraw.

Researchers can now build retrieval systems that treat heterogeneous knowledge as a unified ecosystem without sacrificing the structural integrity of the underlying databases.

Introduction and Motivation

We expose the core retrieval gap caused by incompatible query languages across diverse knowledge sources.

Real‑world questions often require evidence from multiple, structurally distinct knowledge sources, yet each source speaks its own query language. Flattening all sources into a single embedding is like reducing a Swiss‑army knife to a plain screwdriver – you lose the specialized tools each source offers.

Different knowledge bases expose data through incompatible query languages (SQL, SPARQL, Cypher, free‑text search), so a single retriever cannot address a query that spans more than one source.

SQL query selects rows where state = ‘CA’ and department = ‘cardiology’ – returns 12 hospitals.

Free‑text search over the literature index retrieves 45 papers matching “COVID‑19 2022”.

SPARQL pattern finds 8 research projects linked to the selected hospitals.

Intersecting the three result sets yields 3 hospitals that satisfy all constraints.

Even this tiny example already requires three distinct native queries; a single 200 GB unified embedding would be wasteful and would lose the relational and graph constraints that make the answer precise.
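The final intersection step can be sketched in a few lines of Python. This is only an illustration: the hospital identifiers are invented placeholders (the example gives result counts, not contents), and it assumes the papers and projects have already been mapped back to the hospitals they mention or are linked to.

```python
# Hypothetical hospital IDs; the worked example specifies only counts.
sql_hospitals = {f"H{i}" for i in range(1, 13)}        # 12 hospitals from the SQL query
paper_hospitals = {"H2", "H5", "H9", "H20", "H31"}      # hospitals mentioned in retrieved papers
project_hospitals = {"H2", "H5", "H9", "H11"}           # hospitals linked to SPARQL projects

# A hospital qualifies only if every source supports it.
matching = sql_hospitals & paper_hospitals & project_hospitals
print(sorted(matching))
```

Set intersection works here only because each source's result was produced by its own native query first; the join and graph constraints are applied inside the engines, not approximated afterwards.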

OmniRetrieval addresses the gap by first selecting the relevant sources for a given natural‑language query, then generating a native query for each source, executing them, and finally consolidating the results into a unified evidence set.

**Figure 1.** Different knowledge sources offer distinct structural affordances and query languages (left); OmniRetrieval meets each on its own terms via source selection, native query formulation, and cross-source condensation (right).

The fundamental incompatibility of query languages across knowledge sources prevents a single retriever from answering multi‑source questions.

The OmniRetrieval Framework

OmniRetrieval routes a user question to the right backend and returns unified evidence.

OmniRetrieval stitches together heterogeneous backends by routing each user question to the right source and letting each source answer in its own language – like a multilingual concierge that forwards a request to the specialist who speaks that language.

How does OmniRetrieval differ from a single unified query language that pretends all sources speak the same syntax?

OmniRetrieval never forces a source to translate its data model; instead it lets each backend answer in its own language, so joins stay joins in SQL, graph traversals stay traversals in SPARQL or Cypher, and free‑text search stays BM25‑style matching. A single unified language would have to approximate those operators, losing precision.

The LLM reads the question “How many books did Alice write?” and all three descriptors.

It ranks $b_1$ and $b_2$ as most relevant, returning $S=\{b_1,b_2\}$.

For $b_1$, the prompt template produces the SQL query `SELECT COUNT(*) FROM Books WHERE author = 'Alice'`.

For $b_2$, the template produces the SPARQL query `SELECT (COUNT(?book) AS ?cnt) WHERE { ?book ex:author "Alice" }`.

Each query is executed, yielding counts $5$ (SQL) and $5$ (SPARQL).

The selector merges the two identical counts into the final evidence set $\mathcal{E}=\{5\}$.

Even though the sources differ wildly, the framework can combine their answers because each native query respects the source’s own data model.
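The merge of the two counts can be illustrated with a small consolidation helper. This is a simplified stand-in, not the selector described in the text (which is LLM-based); the function name `consolidate` and the source labels are assumptions made for the sketch.

```python
from typing import Any, Dict, List

def consolidate(results: Dict[str, Any]) -> Dict[Any, List[str]]:
    """Group identical answers and remember which sources produced them."""
    merged: Dict[Any, List[str]] = {}
    for source, value in results.items():
        merged.setdefault(value, []).append(source)
    return merged

# Counts returned by the two native queries in the example.
results = {"b1 (SQL)": 5, "b2 (SPARQL)": 5}
agreement = consolidate(results)
evidence = set(agreement)  # the deduplicated evidence set
print(evidence, agreement)
```

Keeping the source list alongside each value makes agreement between backends visible, which a plain set would hide.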

Source Selection reads every source’s structural descriptor together with the user question and lets a large language model rank which sources are likely to contain the answer.

Why not embed each descriptor into a shared vector space and rank by cosine similarity?

Descriptors have different syntactic forms (SQL schema vs. RDF ontology vs. free‑text summary). A single encoder would have to flatten them, losing the precise tokens that signal relevance—e.g., a table name “Orders” versus a predicate “orderOf”. The LLM keeps the raw text, preserving those signals.

The LLM receives the concatenated string of $q$ and the three descriptors.

It scores $c_{1}$ highest (direct table match), $c_{2}$ second (semantic match via “Invoice”), and $c_{3}$ lowest.

With $k=2$, it returns $S=\{c_{1},c_{2}\}$.

Both sources are passed to the query‑generation stage.

Even though $c_{3}$ mentions “finance”, the model correctly discards it because the question explicitly needs a structured amount field.
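One way to picture this step is as prompt construction followed by parsing of the model's ranked reply. The sketch below assumes a comma-separated reply format; the descriptor texts, the question, and both function names are hypothetical, and the LLM call itself is replaced by a hard-coded string.

```python
from typing import Dict, List

def build_selection_prompt(question: str, descriptors: Dict[str, str], k: int) -> str:
    """Concatenate the question with every raw descriptor, unencoded."""
    parts = [f"Question: {question}", "", "Sources:"]
    for source_id, descriptor in descriptors.items():
        parts.append(f"[{source_id}] {descriptor}")
    parts.append("")
    parts.append(f"List the {k} most relevant source ids, best first, comma-separated.")
    return "\n".join(parts)

def parse_selection(reply: str, descriptors: Dict[str, str], k: int) -> List[str]:
    """Keep at most k ids, dropping any id that is not a registered source."""
    ids = [token.strip() for token in reply.split(",")]
    return [i for i in ids if i in descriptors][:k]

# Hypothetical descriptors in three different syntactic forms.
descriptors = {
    "c1": "SQL table Orders(order_id, customer, amount)",
    "c2": "RDF ontology with class Invoice and predicate orderOf",
    "c3": "Free-text summary: finance news articles",
}
prompt = build_selection_prompt("What is the total amount of order 42?", descriptors, k=2)
selected = parse_selection("c1, c2, c3", descriptors, k=2)  # stand-in for the LLM reply
print(selected)
```

Because the descriptors are passed as raw text, tokens such as `Orders` and `orderOf` reach the model unchanged.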

Native Query Translation uses a per‑source prompt template so the same LLM can emit a valid SQL, SPARQL, or Cypher query that references the source’s own tables, predicates, or relationship types.

Does the LLM need to know the full schema to write a correct SQL query?

No. The prompt supplies the relevant portion of the schema (e.g., the table and column names needed for the question). The LLM can therefore generate a correct query without loading the entire database definition.

The prompt template inserts the question, the schema description, and the instruction “output a PostgreSQL query”.

The LLM returns: `SELECT title FROM Books WHERE author = 'Bob';`.

The query is sent to $\text{Exec}(b,\hat{q}_b)$, which returns rows {(‘Deep Learning’), (‘AI Foundations’)}.

The generated query directly references the column `author`, showing that the LLM can ground the natural‑language request in the exact schema elements.
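The template-then-execute flow can be tried end to end with Python's built-in `sqlite3` module. Several things here are assumptions: SQLite stands in for PostgreSQL, the template wording and the question text are invented, the extra Alice row is filler, and the LLM reply is hard-coded to the query shown in the example.

```python
import sqlite3

# Hypothetical per-source template; only the relevant schema slice is supplied.
TEMPLATE = (
    "Schema:\n{schema}\n\n"
    "Question: {question}\n"
    "Output a PostgreSQL query."
)
prompt = TEMPLATE.format(
    schema="Books(title TEXT, author TEXT)",
    question="Which books did Bob write?",  # assumed wording of the question
)

# Stand-in for the LLM's reply, copied from the example.
generated_sql = "SELECT title FROM Books WHERE author = 'Bob';"

# In-memory SQLite database playing the role of the source engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Books (title TEXT, author TEXT)")
conn.executemany(
    "INSERT INTO Books VALUES (?, ?)",
    [("Deep Learning", "Bob"), ("AI Foundations", "Bob"), ("Graph Theory", "Alice")],
)
rows = conn.execute(generated_sql).fetchall()
conn.close()
print(rows)
```

The query carries no `ORDER BY`, so the row order should not be relied on when comparing results.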

The framework treats SQL, SPARQL, Cypher, and free‑text search as distinct retrieval paradigms, each exposing its native operators (joins, triple patterns, path traversals, or BM25) to the downstream evidence selector.

Could we translate a SPARQL query into SQL instead of generating it directly?

In principle yes, but the translation would have to emulate graph pattern matching with joins, which is non‑trivial and often lossy. Generating the native SPARQL query lets the RDF engine handle pattern matching natively, preserving semantics.

SQL query: `SELECT title FROM Books WHERE author = 'Bob';` returns rows.

SPARQL query: `SELECT ?title WHERE { ?book ex:author "Bob" . ?book ex:title ?title . }` returns bindings.

Cypher query: `MATCH (b:Book {author: 'Bob'}) RETURN b.title;` returns one record per matching title.

Search query: the raw text “titles by Bob” is sent to a BM25 index, returning top‑k passages.

Each paradigm yields a different data shape (rows, triples, paths, passages), yet all answer the same information need, illustrating why the selector must understand all four.
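A downstream selector needs some way to look at these shapes side by side. The sketch below shows one possible approach, wrapping each result in a tagged text record; the `Evidence` class, the converter functions, and the sample passage are all assumptions and are not described in the source.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple

@dataclass(frozen=True)
class Evidence:
    backend: str  # which paradigm produced this item
    text: str     # flattened, human-readable content

def from_rows(backend: str, rows: List[Tuple[Any, ...]]) -> List[Evidence]:
    """Relational rows: join column values."""
    return [Evidence(backend, " | ".join(str(v) for v in row)) for row in rows]

def from_bindings(backend: str, bindings: List[Dict[str, Any]]) -> List[Evidence]:
    """SPARQL bindings or Cypher records: keep variable names."""
    return [Evidence(backend, ", ".join(f"{k}={v}" for k, v in b.items())) for b in bindings]

def from_passages(passages: List[str]) -> List[Evidence]:
    """Search hits are already text."""
    return [Evidence("search", p) for p in passages]

evidence = (
    from_rows("sql", [("Deep Learning",), ("AI Foundations",)])
    + from_bindings("sparql", [{"title": "Deep Learning"}])
    + from_bindings("cypher", [{"b.title": "AI Foundations"}])
    + from_passages(["Bob is the author of Deep Learning and AI Foundations."])
)
for item in evidence:
    print(item.backend, "->", item.text)
```

Flattening happens only after each engine has executed its native query, so the joins and traversals have already done their work by the time results are compared.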

Encode the user question $q$ and concatenate it with all source descriptors $\{c_b\}_{b\in\mathcal{B}}$.

Run a long‑context LLM to produce a ranked subset $S = \text{LLMselect}(q,\{c_b\};k)$.

For each $b\in S$, invoke the per‑source prompt template $T_b$ to generate a native query $\hat{q}_b = \text{Generate}_b(q,c_b)$.

Execute each native query with the source’s engine $\text{Exec}(b,\hat{q}_b)$.

Apply the selector $\mathcal{E} = \text{Select}(q,\{\text{Exec}(b,\hat{q}_b)\}_{b\in S})$ to produce the final evidence set.
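The five steps above can be sketched as a single function that takes each stage as a callable. This is a minimal skeleton under stated assumptions, not the authors' implementation: the parameter names are invented, and the toy stand-ins at the bottom replace the LLM and the database engines so the example runs on its own.

```python
from typing import Callable, Dict, List

def omni_retrieve(
    question: str,
    descriptors: Dict[str, str],
    llm_select: Callable[[str, Dict[str, str], int], List[str]],
    generate: Dict[str, Callable[[str, str], str]],
    execute: Callable[[str, str], list],
    select_evidence: Callable[[str, Dict[str, list]], list],
    k: int = 2,
) -> list:
    # Source selection: rank sources from the question plus all descriptors.
    chosen = llm_select(question, descriptors, k)
    results: Dict[str, list] = {}
    for b in chosen:
        # Native query generation, conditioned on this source's descriptor.
        native_query = generate[b](question, descriptors[b])
        # Execution on the source's own engine.
        results[b] = execute(b, native_query)
    # Consolidation into the final evidence set.
    return select_evidence(question, results)

# Toy stand-ins (all hypothetical) so the sketch runs without external services.
descriptors = {
    "b1": "SQL: Books(title, author)",
    "b2": "RDF: ex:author, ex:title",
    "b3": "Text: news articles",
}
generate = {
    "b1": lambda q, c: "SELECT COUNT(*) FROM Books WHERE author = 'Alice'",
    "b2": lambda q, c: 'SELECT (COUNT(?book) AS ?cnt) WHERE { ?book ex:author "Alice" }',
}

def fake_select(q, cs, k):
    return list(cs)[:k]

def fake_execute(b, query):
    return [5]

def fake_select_evidence(q, results):
    return sorted({v for rows in results.values() for v in rows})

evidence = omni_retrieve(
    "How many books did Alice write?",
    descriptors, fake_select, generate, fake_execute, fake_select_evidence,
)
print(evidence)
```

Passing the stages in as callables mirrors the registration idea: supporting another source means adding a descriptor and a generator entry, with no change to the loop.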

Experimental Setup

OmniRetrieval tops all three evaluation metrics on the benchmark.

OmniRetrieval achieves the highest scores on all three evaluation metrics.

Table 3 shows OmniRetrieval attaining 68.58% source selection, 46.62% retrieval, and 69.72% judge accuracy, surpassing every baseline.

**Table 3.** Results for the unified-representation methods (Oguz et al., 2022; Ma et al., 2022) under the constrained setup on GPT-5.4; these methods are marked † as non-comparable.

We evaluate OmniRetrieval on 13 datasets spanning four native backends, totaling 309 knowledge bases. For each dataset we sample 300 questions and measure source selection, retrieval, and judge accuracy, holding the backbone model constant across baselines. The methods compared include single‑backend baselines, KB Routing, the proposed OmniRetrieval, and an Oracle upper bound.

Main Results

OmniRetrieval outperforms all baselines across every retrieval paradigm.

OmniRetrieval attains 67.5% evidence‑selection accuracy when choosing among three candidate sources.

Figure 3 reports 67.5% accuracy for k = 3, surpassing the next best method by a few points.

**Table 1.** Main results, with each metric macro-averaged across the four retrieval paradigms. Best results among the comparable methods are bolded; second best are underlined. Oracle is an upper bound with perfect source selection.

Appendices

The section spells out the heterogeneous‑source retrieval problem and why a routing‑and‑translation pipeline is required.

Real‑world applications must pull information from search indexes, relational tables, RDF graphs, and graph databases, each with its own query language. A naïve approach would require hand‑crafting a separate client for every source, which does not scale. The core challenge is to automatically route a user question to the appropriate backend and then emit a native query that the chosen source can execute.

Given a user question, the router scores each backend for compatibility, then returns a short list of the most promising routes.

The router evaluates language cues: “directed” and “Inception” match a relational schema (SQL) and a movie graph schema (CYPHER) with equal confidence 0.78.

It also finds a weak match in the generic search index (SEARCH) with confidence 0.42.

Sorting scores yields SQL (0.78) and CYPHER (0.78) as the top two; the router emits these two routes in JSON.

Even a simple question can generate multiple high‑confidence routes, illustrating the combinatorial routing space the system must manage.
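The sort-and-emit step of this routing example can be written out directly. The confidence values come from the example; the JSON field names (`routes`, `backend`, `confidence`) are assumptions, since the text does not specify the output schema.

```python
import json

# Confidence scores from the routing example.
scores = {"SQL": 0.78, "CYPHER": 0.78, "SEARCH": 0.42}

# Python's sort is stable, so tied backends keep their original order.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Keep the top two routes and serialize them (hypothetical schema).
routes = [{"backend": name, "confidence": conf} for name, conf in ranked[:2]]
payload = json.dumps({"routes": routes}, indent=2)
print(payload)
```

With a tie at 0.78, a fixed cutoff of two happens to keep both top routes; a threshold on confidence would be an alternative if more backends tied.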
