TRL-BENCH: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders

Wei Pang, Xiangru Jian, Hehan Li, Zhixuan Yu, Alex Xue, Jinyang Li, Zhengyuan Dong, Xinjian Zhao, Hao Xu, Chao Zhang, Reynold Cheng, M. Tamer Özsu, Tianshu Yu

TRL-BENCH standardizes tabular encoder evaluation by isolating reusable representation quality from task-specific wrappers.

How do heterogeneous tabular encoders compare when evaluated under a unified, representation-level protocol rather than task-specific end-to-end pipelines?

Tabular encoders are typically evaluated within end-to-end pipelines, making it impossible to isolate the quality of the representation itself from the specific task wrapper or training budget. TRL-BENCH standardizes this by requiring models to export row, column, or table embeddings, which are then evaluated using shared, lightweight downstream probes across three multi-granular suites. Across 20 models and 16 tasks, the benchmark reveals that encoder quality is capability-specific rather than universal, and that top-performing pipelines rely on non-additive compositional fit between specialized encoders.

Paper Primer

The benchmark addresses the "comparability gap" in tabular representation learning (TRL). By forcing heterogeneous models—ranging from generic text encoders to tabular specialists—to export their internal representations, the authors decouple the encoder's inherent signal from the noise of task-specific fine-tuning.

TRL-BENCH operates as a modular evaluation protocol: it treats retrieval, schema alignment, and row matching as atomic capabilities. The core move is the use of fixed, lightweight downstream heads (linear probes and MLPs) that act as a "stress test" for the exported embeddings, ensuring that performance reflects the encoder's ability to capture reusable signal rather than the head's capacity to overfit.

Encoder performance is capability-specific, not captured by a single leaderboard.

Generic text encoders dominate tasks with strong surface-text signals, while tabular specialists outperform them only when their pretraining objectives align with the specific task geometry (e.g., union search or schema matching). In TRL-CTBENCH, generic text encoders like BERT show a normalized rank (NR) of 0.000 on schema tasks but degrade to 0.397 on grounding tasks, where structural tabular models take the lead.

Compositional fit determines end-to-end pipeline quality in data-lake enrichment.

In TRL-DLTE, the best pipelines are hybrids that combine capability-matched specialists; simply assembling the top-ranked model from each stage (retrieval, alignment, matching) results in significantly lower performance than optimal compositional pairings. The best dev-selected hybrid pipeline (TUTA/GTE/GTE) achieves a UJ-H score of 0.229, compared to 0.134 for the assembly of the top marginal leaders.

Why is this benchmark needed if we already have end-to-end tabular task benchmarks?

Existing benchmarks conflate the encoder's quality with the task-specific wrapper and fine-tuning process. TRL-BENCH isolates the representation, allowing researchers to determine if a model's success is due to its architecture or merely its task-specific adaptation.

Does this benchmark apply to all tabular models?

It applies to any model that exposes reusable row, column, or table embeddings, either natively or via a natural extraction point. It excludes generative table LLMs that do not provide such an interface and heavily fine-tuned systems that are strictly end-to-end predictors.

Researchers should shift from seeking a "universal" tabular encoder to optimizing for compositional fit, as the best downstream systems are built by matching specialized encoders to specific stages of the tabular pipeline.

Abstract

TRL‑BENCH unifies evaluation of tabular encoders by probing frozen embeddings across three benchmark suites.

TRL‑BENCH is a multi‑granular benchmark that standardizes representation‑level evaluation of tabular encoders by exporting row, column, and table embeddings to shared lightweight probing heads. It comprises three suites—TRL‑CTBENCH (column/table), TRL‑RBENCH (row), and TRL‑DLTE (full compositional pipeline)—and provides curated assets including 50 OpenML tables, 16 row‑pair rewrites, and a 47,772‑table DLTE lake. Across 20 models and 16 tasks the results show that encoder quality is capability‑specific, with generic text encoders leading on surface‑text signals, tabular specialists excelling when pretraining aligns, and the best end‑to‑end pipelines requiring non‑additive composition of matched specialists.

Introduction

We expose the gap in comparing tabular encoders by proposing a shared protocol that evaluates frozen embeddings.

Tables are the canonical structure for storing relational data, and recent work has produced powerful row, column, and table encoders. In encode‑once, reuse‑many scenarios, the embedding itself—not the downstream predictor—is the true object of evaluation. Yet most prior evaluations embed the encoder inside task‑specific pipelines, conflating encoder quality with wrapper choices and training budgets.

Evaluating encoders inside end‑to‑end pipelines hides their intrinsic representation power, so a shared protocol that freezes embeddings is needed to compare heterogeneous tabular models fairly.

TRL‑BENCH operationalizes this protocol by exporting row, column, or table embeddings from each model and feeding them to lightweight shared readouts. The benchmark defines five atomic capabilities—retrieval, schema alignment, linkage, prediction, and grounding—and bundles them into three suites: TRL‑CTBENCH (column/table transfer), TRL‑RBENCH (row transfer), and TRL‑DLTE (compositional data‑lake enrichment).

Across 20 models and 16 tasks, three patterns emerge: capability‑specific transfer, a split between within‑table prediction and noisy cross‑table linkage, and the superiority of hybrid pipelines that match capabilities. These findings motivate our four contributions: a standardized cross‑paradigm protocol, a comprehensive benchmark covering 87 datasets, curated assets and task reformulations, and an empirical study revealing no universal tabular representation.

Related Work and Positioning

We position TRL‑BENCH among prior tabular model families and benchmarks.

Prior work splits tabular models into distinct families and evaluates them with task‑specific pipelines.

Tabular encoders are grouped by the granularity of the representations they emit—row‑level, column/table‑level, or higher‑level text‑style embeddings.

Models that produce a vector for each table row, optimized for supervised prediction tasks.

Encoders that emit embeddings for individual columns or the entire table, capturing schema and relational information.

Standard language models applied to linearized table text, yielding sentence‑level embeddings.

Models pretrained on large collections of tables to learn reusable table‑specific representations.

Hybrid architectures that combine a table encoder with a language model to produce joint embeddings.

Encoders that explicitly model table structure (rows, columns, cells) using graph or transformer‑style mechanisms.

Models that focus on column‑wise semantics, typically for type inference or column clustering.

Table 1 reports performance of these families across Schema, Join, Union, and Grounding tasks, highlighting the strong results of the Generic Text family.

The TRL-BENCH Protocol

Describes the standardized protocol and three benchmark suites used to evaluate tabular encoders.

The benchmark asks a single comparability question: once heterogeneous tabular encoders are evaluated under a shared representation‑level protocol, how do they differ across rows, columns, and tables, and how do they compose end‑to‑end?

We treat each encoder as a black box that emits a fixed‑size embedding for rows, columns, or whole tables; downstream tasks then probe these embeddings with lightweight heads.

A collection of 13 column‑ and table‑level tasks that evaluate how well exported embeddings support schema understanding, joinability, unionability, and grounding.

A row‑level suite that measures whether a single row embedding can be reused for multiple target columns and for cross‑table entity matching.

An end‑to‑end pipeline that tests whether row, column, and table embeddings can be composed to reconstruct a fragmented parent table from a data lake.

Lightweight downstream modules that are either completely training‑free or trained on a tiny probe head, keeping the evaluation focus on the exported representation.

Define a set of source tables and fragment them according to the suite (column, row, or table granularity).

Run each tabular encoder to export the required embeddings $E_{\text{col}}$, $E_{\text{row}}$, $e_{\text{tbl}}$.

Apply the appropriate frozen probing module (training‑free, learned, or query‑conditioned) to each task.

Compute task‑specific metrics (accuracy, F1, UJ‑H) and aggregate per‑suite scores.

Report the strongest aggregation for models that support multiple table‑level pooling strategies.

Train a linear probe on the three embeddings with labels {numeric, date, text} for 5 epochs.

Train a shallow MLP (hidden size 256) on the same data for 5 epochs.

Average the two probe accuracies (linear = 92 %, MLP = 95 %) to obtain the final task score of 93.5 %.

This concrete run shows how the protocol isolates representation quality: the same encoder is evaluated with both a purely linear readout and a modestly nonlinear one, and the averaged score reflects both linear accessibility and slight nonlinearity.

**Figure 1.** TRL-Bench at a glance. Each model is processed once through its supported wrapper to export row-, column-, or table embeddings, and shared lightweight modules then evaluate those embeddings across TRL-CTBENCH (schema, joinability, unionability, grounding), TRL-RBENCH (row prediction, record linkage), and TRL-DLTE (multi-stage data-lake enrichment).

**Figure 2.** Curation of TRL-RBENCH row-prediction tables and assembly of the TRL-DLTE lake. (a) Row prediction curation: 158 candidate tables filtered through rule screening, degeneracy audit, and human review with label repair into 50 tables with 123 targets. (b) DLTE lake assembly: 1,379 TabFact/WTQ parents fragmented into seed queries and union/join targets at four noise tiers. 11,032 targets are embedded alongside 36,740 CKAN distractors in a 47,772-table lake.

Empirical Results and Analysis

We evaluate frozen tabular embeddings across benchmarks to expose their true representation-level performance.

Capability‑matched hybrid pipelines outperform the best monolithic encoder by a sizable margin.

The dev‑selected hybrid TUTA/GTE/GTE achieves 0.229 UJ‑H, which is 0.090 higher than the top monolithic BERT/BERT/BERT pipeline (0.139 UJ‑H).

Column‑ and table‑level experiments reveal two clear patterns. First, generic‑text encoders excel when task signals reside in headers or short cell strings, while they fall behind specialist models on tasks requiring cross‑table alignment. Second, models whose pretraining objective aligns with the downstream task (e.g., contrastive objectives for union search) consistently outperform those relying on surface text alone.

Row‑level results show three complementary observations. Prediction quality and linkage performance separate by model family: prior‑based TABICL dominates prediction, whereas transfer‑oriented encoders lead linkage. Moreover, intra‑table transfer (row prediction) benefits models trained per target table, while inter‑table transfer (record linkage) favors models that share a single representation across tables. Finally, geometric diagnostics such as embedding anisotropy and effective rank correlate strongly with linkage and regression utility respectively.

Compositional analysis of the DLTE benchmark uncovers three salient findings. Hybrid pipelines that match capabilities across stages beat any single‑encoder reuse, confirming that stage‑wise strength alone does not guarantee optimal composition. Marginal scores capture average main effects but lose critical compatibility information, as seen by the 0.097 UJ‑H gap between marginal assemblies and the best pipeline. Lastly, the row‑model ranking in DLTE aligns closely with the RBench robust linkage ranking, indicating a shared identity‑resolution capability.

**Figure 3.** **DLTE pipeline landscape** (test split). Axes sort by per-stage marginal $UJ-H$. Color encodes end-to-end $UJ-H$. Halo boxes mark marginal and dev-selected compositions.

**Table 4.** A cross-paradigm empirical study: across 20 models and 16 tasks, we show that no single pretraining recipe behaves as a universal tabular representation, and we identify structural gaps in model choice, transfer scope, and pipeline composition that single-paradigm or single-granularity evaluations cannot isolate.

Appendix

Appendix details the benchmark scope, model inventory, protocol adaptations, and diagnostic analyses.

TRL‑BENCH evaluates frozen, exported embeddings across a shared protocol rather than reporting end‑to‑end system scores; the benchmark therefore answers a narrower question about representation‑level capability.

Section B expands the related‑work narrative, distinguishing row‑level, column‑ and table‑centric representation learning, and broader tabular ML threads, and explains why these prior works are not direct baselines for the multi‑granular transfer focus of TRL‑BENCH.

Section C lists the 20 models in the inventory, their granularity support, family, adaptation regime, pretraining source, hidden dimension, and parameter count (see Table 5).

Section D specifies that each encoder is evaluated under its documented public wrapper and preprocessing regime; the benchmark does not force a common serialization because that would push most models outside their validated operating range.

Section E enumerates protocol adaptations required for frozen multi‑granular transfer, such as table‑disjoint splits for pair tasks, the TUS‑hard variant for union search, row‑disjoint strict‑test for record linkage, removal of label‑equivalent columns, and the use of UJ‑H as the primary DLTE metric.

Section F provides the full dataset inventory, including counts of rows, columns, and cell footprints for each suite (Table 14) and the joint density visualisations in Figures 5 and 6.

Section G describes the task‑local baselines (Random, TF‑IDF, embedding‑free matchers, Chance, Dummy) and their roles as controls for probing whether the downstream pipeline or the encoder provides the signal (Table 24).

Section H defines all task metrics (F1, AUROC, nRMSE, MAP, R@GT, MRR, Accuracy, UJ‑H) and the normalized‑rank aggregation used to compare models across heterogeneous tasks (Equation 1).

Section I explains how table‑footprint coverage is measured (row count × column count) and how distinct roles (feature tables, lake tables, query tables) are counted once per role (Figure 5, Figure 6, Table 14).

Section J repeats the same coverage analysis for the omitted figures, confirming that the joint density and marginal distributions fully characterize the benchmark’s input space.

Section K presents CTBench diagnostics: lexical‑accessibility proxies (Table 15) show where generic text encoders excel, probe‑head complexity (Table 16) isolates linear versus MLP contributions, and aggregation ablations (Tables 17‑18) reveal task‑dependent preferences for CLS, COL‑MEAN, or TOK‑MEAN.

Section L details RBench diagnostics: regime‑wise normalized‑rank effects (Table 25), embedding‑dimension scaling for record linkage (Table 26), probe‑head comparisons (Table 27), and strict‑test ablations confirming that row overlap does not alter model rankings (Table 29).

Section M specifies the full DLTE pipeline (Algorithm 1) and its three stages: table retrieval (FAISS inner‑product index), column alignment with Hungarian assignment and calibrated thresholds (Table 35), and row matching using CSLS with reciprocal‑top‑1 filtering and profile‑driven union/join operations (Tables 36‑38).

Subsection M.1 describes Stage‑1 retrieval: L2‑normalised table embeddings are indexed with FAISS IndexFlatIP and the top K = 100 candidates are returned.

Subsection M.2 details Stage‑2 alignment: Hungarian assignment on cosine distances yields matched similarities; five thresholds ($\tau_{\text{floor}},\tau_u,\tau_{\text{us}},\tau_{\text{jm}},\tau_{\text{ks}}$) are grid‑searched per (Stage‑1, Stage‑2) pair to maximise macro‑F1 (Table 35).

Subsection M.3 explains Stage‑3 row matching: CSLS similarity (k = 5) is computed, reciprocal top‑1 pairs are filtered by standardized confidence scores, and union or join merges are applied according to the highest‑confidence candidate (Tables 36‑38).

Compute F1 on the original pair‑random split: 0.736.

Re‑evaluate the same model on a table‑disjoint split, enforcing that no table appears in both train and test: 0.523.

Observe the $\Delta$F1 = 0.736 − 0.523 = 0.213, indicating a substantial loss of cross‑table generalisation.

The table‑disjoint protocol reveals that many encoders overfit to intra‑table patterns; only models that capture truly transferable representations maintain performance under this stricter split.

Conclusion

TRL‑BENCH shows how frozen embeddings let diverse tabular encoders be compared directly.

TRL‑BENCH reframes tabular encoder evaluation around the artifact many downstream systems actually reuse: exported embeddings. Under a single representation-level protocol, heterogeneous encoders become comparable without conflating their embeddings with task‑specific wrappers, retraining budgets, or adaptation. By making this setting measurable across diverse encoders, datasets, and downstream uses, TRL‑BENCH provides a common reference point for building tabular models as portable representation learners rather than one‑off task solvers.

Read the original paper

Open the simplified reader on Paperglide