GENEB: Why Genomic Models Are Hard to Compare

Daria Ledneva, Mikhail Nuridinov, Denis Kuznetsov

GENEB provides a unified, large-scale diagnostic benchmark to resolve the fragmentation and non-comparability of genomic foundation models.

Why is it currently impossible to reliably compare genomic foundation models, and what happens when we evaluate them under a unified, controlled benchmark?

Genomic foundation models are currently evaluated on disjoint benchmarks with incompatible protocols, making it impossible to determine which architectures or pretraining strategies actually drive performance. The authors introduce GENEB, a diagnostic framework that evaluates 40 diverse genomic models on 100 tasks across 13 functional categories using a unified linear-probing protocol. Aggregate leaderboards are unstable: model rankings shift sharply across task categories, and non-scale design choices like architecture and pretraining alignment often outweigh parameter count.

Paper Primer

GENEB isolates representation quality by freezing model weights and using lightweight linear classifiers to assess performance. This approach enables controlled, matched comparisons across architecture, tokenization, and pretraining corpus, effectively stripping away the confounding variables that plague current cross-paper evaluations.

Model scale is a poor proxy for performance in genomic machine learning.

Across 36 in-domain models, 31 instances exist where a model at least 5× smaller outperforms a larger counterpart in aggregate Matthews Correlation Coefficient (MCC). The correlation between parameter count and performance is significant but inconsistent, with non-scale design choices frequently offsetting substantial scale gaps.
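One way to read the "smaller beats larger" count is as a tally over model pairs. The sketch below illustrates that reading only; the function name `count_scale_upsets`, the pairwise definition of an "instance", and the toy values are assumptions, not the authors' procedure or data.

```python
from itertools import combinations


def count_scale_upsets(models, min_ratio=5.0):
    """Return (smaller, larger) pairs where the smaller model scores higher.

    `models` maps a model name to (parameter_count, aggregate_mcc).
    A pair counts only if the larger model has at least `min_ratio`
    times as many parameters as the smaller one.
    """
    upsets = []
    for (name_a, (params_a, mcc_a)), (name_b, (params_b, mcc_b)) in combinations(
        models.items(), 2
    ):
        # Order the pair so that `small` has the lower parameter count.
        small, large = sorted([(params_a, mcc_a, name_a), (params_b, mcc_b, name_b)])
        if large[0] >= min_ratio * small[0] and small[1] > large[1]:
            upsets.append((small[2], large[2]))
    return upsets


# Hypothetical toy values, not taken from the paper.
toy = {
    "small-encoder": (80e6, 0.52),
    "mid-decoder": (500e6, 0.50),
    "large-ssm": (1e9, 0.40),
}
print(count_scale_upsets(toy))
```

Raising `min_ratio` makes the criterion stricter, so only the most extreme size gaps are counted.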

Architecture and pretraining alignment are primary drivers of performance on high-variance tasks.

On tasks where models disagree most (standard deviation > 0.12), multi-species and eukaryotic-gene pretraining capture 32 of 39 top-tier placements, while human-only pretraining and state-space models dominate the bottom tier. Category-level gaps between architectures can be several-fold larger than the aggregate gain from scaling models above 1B parameters.

Why does this benchmark matter if we already have leaderboards for genomic models?

Existing benchmarks are fragmented, use incompatible protocols, and typically evaluate only a small subset of models. GENEB provides a unified reference framework that exposes task-level trade-offs, preventing practitioners from relying on aggregate rankings that mask poor performance in specific biological domains.

What is the scope of GENEB, and where might it fail to provide guidance?

GENEB covers 100 tasks across 13 eukaryotic-skewed categories. It is not a reliable proxy for prokaryotic or viral genomics, and it currently underrepresents tasks requiring long-range regulatory interactions (>10 kb), meaning models with specialized long-context architectures may not be fully exercised.

Practitioners should abandon aggregate leaderboards in favor of category-aware model selection. GENEB demonstrates that for most genomic tasks, pretraining scope and architectural inductive biases are more decisive for success than raw parameter count.

The Fragmented Landscape of Genomic Models

Fragmented benchmarks hide true progress in genomic foundation models.

Progress in genomic foundation models is obscured by fragmented benchmarks, incompatible protocols, and task‑specific reporting, making direct model comparison impossible. The paper’s central premise is that without a unified evaluation like GENEB, claims of superiority are unreliable and highly task‑dependent.

**Figure 1.** Fragmented comparison landscape of genomic foundation models. Each node represents a published model; directed edges denote models explicitly used as baselines or comparators in the corresponding paper. The sparse, disconnected graph reflects the absence of unified cross-model evaluation in genomic machine learning.

The current landscape is a patchwork of isolated benchmark suites and ad‑hoc protocols that prevents apples‑to‑apples model comparison.

The lack of standardized benchmarks prevents meaningful progress in genomic foundation models.

Prior Evaluation Paradigms

A survey of prior genomic model architectures, tokenization, and benchmarks.

Early genomic models mainly used Transformer encoders trained with masked language modeling. More recent work explores decoder‑only and generative architectures for unified sequence modeling and long‑context processing, as well as long convolutions, state‑space models, and hybrid designs that combine multiple paradigms.

Tokenization strategies range from single‑nucleotide and k‑mer vocabularies to learned BPE vocabularies, each balancing resolution against efficiency. Pretraining data varies from human‑only and species‑specific corpora to broad multi‑species and domain‑focused datasets, with prior work reporting benefits for both diversity and specialization depending on the downstream task.

Several benchmarks evaluate genomic foundation models—e.g., Nucleotide Transformer tasks, GUE/GUE+, Genomic Benchmarks, BEND, and DNALongBench—but they differ in task design, evaluation protocols, and typically assess only a limited subset of models, making cross‑paper comparison difficult.

Recent comparative studies usually examine a small number of representative architectures on predominantly human‑centric tasks. Platform‑based efforts such as OmniGenBench provide dynamic leaderboards but still include a limited and evolving set of baselines, leaving many newer DNA‑specific models unevaluated.

GENEB (Genomic Evaluation Benchmark) fills these gaps by evaluating 40 foundation models on 100 DNA classification tasks across 13 functional categories under a unified probing‑based protocol. The resulting performance matrix reveals highly task‑dependent trade‑offs and provides a community reference analogous to MTEB in NLP.

A fuller discussion of prior benchmarks, comparative studies, and architectural trends is provided in Appendix A.

The GENEB Benchmark

We detail the experimental variables, controls, and metrics used to assess model scale versus performance.

We evaluated 40 DNA foundation models on 100 genomic prediction tasks spanning 13 functional categories. All statistics reported below are aggregated Matthews Correlation Coefficient (MCC) scores within the GENEB benchmark. This systematic analysis lets us isolate the impact of model scale, architecture, tokenization, and pre‑training data.
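Since every statistic here is built on MCC, a minimal sketch of the standard binary definition may help. It assumes a two-class task summarized by confusion-matrix counts; the source does not say how multi-class tasks are scored, and the zero-denominator convention below is also an assumption.

```python
import math


def mcc_binary(tp, fp, fn, tn):
    """Matthews Correlation Coefficient from binary confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when a marginal is empty and MCC is undefined.
    return numerator / denominator if denominator else 0.0


print(mcc_binary(50, 0, 0, 50))    # perfect agreement
print(mcc_binary(25, 25, 25, 25))  # chance-level predictions
```

MCC ranges from -1 (total disagreement) through 0 (chance level) to +1 (perfect prediction), which is why a 1-shot MCC near 0.02 is described as near-random.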

GENEB is a unified evaluation suite that measures how well genomic foundation models perform across a broad set of prediction tasks.

Vary model scale (parameter count), architecture type (Transformer, Mamba‑SSM, etc.), tokenization scheme, and pre‑training corpus.

Hold constant the GENEB evaluation protocol: same 100 tasks, same train/validation/test splits, and MCC as the performance metric.

Measure aggregate performance using both macro‑averaged MCC (equal weight per functional category) and micro‑averaged MCC (equal weight per task).

Record Spearman rank correlations between log‑parameter count and MCC for each functional category.

Identify outlier models (e.g., prokaryotic‑only EVO‑1‑131K) and repeat correlation analysis after their removal.
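The two aggregation schemes in the steps above can be sketched as follows. The data layout (task name mapped to a category and a score) and the toy numbers are illustrative assumptions; only the weighting rules come from the text.

```python
from collections import defaultdict
from statistics import mean


def micro_mcc(task_scores):
    """Equal weight per task: plain mean over all task-level MCC values."""
    return mean(score for _, score in task_scores.values())


def macro_mcc(task_scores):
    """Equal weight per category: mean of the per-category mean MCC values."""
    by_category = defaultdict(list)
    for category, score in task_scores.values():
        by_category[category].append(score)
    return mean(mean(scores) for scores in by_category.values())


# Hypothetical scores: one category has three tasks, the other has one.
toy = {
    "histone_1": ("histone", 0.60),
    "histone_2": ("histone", 0.60),
    "histone_3": ("histone", 0.60),
    "methylation_1": ("methylation", 0.20),
}
print(round(micro_mcc(toy), 3), round(macro_mcc(toy), 3))
```

In the toy data the micro-average is pulled toward the category with more tasks, while the macro-average treats both categories equally.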

Across all 40 models, log‑parameter count correlates positively with aggregate macro‑MCC ($\rho = 0.565$, $p < 0.001$). Excluding the prokaryotic‑only outlier EVO‑1‑131K strengthens this relationship to $\rho = 0.685$ ($p < 0.001$), confirming that scale is a substantial predictor of overall performance.
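For readers who want to reproduce this kind of statistic, here is a minimal standard-library sketch of Spearman's rank correlation, computed as the Pearson correlation of average ranks. The parameter counts and scores are hypothetical, and no p-value is computed.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical parameter counts and macro-MCC scores.
params = [86e6, 500e6, 1e9, 3e9]
macro = [0.50, 0.45, 0.55, 0.60]
rho = spearman([math.log10(p) for p in params], macro)
print(round(rho, 3))
```

Because Spearman's $\rho$ depends only on ranks, taking the logarithm of the parameter count does not change the result; the log scale matters for plotting rather than for the correlation itself.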

**Figure 2.** Pareto frontier of model efficiency: macro-MCC vs. parameter count. Each point represents one of the 40 genomic foundation models, with parameter count on a logarithmic x-axis and full-shot macro-average MCC on the y-axis. Marker size and color both encode macro-MCC. The dashed line marks the Pareto frontier of best performance–size trade-offs. Spearman correlation between log(params) and macro-MCC is $\rho = 0.565$ ($p < 0.001$); excluding the prokaryotic-only outlier Evo-1-131k raises this to $\rho = 0.685$ ($p < 0.001$). While scale is a substantial predictor of aggregate performance, several large models fall below the frontier, indicating that architecture and pretraining choices can offset substantial scale differences (see Section 4, Table 1).

A striking counterexample is MUTBERT (86 M parameters, Transformer encoder), which surpasses the much larger ECCDNAMAMBA (1 B parameters, Mamba‑SSM) by $+0.110$ macro‑MCC despite an 11.6× size gap. This demonstrates that architectural choices can outweigh raw scale.

**Figure 3.** Model performance across task groups. Heatmap shows full-shot MCC averaged within each task group for 40 genomic foundation models, sorted by overall full-shot macro-average MCC. Cell values report category-level mean MCC, with colors ranging from red/orange for lower scores to green for higher scores. The results reveal substantial task-level heterogeneity: some categories, such as promoter, coding/non-coding, and species-classification tasks, are consistently easier, whereas DNA methylation, lncRNA, virus/phage, and regulatory tasks remain challenging. This category-specific structure shows that aggregate model rankings can hide important differences in downstream behavior.

Category‑level scaling correlations (Table 1) show significant positive trends in 11 of 13 categories, with Spearman $\rho$ ranging from $0.345$ (DNA methylation) to $0.579$ (histone modifications). The remaining two categories exhibit weak or non‑significant relationships, highlighting that scale benefits are not uniform across tasks.

Isolating Architectural Effects

We isolate how architecture and tokenization affect performance under matched conditions.

To understand how model design drives performance we evaluate only those pairs that share the same pretraining corpus and tokenization scheme. This isolates architectural effects from scale, data, and vocabulary differences.

Even when two models share the same corpus and tokenization and differ only in architecture, an observed performance shift may still partly reflect residual confounds (for example, model size or training duration) rather than the architecture itself.

Select two models that share the same multi‑species pretraining corpus and the same BPE tokenization.

Verify that all remaining hyperparameters (learning rate, batch size, optimizer) are identical.

Run each model on the full GENEB benchmark and compute MCC for every task.

Record the MCC of each model and calculate the difference (Transformer – SSM or Encoder – Decoder).

Repeat the process for tokenization‑only experiments, keeping architecture fixed.

Aggregate results across tasks and report both micro‑ and macro‑averaged gaps.

Both models are trained for the same number of steps with identical optimizer settings.

Evaluation yields macro‑MCC = 0.657 for OMNI‑DNA‑1B and 0.302 for ECCDNAMAMBA.

The gap is 0.657 – 0.302 = **0.355**, indicating a strong Transformer advantage on this cross‑species task.

This single trial illustrates how architecture can dominate performance when the task requires cross‑species generalisation.

Pretraining Corpus Influence

We measure how pretraining corpora influence MCC across matched model configurations.

To isolate the impact of pretraining data we evaluate matched model pairs while holding architecture, tokenization, and size constant (within ±2×). This controlled design removes architectural confounds and lets corpus effects surface in the measured $\Delta$ MCC.

Different pretraining corpora shift model performance, but the direction and magnitude of the shift depend on the downstream task.

Select model pairs that share the same architecture (Transformer encoder or decoder), tokenization scheme (BPE or k‑mer), and model size within a factor of two.

For each pair, pretrain one model on a human‑only corpus and the other on a multi‑species corpus (or the alternative corpus variant).

Train a linear probe on the frozen representations of both models for each GENEB task category using identical hyperparameters.

Compute the Matthews Correlation Coefficient (MCC) for each task and average across tasks to obtain a macro‑MCC per model.

Calculate the per‑category difference $\Delta$ MCC = MCC$_{\text{multi-species}}$ − MCC$_{\text{human}}$ and record the number of pairs where the multi‑species model wins.

Aggregate the $\Delta$ MCC values across all controlled pairs to produce the numbers shown in Table 2.

After probing on the chromatin‑accessibility task, the human‑only model (Model A) attains MCC = 0.71 while the multi‑species model (Model B) attains MCC = 0.77.

The per‑category $\Delta$ MCC is 0.77 − 0.71 = +0.06, matching the +0.062 reported for this category.

Repeating the same procedure for splice‑site prediction yields MCC = 0.68 (A) vs. 0.73 (B), giving $\Delta$ MCC = +0.05.

This concrete trial shows how a modest corpus expansion can produce a measurable MCC gain on tasks that rely on regulatory signals.

**Table 2.** Performance comparison across task categories, showing $\Delta$ MCC and win counts. Green indicates multi-species advantage ($\Delta > +0.02$), gray indicates parity ($|\Delta| \leq 0.02$), and red indicates human advantage ($\Delta < -0.02$).
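The banding used in the table can be sketched as a small helper. The ±0.02 thresholds come from the caption; the function names, the definition of a "win" as any strictly positive difference, and the toy scores are assumptions made for illustration.

```python
def classify_delta(delta, tol=0.02):
    """Map a Delta-MCC value to the band described in the table caption."""
    if delta > tol:
        return "multi-species"
    if delta < -tol:
        return "human"
    return "parity"


def summarize_pairs(pairs):
    """Summarize matched (human_mcc, multi_species_mcc) pairs for one category.

    Returns the mean Delta-MCC, the number of pairs the multi-species
    model wins (assumed here to mean Delta > 0), and the band label.
    """
    deltas = [multi - human for human, multi in pairs]
    mean_delta = sum(deltas) / len(deltas)
    wins = sum(1 for d in deltas if d > 0)
    return mean_delta, wins, classify_delta(mean_delta)


# Hypothetical matched pairs for a single category.
pairs = [(0.71, 0.77), (0.68, 0.73), (0.60, 0.59)]
mean_delta, wins, label = summarize_pairs(pairs)
print(round(mean_delta, 3), wins, label)
```

Reporting the win count alongside the mean guards against a single large gap dominating the average.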

Few-Shot Performance Degradation

Few-shot evaluation reveals steep MCC drops and category‑specific robustness patterns.

GENEB standardizes genomic model evaluation, exposing that performance varies widely across tasks and can crumble when supervision is scarce.

Macro‑average MCC collapses from 0.488 in full‑shot to 0.106 in 1‑shot, a 78 % relative drop.

Figure 5 reports mean MCC of 0.488 (full), 0.253 (10‑shot), and 0.106 (1‑shot).

Promoter prediction retains 38.8 % of its full‑shot MCC at 1‑shot, and species classification retains 30.1 %.

Appendix F breakdown shows these percentages for the two categories.

Virus/phage, DNA methylation, and lncRNA fall to near‑random MCC (~0.02) at 1‑shot, a >93 % loss.

Figure 5 and Appendix F list MCC ≈ 0.027, 0.015, 0.022 respectively for the three categories.

Top‑performing models lose at least 0.45 MCC in absolute terms when reduced to 1‑shot, while the weakest models lose at most 0.28.

EVO‑1‑131K $\Delta$ = 0.196 versus GENERATOR‑EUKARYOTE‑3B $\Delta$ = 0.489 (see Figure 5).

**Figure 5.** Few-shot performance degradation. Macro-average MCC of genomic foundation models under full-data, 10-shot, and 1-shot evaluation regimes. Models are sorted by full-data performance. The top band reports the relative performance drop from full-data to 10-shot evaluation, highlighting the sensitivity of each model to limited supervision.
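The relative-drop and retention figures quoted in this section follow from simple ratios. The sketch below recomputes them from the mean MCC values reported above; the function names are illustrative.

```python
def relative_drop(full, few):
    """Fraction of full-shot MCC lost in the low-data regime."""
    return (full - few) / full


def retention(full, few):
    """Fraction of full-shot MCC retained in the low-data regime."""
    return few / full


FULL, TEN_SHOT, ONE_SHOT = 0.488, 0.253, 0.106  # mean MCC values from the text

print(round(100 * relative_drop(FULL, ONE_SHOT), 1))  # percent lost at 1-shot
print(round(100 * relative_drop(FULL, TEN_SHOT), 1))  # percent lost at 10-shot
```

Retention and relative drop sum to one, so a category that retains 38.8 % of its full-shot MCC has lost 61.2 %.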

The Limits of Scaling

High-variance tasks expose pretraining and architecture as decisive factors beyond model scale.

High‑variance tasks make pretraining scope and architectural family the primary predictors of top‑tier performance.

13 GENEB tasks exhibit cross‑model standard deviation above 0.12 (Figure 6A), and placement analysis (Figure 6B) shows systematic concentration of top‑3 slots.

Scaling alone does not close the absolute performance gap on the hardest tasks; even categories with statistically significant scaling still plateau far below saturation.

**Figure 6.** High-variance tasks reveal the role of pretraining data. (A) GENEB tasks with cross-model standard deviation above 0.12, corresponding to settings where model selection most strongly affects downstream performance. (B) Pretraining-data composition of top-3 and bottom-3 placements across these tasks. Multi-species and eukaryotic-gene pretraining dominate top placements, while human-only, prokaryotic, and microbial pretraining are concentrated among bottom placements. The result indicates that high-variance tasks expose biologically meaningful differences in pretraining scope that are obscured by aggregate leaderboards.
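The two-step analysis behind this figure (filter tasks by cross-model spread, then tally the pretraining corpora of the best-placed models) can be sketched as below. Whether the authors use the population or sample standard deviation is not stated, so the use of `pstdev` is an assumption, as are the function names and toy data.

```python
from collections import Counter
from statistics import pstdev


def high_variance_tasks(scores, threshold=0.12):
    """Tasks whose cross-model MCC standard deviation exceeds `threshold`."""
    return sorted(
        task
        for task, by_model in scores.items()
        if pstdev(list(by_model.values())) > threshold
    )


def top_k_corpora(scores, corpus_of, tasks, k=3):
    """Count pretraining corpora among the top-k models on each task."""
    tally = Counter()
    for task in tasks:
        ranked = sorted(scores[task], key=scores[task].get, reverse=True)
        tally.update(corpus_of[model] for model in ranked[:k])
    return tally


# Hypothetical scores for four models on two tasks.
scores = {
    "task_a": {"m1": 0.10, "m2": 0.50, "m3": 0.90, "m4": 0.70},
    "task_b": {"m1": 0.50, "m2": 0.52, "m3": 0.54, "m4": 0.53},
}
corpus_of = {"m1": "human", "m2": "human", "m3": "multi-species", "m4": "multi-species"}

selected = high_variance_tasks(scores)
print(selected, top_k_corpora(scores, corpus_of, selected, k=2))
```

With 13 high-variance tasks and three top slots each, a tally like this covers 39 placements, which matches the denominator in the "32 of 39" figure quoted earlier.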

Category-Specific Model Selection

Practical model‑selection guidance per genomic task category.

Per‑category model selection improves overall macro‑MCC compared with relying on a single aggregate ranking.

MUTBERT (86 M parameters) attains 0.529 macro‑MCC, the highest score among sub‑100 M models, surpassing the aggregate‑ranking baseline.

The evaluation protocol (GENEB) and the macro‑MCC aggregation remain unchanged across all recommendations, ensuring a fair comparison.

Performance varies dramatically across genomic task categories, so the best model for one category may be far from optimal for another.

**Table 1. Per-category scaling correlations.** Spearman rank correlation $\rho$ between $\log_{10}$(parameter count) and macro-MCC within each functional category ($n = 40$ models). $\rho$ near +1 indicates that larger models systematically outperform smaller ones; $\rho$ near 0 indicates no monotonic relationship between size and performance. The $p$-value tests whether the observed $\rho$ differs from zero; bolded values are significant at $p < 0.05$. Rows sorted by $\rho$ descending. Scaling is significant in 11 of 13 categories, with $\rho$ ranging from 0.345 (DNA methylation) to 0.579 (histone modifications).

**Figure 4.** Radar plots for category-aware model selection. Each subplot shows full-shot macro-MCC across the 13 GENEB task categories for a group of five models, grouped by overall macro-MCC rank from strongest to weakest. The plots expose category-specific strengths not captured by aggregate rankings: ENFORMER has a moderate overall rank but leads on TF binding (0.698), enhancers (0.539), and regulatory tasks (0.604), and ranks second on mouse enhancers (0.674) and third on chromatin accessibility (0.711); the GENOMEOCEAN family is particularly strong on virus/phage tasks; and plant-oriented models such as PLANTCADUCEUS and AGRO-NT-1B show relative strength on lncRNA tasks. These profiles motivate task-specific model selection over global leaderboard position.

Limitations and Scope

The authors enumerate key constraints of GENEB that affect interpretation and model selection.

GENEB’s design leaves several important gaps that limit how its results should be interpreted. These gaps affect model selection, especially under data‑limited conditions.

Long‑range regulatory interactions (> 10 kb) are underrepresented in the benchmark, so models that explicitly model very long context (e.g., HYENADNA‑LARGE‑1M) are not fully exercised where their architectural priors could matter most. Appendix D.3 lists the excluded long‑range datasets and the context‑length limits that caused the omission.

Task selection is constrained by the availability of public datasets and existing benchmarks, which means some constituent tasks are noisy or weakly defined. This curation bias can inflate or deflate performance on ill‑specified tasks.

Not all genomic foundation models could be included because their weights are unavailable, their pipelines are incompatible, or the required compute exceeds practical limits. Appendix D details the exclusion criteria and the models omitted for these reasons.

Only virus/phage classification represents a non‑eukaryotic domain; prokaryotic gene prediction, microbial genome assembly verification, and CRISPR system characterization are absent. Consequently, aggregate GENEB rankings are an unreliable proxy for performance on prokaryotic or viral genomics, as noted in the Domain Mismatch discussion.

GENEB evaluates frozen representations via linear probing, which enables controlled embedding comparisons but may underestimate gains achievable with task‑specific fine‑tuning. Moreover, single‑nucleotide and k‑mer tokenizations produce longer sequences than BPE, so the choice of pooling (mean, attention‑weighted, or final‑token) can favor certain tokenizations; these interactions remain unresolved.

Both micro‑ and macro‑averaged MCC are reported, yet histone‑modification (30 tasks) and promoter (22 tasks) together dominate the micro‑average, biasing it toward those categories. The authors therefore treat macro‑MCC as the principal aggregation and advise using category‑level results rather than a single overall ranking.
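As a quick check on the size of this bias, the combined weight of the two categories under each scheme follows from the task counts above, assuming each of the 100 tasks carries weight $1/100$ in the micro-average and each of the 13 categories carries weight $1/13$ in the macro-average:

$$
w_{\text{micro}} = \frac{30 + 22}{100} = 0.52, \qquad w_{\text{macro}} = \frac{2}{13} \approx 0.154 .
$$

Under micro-averaging, two categories therefore account for just over half of the aggregate score, compared with roughly 15 % under macro-averaging.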

Probe Stability Analysis

We test whether probe choice and regularization affect model rankings on GENEB.

GENEB evaluates frozen representations with a linear probe, but we verify that this choice does not distort the reported rankings.


**Table 7.** Representative task subset used for probe stability and protocol sensitivity analyses. One task is selected from each of the 13 functional categories in GENEB.

We first replace the linear logistic‑regression probe with a single‑hidden‑layer MLP (256 ReLU units) and keep every other component identical.


**Table 9.** Per-task Spearman rank correlation between linear and MLP probe rankings across the 11 representative models. Tasks are ordered by correlation strength. Twelve of thirteen tasks exhibit strongly positive rank correlation. The single negative value (GB ENSEMBL REGULATORY) corresponds to a task where MCC values are tightly clustered across models, leaving rank ordering dominated by noise rather than substantive performance differences.

Overall, probe choice yields a mean MCC difference of +0.011, confirming that linear probing is a reliable proxy for representation quality.

Next we examine how the regularization strength C of the linear probe influences few‑shot results.

The table presents statistics for pairwise $\rho$ across three different regimes: 1-shot, 10-shot, and Full-data. The columns include "Mean pairwise $\rho$", "Min $\rho$", and "Max $\rho$".

**Table 11.** Pairwise Spearman correlation between model rankings at different regularization strengths in the 1-shot regime. Rankings are nearly invariant ($\rho \geq 0.982$) across all $C$ pairs.

**Table 12.** Pairwise Spearman correlation between model rankings at different regularization strengths in the 10-shot regime. Adjacent $C$ values yield $\rho \geq 0.9$, with substantial divergence appearing only between extreme settings.

**Table 13.** Per-model MCC range across regularization strengths $C \in \{0.01, 0.1, 1, 10, 100\}$ in the full-data regime. The largest sensitivity is observed for LUCAONE, HYENADNA-LARGE-1M, and NT-v2-50M-MS; the smallest for GENOMEOCEAN-500M.
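The per-model sensitivity measure in this table is the spread of MCC across regularization strengths. A minimal sketch, with hypothetical model names and scores over the same grid of $C$ values:

```python
def mcc_range(scores_by_c):
    """Spread (max minus min) of one model's MCC across regularization strengths."""
    values = list(scores_by_c.values())
    return max(values) - min(values)


def rank_by_sensitivity(table):
    """Model names ordered from most to least sensitive to the choice of C."""
    return sorted(table, key=lambda model: mcc_range(table[model]), reverse=True)


# Hypothetical MCC values for C in {0.01, 0.1, 1, 10, 100}.
toy = {
    "model_x": {0.01: 0.40, 0.1: 0.45, 1: 0.50, 10: 0.52, 100: 0.51},
    "model_y": {0.01: 0.48, 0.1: 0.49, 1: 0.50, 10: 0.50, 100: 0.49},
}
print(rank_by_sensitivity(toy))
```

A small range means the model's score barely depends on the probe's regularization, so its rank is unlikely to be an artifact of that setting.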

These ablations demonstrate that both probe selection and regularization have limited impact on model rankings, reinforcing the robustness of the main GENEB conclusions.

Results Analysis Overview

Key numeric trends across few‑shot, aggregation, and controlled‑pair analyses are presented.

Micro‑ and macro‑averaged rankings are virtually identical (Spearman $\rho$ = 0.988), confirming that the primary GENEB conclusions are robust to the choice of aggregation.

Computed across all 40 models and 100 tasks, the rank correlation exceeds 0.98 with p < 0.001.

The few‑shot degradation observed in the main paper—mean MCC drops sharply from full‑shot to 1‑shot—remains consistent across all regularization settings, indicating that the effect stems from data scarcity rather than regularization choices.

Controlled‑pair comparisons (30 matched pairs) isolate single factors while holding all others constant, yet residual confounds such as model size, training duration, and pretraining objective persist, especially in architecture and tokenization sweeps.

Across the 12 tokenization pairs three regimes emerge: (i) BPE outperforms k‑mer in Transformer‑decoder models (+0.020 on average, with a peak gap of +0.071); (ii) BPE and k‑mer are comparable in Transformer‑encoder models (+0.006); (iii) single‑nucleotide tokenization (MUTBERT) consistently beats BPE in human‑pretrained encoders (+0.033 to +0.038).

Out‑of‑domain models show the largest MCC shifts when moving from macro‑ to micro‑averaging: EVO‑1‑131K (‑0.044), CADUCEUS‑PS‑131K (‑0.028), and PLANTCADUCEUS (‑0.024), reflecting their uneven category‑level performance.

Overall, the three analysis questions—architectural influence, tokenization strategy, and scale‑pretraining interaction—are answered by the systematic results presented here, reinforcing the paper’s central claim that model superiority is task‑dependent and fragile under data‑limited regimes.

Performance Landscape

We map how model size and design affect MCC across tasks, highlighting scaling trends and variability.

Model capacity positively correlates with macro‑MCC across the benchmark.

Spearman correlation between $\log_{10}$(parameter count) and macro‑MCC: $\rho = 0.565$, $p < 0.001$.

The benchmark evaluates 40 models spanning eight architectural families, four tokenization schemes, and eleven pretraining corpora. Tasks cover 100 downstream problems grouped into 13 functional categories, from histone modifications to virus detection. This breadth lets us isolate how design choices affect performance.

Boxplot analysis shows MCC variance differs markedly across categories, and the top‑ranked model wins only 20 of 100 tasks, with 15 other models sharing the remaining wins. Such fragmentation highlights the need for task‑aware model selection.
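The win tally described above can be sketched as a per-task argmax. The data layout and toy scores are assumptions; ties are broken here by dictionary order, which may differ from the authors' handling.

```python
from collections import Counter


def task_wins(scores):
    """Count, for each model, the tasks on which it has the highest MCC."""
    wins = Counter()
    for by_model in scores.values():
        # max() returns the first best model, so ties go to the earlier key.
        wins[max(by_model, key=by_model.get)] += 1
    return wins


# Hypothetical scores for two models on three tasks.
scores = {
    "task_1": {"a": 0.6, "b": 0.5},
    "task_2": {"a": 0.3, "b": 0.4},
    "task_3": {"a": 0.7, "b": 0.2},
}
print(task_wins(scores).most_common())
```

A flat distribution of wins across many models, as reported here, is the signal that no single model is a safe default for every task.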

Scale alone does not guarantee superiority: among the 36 in‑domain models there are 31 instances in which a model at least five times smaller outperforms a larger counterpart in aggregate MCC.
