Reinforcement Learning Elicits Contextual Learning of Unseen Language Translation

Reinforcement learning teaches LLMs to leverage in-context linguistic resources for translating unseen languages.

Can reinforcement learning (RL) teach LLMs to use in-context linguistic information (dictionaries, grammars) to translate low-resource languages better than standard supervised fine-tuning (SFT)?

Large language models struggle to translate extremely low-resource languages because they lack sufficient pre-training data and often overfit to specific language pairs during fine-tuning. The authors treat translation as a reasoning task, using reinforcement learning (RL) to train models to actively extract and apply linguistic information provided in the prompt, such as grammar rules and dictionary entries. This approach enables models to generalize to entirely unseen languages, significantly outperforming supervised fine-tuning, which tends to memorize training data rather than learn the meta-skill of leveraging context.

Paper Primer

The method frames translation as a meta-learning problem where the model is rewarded for translation quality (chrF) while conditioned on a "support set" of linguistic context. The core move is using Group Relative Policy Optimization (GRPO) to incentivize the model to treat the provided grammar, dictionary, and parallel sentences as a functional toolset rather than just passive text.

RL-trained models demonstrate superior generalization to unseen languages compared to supervised fine-tuning (SFT).

On five unseen languages from unrelated families, the RL-trained Qwen model achieved an average chrF of 0.27, compared to 0.09 for SFT and 0.18 for the base model. RL improves performance on unseen languages by approximately 3x over SFT.

Ablation studies reveal that bilingual dictionaries are the most critical component for performance, providing a direct word-level grounding that grammar passages and parallel sentences cannot fully replace. The model's ability to use these resources is learned during training, but the policy remains flexible enough to exploit retrieval context even if it was not explicitly trained with it.

Why use reinforcement learning instead of standard supervised fine-tuning for this task?

SFT encourages the model to memorize specific language mappings, which leads to overfitting and poor performance on new languages. RL, by contrast, optimizes for the meta-skill of using in-context information, allowing the model to adapt to any language provided in the prompt.

Does the model actually "learn" the grammar, or is it just relying on the parallel examples?

The authors find that parallel sentences and dictionaries are the primary drivers of performance, while grammar passages contribute minimally. This suggests the model relies more on direct linguistic evidence than on abstract rule application.

Introduction and Motivation

We expose why standard SFT overfits low‑resource languages and motivate a context‑aware RL solution.

Standard Supervised Fine‑Tuning (SFT) overfits to the specific low‑resource languages it sees, limiting zero‑shot transfer. In contrast, Reinforcement Learning (RL) can train models to actively exploit in‑context linguistic resources such as dictionaries and grammars, encouraging a meta‑skill of contextual leveraging. This motivates a shift from static SFT toward a context‑aware RL approach for unseen language translation.

When a model is trained only on the limited data of a few low‑resource languages, it memorizes language‑specific patterns instead of learning how to use external linguistic knowledge.

Shifting from static SFT to context‑aware RL equips LLMs with a transferable meta‑skill for translating unseen languages.

Prior Work in Low-Resource Translation

We survey prior low‑resource MT approaches that exploit contextual resources and reinforcement‑learning methods.

Early work on low‑resource translation highlighted the value of explicit linguistic resources. Tanzer et al. (2024) introduced the MTOB benchmark, pairing English with Kalamang and supplying a grammar book as context. Subsequent studies (Zhang et al., 2024b; 2024a; 2025) incorporated bilingual dictionaries, grammar books, and morphological analyses, often via the DiPMT framework, to guide LLM translation without updating parameters.

The MTOB benchmark evaluates how well a model can translate a truly low‑resource language (Kalamang) when provided with a grammar book that encodes linguistic rules.

Parallel lines of research have applied reinforcement learning to machine translation. RLVR (Lambert et al., 2025; Guo et al., 2025) and GRPO (Shao et al., 2024) supplied the RL fine-tuning recipes that later work adapted to MT, a task with no single correct output. Building on these, Feng et al. (2025) introduced MT-R1-Zero, combining BLEU and COMET-Kiwi rewards under the GRPO framework, while He et al. (2025) added format-signal rewards to encourage reasoning. Later works (Wang et al., 2026; Yang et al., 2026) moved beyond reference-based metrics toward trajectory-level generative and self-rewarding signals, and He et al. (2024) showed that quality-estimation models can serve as data-efficient RL rewards.

Recent efforts target low‑resource settings by teaching LLMs to consult external tools. Mosquera et al. (2025) use RL to make an LLM retrieve bilingual dictionaries during Spanish↔Wayuunaiki translation. Attia and Aji (2026) propose self‑supervised round‑trip RL with intrinsic chrF++ rewards for languages such as Aymara, Friulian, and Wolof. These approaches differ from earlier RL‑for‑MT methods that directly optimize translation quality from source‑target pairs; they instead aim to improve the model’s exploitation of in‑context linguistic resources.

The meta‑learning perspective predates RL‑based methods. Gu et al. (2018) applied MAML to low‑resource MT to learn initializations that adapt quickly to new language pairs. With the rise of in‑context learning, Brown et al. (2020) framed few‑shot prompting as implicit meta‑learning, leading to explicit meta‑training approaches such as MetaICL (Min et al., 2022), in‑context tuning (Chen et al., 2022), and theoretical analyses of transformer in‑context capabilities (Garg et al., 2022). Our work aligns with this lineage, using RLVR to teach models to exploit heterogeneous contextual resources rather than merely optimizing for translation fidelity.

Data Curation and Language Splits

We build a curated multilingual corpus that balances language splits and injects contextual retrieval for RL training.

Standard SFT quickly memorises the few parallel sentences available for low‑resource languages, limiting its ability to generalise. To give the RL agent a richer learning signal we curate a corpus that mixes language splits and explicit retrieval of external linguistic resources.

We partition languages into three buckets—Seen (training + test), Similar (unseen directions but same family), and Unseen (no related training data)—so the same model can be evaluated on progressively harder generalisation gaps.

Assign L₁ sentences to both training and test sets (4 train + 4 test).

Assign L₂ sentences only to the test set (0 train + 8 test) to simulate a family‑related but unseen direction.

Assign L₃ sentences only to the test set (0 train + 8 test) to simulate a completely new family.

During RL training the model sees only L₁ data; evaluation reports performance on L₁, L₂, and L₃.

This toy split mirrors the real corpus: the model learns from abundant Seen data while being forced to extrapolate to Similar and Unseen languages it has never observed.
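The toy split above can be sketched in a few lines of Python. This is an illustration only: the language codes `L1`, `L2`, `L3`, the bucket labels, and the helper `split_corpus` are hypothetical names, not the authors' data pipeline.

```python
# Toy corpus mirroring the walkthrough: 8 sentences per language.
# Language codes and sentence ids are illustrative placeholders.
corpus = {lang: [f"{lang}-sent-{i}" for i in range(8)] for lang in ("L1", "L2", "L3")}
buckets = {"L1": "seen", "L2": "similar", "L3": "unseen"}


def split_corpus(corpus, buckets, n_train_seen=4):
    """Send the first n_train_seen sentences of Seen languages to train;
    everything else (including all Similar and Unseen data) goes to test."""
    train, test = [], []
    for lang, sentences in corpus.items():
        if buckets[lang] == "seen":
            train += [(lang, s) for s in sentences[:n_train_seen]]
            test += [(lang, s) for s in sentences[n_train_seen:]]
        else:
            test += [(lang, s) for s in sentences]
    return train, test


train, test = split_corpus(corpus, buckets)
print(len(train), len(test))
```

Only `L1` ever reaches the training list, so any score on `L2` or `L3` reflects generalization rather than direct exposure.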

How does the “Similar” split differ from the “Seen” split if they share a language family?

“Similar” languages are held out entirely from training, even though they belong to the same family as a Seen language. The model therefore cannot rely on direct exposure to those lexical forms; it must transfer knowledge across related but unseen directions.

For each translation prompt we attach three retrieved items—dictionary entries, grammar excerpts, and a short parallel example—selected by longest‑common‑subsequence similarity, giving the RL agent external linguistic cues it could not learn from SFT alone.

Compute LCS between the source sentence and each entry; the entry “river → flüm” obtains the highest score.

Insert the dictionary line “river → flüm” and the grammar excerpt “Adjective follows noun” into the prompt.

Append a parallel example “The mountain is high → la muntogna è auta”.

The model then produces a step‑by‑step reasoning before outputting the final translation “Il flüm curra svelt”.

The retrieved items act as a lightweight knowledge base that the RL agent can query on‑the‑fly, reducing reliance on memorised parallel data.
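A minimal sketch of LCS-based retrieval follows. It assumes character-level LCS on lowercased strings, normalized by headword length; the source does not specify the granularity or normalization, so both are assumptions. The source sentence and the function names `lcs_length` and `retrieve` are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def retrieve(source, entries, k=1):
    """Rank (headword, translation) pairs by normalized LCS with the source."""
    src = source.lower()

    def score(entry):
        headword = entry[0].lower()
        return lcs_length(src, headword) / len(headword)

    return sorted(entries, key=score, reverse=True)[:k]


# Hypothetical dictionary entries and source sentence for illustration.
entries = [("mountain", "muntogna"), ("river", "flüm"), ("high", "auta")]
top = retrieve("The river runs fast", entries, k=1)
print(top)
```

Normalizing by headword length keeps long entries from winning merely because they share more incidental characters with the sentence.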

Why does the paper prefer LCS over embedding‑based similarity for retrieval?

LCS respects exact token order, which is crucial for morphology‑rich languages where word order signals grammatical relations. Embedding similarity can match semantically related tokens that appear in different positions, leading to noisy or irrelevant retrievals.

**Table 2.** Data summary. Languages are partitioned by evaluation split. Seen languages appear in both train and test; Similar directions are held out for evaluation but share their family with seen varieties; Unseen directions have no related training data. Only Seen languages are used for training; struck-through numbers (7,998, 750) indicate parallel data that exists for the held-out Similar and Unseen languages but is excluded from training. Dirs reports translation directions as langs × directionality (×2 for bidirectional, ×1 for unidirectional). Each prompt contains ~20–34 dictionary entries, 5 parallel sentence pairs, and optionally a grammar-book passage (~2.8k tokens on average). ‡Sursilvan → De and Surmiran → De, held out from training.

Prompt composition follows a five‑component template (Table 1) and ends with an instruction to reason step‑by‑step before translating. Across 18 languages we obtain 32,335 training pairs and 2,699 test pairs, with Romansh varieties providing the bulk of the training data.
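As a rough illustration of how such a prompt might be assembled, the sketch below joins dictionary entries, parallel pairs, an optional grammar passage, the source sentence, and a closing reasoning instruction. The section labels, their order, and the instruction wording are assumptions; the exact template is the one given in the paper's Table 1.

```python
def build_prompt(source, dictionary, parallel, grammar=None):
    """Concatenate retrieved context and the source sentence into one prompt.

    Labels and ordering here are hypothetical, not the paper's exact template.
    """
    parts = [
        "Dictionary:\n" + "\n".join(f"{s} → {t}" for s, t in dictionary),
        "Parallel sentences:\n" + "\n".join(f"{s} → {t}" for s, t in parallel),
    ]
    if grammar:  # the grammar passage is optional
        parts.append("Grammar:\n" + grammar)
    parts.append("Source: " + source)
    parts.append("Reason step by step, then give the final translation.")
    return "\n\n".join(parts)


prompt = build_prompt(
    source="cat",
    dictionary=[("cat", "gato")],
    parallel=[("the cat sleeps", "el gato duerme")],
    grammar="nouns are gendered",
)
print(prompt)
```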

The RL-with-Context Framework

We train translation models to reason over retrieved linguistic context rather than merely mimicking references.

The model receives a prompt that concatenates the source sentence with dictionary entries, parallel examples, and grammar notes; it is rewarded only for the final translation it produces.

How does RL‑with‑Context differ from a vanilla RL setup for language models?

Standard RL would reward the entire generated sequence; here we reward only the final translation while allowing the model to emit unrestricted reasoning. Moreover, the linguistic context is part of the prompt, not an external state, so the policy learns to attend to dictionary entries and grammar notes only when they improve the final translation.

The prompt becomes “[Dictionary: cat → gato] [Parallel: the cat sleeps → el gato duerme] [Grammar: nouns are gendered] Source: cat”.

The model first generates a reasoning line “The word ‘cat’ maps to a masculine noun in Spanish.”

It then outputs the translation hypothesis ŷ = “gato”.

chrF between “gato” and the reference “gato” is 100, so r = 1.0.

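Because the reward is maximal, the advantage $A = (1.0 - \mu_{G})/\sigma_{G}$ is positive whenever other completions in the sampled group scored lower, so the update reinforces this completion. Here $\mu_{G}$ and $\sigma_{G}$ are the mean and standard deviation of the rewards within the group.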

The example shows that the reasoning trace is ignored by the reward; only the final translation matters, which lets the model experiment with diverse explanations without being penalised for them.
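The reward in this walkthrough can be sketched as below. Two assumptions are made loudly: the final translation is taken to be the last non-empty line of the completion (the source does not say how it is delimited), and `simple_chrf` is a simplified character n-gram F-score, not the exact sacreBLEU chrF implementation.

```python
from collections import Counter


def char_ngrams(text, n):
    """Multiset of character n-grams, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hyp, ref, max_n=6, beta=2.0):
    """Simplified chrF on a 0-1 scale: averaged n-gram precision and recall,
    combined with an F-beta score. Orders with no n-grams are skipped."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if not h or not r:
            continue
        overlap = sum((h & r).values())
        precisions.append(overlap / sum(h.values()))
        recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    rec = sum(recalls) / len(recalls)
    if p + rec == 0:
        return 0.0
    return (1 + beta**2) * p * rec / (beta**2 * p + rec)


def reward(completion, reference):
    """Score only the final line; the reasoning above it is ignored."""
    lines = [ln.strip() for ln in completion.splitlines() if ln.strip()]
    final = lines[-1] if lines else ""
    return simple_chrf(final, reference)


completion = "The word 'cat' maps to a masculine noun in Spanish.\ngato"
r = reward(completion, "gato")
print(r)
```

Changing the reasoning line leaves `r` unchanged, which is the property the walkthrough relies on.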

GRPO update loop for a single training step.
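The group-relative advantage at the heart of this step can be sketched as follows. This covers only the normalization of rewards within one sampled group, not the full policy-gradient update; using the population standard deviation and a small `eps` for numerical stability are assumptions of this sketch.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and std of its sampled group."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions of the same prompt.
rewards = [1.0, 0.5, 0.0, 0.5]
advantages = group_relative_advantages(rewards)
print(advantages)
```

A completion with reward 1.0 receives a positive advantage only when the group contains lower-scoring samples; if every completion scores the same, all advantages are zero and the prompt contributes no gradient signal.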

**Table 1.** Components of an assembled prompt. The number of parallel sentences (3 or 5) is varied as an experimental setting.

The core trick is to reward only the final translation while feeding the model rich linguistic context, letting the policy discover how to exploit that context.

Main Results: SFT vs. RL

Standard SFT overfits low‑resource data, while RL trains models to leverage retrieved context for broader generalization.

We compare three difficulty tiers—seen Romansh varieties, similar held‑out varieties, and five unrelated languages—under both full retrieval context and a no‑context baseline. Table 3 reports chrF scores for two base models (Qwen3‑4B‑Base and Llama‑3.2‑3B‑Instruct) across these conditions.

RL outperforms SFT on unseen languages by 0.18 chrF points.

Table 3 shows RL averages 0.27 versus SFT 0.09 on the five unseen languages.

**Figure 2.** RL reward trajectories on Qwen3-4B-Base under three prompt configurations. (a) Held-out WMT24++ reward. (b) chrF training reward; faint lines raw, solid lines EMA-smoothed ($\alpha=0.92$).

RL consistently outperforms SFT across all language splits.

Component Ablation and Context Mismatch

We quantify how each retrieval component and context timing affect translation quality.

We run five matched RL experiments on Qwen3‑4B‑Base, each omitting a single retrieval component from both training and test prompts. This design guarantees the policy never encounters a missing component at inference that it saw during training.

**Figure 1.** **Train–test context mismatch** (RL, Qwen3-4B-Base). Test-time context dominates: no/full > full/no in every panel (En→Kal: 0.28 vs. 0.17).

Removing the bilingual dictionary hurts performance dramatically.

On seen Romansh (Puter/Vallader) chrF falls from 0.5324 to 0.4483, and on En→Kalamang from 0.3464 to 0.2626.

Removing parallel sentences reduces chrF by 0.10 (10 points) on seen Romansh.

Score drops from 0.5324 to 0.4324 on Puter/Vallader.

On the out‑of‑distribution Kalamang pair, parallel sentences are crucial.

chrF declines from 0.3464 to 0.2733, a 7.3‑point loss.

Removing grammar passages yields a modest 0.8‑point chrF drop on seen Romansh.

Score falls from 0.5324 to 0.5246.

Grammar removal also hurts En→Kal, lowering chrF by 1.5 points.

Score declines from 0.3464 to 0.1914.

Test‑time context alone outperforms training‑time context by 0.11 chrF on En→Kal.

no/full achieves 0.28 chrF versus 0.17 chrF for full/no.

Training with context further improves results, adding +0.07 chrF (7 points) on En→Kal.

full/full reaches 0.35 chrF, beating no/full’s 0.28 chrF.
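The two En→Kal comparisons can be recomputed directly from the reported scores, all on the 0-1 chrF scale. The dictionary keys are (training context, test context) pairs; the variable names are illustrative.

```python
# Reported En→Kal chrF by (train context, test context).
scores = {("no", "full"): 0.28, ("full", "no"): 0.17, ("full", "full"): 0.35}

# Gain from having context at test time rather than only at training time.
test_time_gain = round(scores[("no", "full")] - scores[("full", "no")], 2)

# Additional gain from also training with context.
training_gain = round(scores[("full", "full")] - scores[("no", "full")], 2)

print(test_time_gain, training_gain)
```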

Qualitative Case Study

RL with full context exactly matches the reference translations in the case study.

Full‑context RL achieves exact matches on both case‑study sentences, while Full‑context SFT misses the word “cold” in the second.

Table 5 shows the RL translation matches the reference for (1) and captures both “bathe” and “cold” for (2), whereas SFT only captures “bathe” and omits “cold”.

These results illustrate that the retrieved dictionary supplies the strongest learning signal, while the grammar component contributes the weakest boost. Together with the ablation analysis, they confirm the additive effect of each retrieval component on reward growth.

Conclusion and Limitations

RL leverages in‑context linguistic cues, outperforming SFT on unseen low‑resource languages.

We introduce a reinforcement‑learning (RL) approach that trains language models to actively exploit linguistic context retrieved in‑situ, rather than relying solely on supervised fine‑tuning (SFT) on scarce data.

By using translation quality (chrF++) as the direct reward, our experiments show RL‑trained models generalize substantially better to languages unseen during training, and they truly harness contextual cues instead of merely memorizing the seen languages.

Our work therefore offers a fresh angle on low‑resource translation, merging the complementary strengths of in‑context learning and RL.

We report chrF++ scores but omit human evaluation, which would provide a sharper view of fluency and adequacy; we treat the automatic metric as a reliable proxy for relative gains across methods.

Although our method outperforms SFT on unseen languages, absolute performance still trails that on higher‑resource languages, highlighting room for richer in‑context evidence and stronger context‑utilization signals.

Grammar Sources

We disclose LLM assistance and list the grammar books that supplied parallel data.

We employed a large language model (LLM) in two ways: to edit the manuscript text and to generate synthetic bilingual dictionary entries for low‑resource languages using GPT‑5 mini.

Because the authors are not speakers of the target languages, the synthetic entries were not manually validated; instead we chose between two prompt variants (v1 and v2) by measuring downstream chrF (character n‑gram F‑score) on translation outputs and kept the higher‑scoring variant for each language direction.
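The selection rule described above amounts to a per-direction argmax over downstream chrF. The sketch below shows that logic only; the direction names and scores are made-up placeholders, not values from the paper, and `pick_variants` is a hypothetical helper.

```python
def pick_variants(chrf_by_direction):
    """For each direction, keep the prompt variant with the higher chrF."""
    return {
        direction: max(variant_scores, key=variant_scores.get)
        for direction, variant_scores in chrf_by_direction.items()
    }


# Placeholder numbers for illustration only (NOT reported results).
chrf_by_direction = {
    "direction-A": {"v1": 0.21, "v2": 0.24},
    "direction-B": {"v1": 0.30, "v2": 0.27},
}
chosen = pick_variants(chrf_by_direction)
print(chosen)
```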

All core ideas, methodology, experiments, and analysis are the authors' original contributions, and the AI‑edited text was carefully reviewed for accuracy.

**Table 6.** Grammar-book sources used for data curation (§3.1). The eight grammars in the upper block are published by Language Science Press under the CC BY 4.0 license with LaTeX source available, from which we extracted parallel translation examples. The two Romansh idiom grammars in the lower block are published by Lia Rumantscha as print volumes; for these we extracted examples from the printed text rather than LaTeX source.

Language Resources

Key language families, splits, and resource availability for the appendix.

This appendix enumerates the languages used across three resource categories—seen Romance varieties, grammar‑book languages, and unseen OOD languages—detailing their families, split directions, and the availability of dictionaries and parallel corpora.

**Table.** Overview of languages, their families, data splits, directions, availability of dictionaries (Dict.) and parallel data (Par.), grammar sources, and pair counts.

Prompt Templates

Prompt templates for Romansh‑German and Kalamang‑English translation tasks.

This appendix records the exact prompt templates used for the two endangered‑language translation experiments. Each template is broken into six reusable components, with language‑specific wording shown for Romansh→German and Kalamang→English.
