A Local Perturbation Theory for Cross-Domain Interference and Recovery in Multi-Domain RL
Lei Yang, Siyu Ding, Deyi Xiong
Multi-domain RL interference arises from localized second-order damage on shared computation routes, not global gradient conflict.
How can we mathematically identify and selectively roll back the specific model parameters causing cross-domain interference in multi-domain RL?
Sequential reinforcement learning (RL) often degrades performance on earlier domains, a phenomenon typically attributed to catastrophic forgetting or global gradient conflict. However, these explanations fail to account for cases where global gradients remain nearly orthogonal despite significant performance drops. The authors show that interference is a localized second-order effect: sparse RL updates from later domains interact with earlier domains through shared active computation routes. Degradation concentrates in a low-dimensional "conflict subspace" where update directions determine whether domains act synergistically or antagonistically. A short "refresh" training step on the damaged domain geometrically contracts this harmful component, enabling selective recovery without reversing the entire later-domain update. This approach recovered Math performance from 57.66 to 66.04 while preserving other domains.
Paper Primer
The paper identifies that cross-domain interference is not a global model-wide phenomenon. Instead, it is a route-level interaction: different domains share active computation routes, and sparse parameter edits from later domains cause damage when they align with high-curvature directions of the earlier domain's objective.
The core mechanism is a local perturbation model: later-domain updates act as a second-order displacement on the earlier domain's loss landscape. Recovery is achieved by a short domain refresh, which acts as a geometric contraction of the harmful update component within the shared conflict subspace.
Short domain refresh enables selective recovery of degraded domains.
In a sequential Code → Math → QA → CW curriculum, a brief Math refresh after the final stage recovered Math performance from 57.66 to 66.04, a gain of about 8.4 points, while maintaining performance on the other domains.
Interference is localized to a low-dimensional conflict subspace.
A training-free rollback on a sparse coordinate proxy (selecting only 2% of MLP neurons) recovered 20.4% of the QA-induced Math loss. Targeted intervention on a small fraction of parameters significantly mitigates cross-domain damage.
Why does this problem matter if we can just use joint training?
Joint training requires balancing multiple objectives simultaneously, which is computationally expensive and often leads to sub-optimal performance across all domains. This paper provides a mechanism to understand and fix interference in sequential training, which is more scalable and avoids the need for complex data-mixture tuning.
Is this theory limited to specific model architectures?
The theory is grounded in local smoothness and route-level activation, which are general properties of modern transformer-based RL. While the authors validate this on Qwen3-4B, the mechanism of second-order damage on shared routes is expected to generalize to other post-training paradigms like on-policy distillation.
The authors' findings suggest that stable multi-domain RL does not require global gradient surgery or complex replay schedules; instead, targeted, route-aware refreshes on degraded domains are sufficient to maintain performance.
Researchers can now treat cross-domain interference as a localized route-level problem, replacing global gradient balancing with targeted, short-duration refreshes to recover performance.
Introduction and Motivation
Identifies why sequential multi‑domain RL degrades earlier tasks and how to recover them.
Sequential multi‑domain reinforcement learning improves a model on one task but often harms performance on previously learned tasks, a phenomenon the paper calls cross‑domain interference.
When a later‑domain RL update changes parameters that lie on computation routes still used by an earlier domain, the update can push those routes in a direction that harms the earlier task.
Empirically, full‑model gradient cosine between domains is near zero, yet Table 1 shows a dramatic drop in Math performance after QA and CW training while Code and QA stay stable.
Layer‑ and module‑level analyses (Figure 1) expose localized conflict: for the Math‑QA pair, some modules (e.g., layers.30.attn) exhibit strongly negative gradient cosine, whereas others (e.g., layers.7.attn) show positive alignment, confirming that interference is highly non‑uniform.
Under a local perturbation model, after training on domain A the checkpoint is approximately stationary for A’s objective; a later‑domain update therefore contributes a second‑order damage term that is large only when the update projects onto the curvature‑sensitive directions of A’s loss, i.e., the shared conflict subspace.
A brief “refresh iteration” on the earlier domain contracts the harmful component in this subspace, restoring Math to 66.04 while keeping Code, QA, and CW nearly unchanged, yielding the best average score of 66.39. A training‑free rollback on a sparse proxy of the conflict coordinates also recovers part of the loss.
The paper frames interference as a local, low‑dimensional phenomenon.
Related Work and Baselines
We survey prior multi‑domain RL work and detail our four‑domain experimental platform.
Related work on multi‑domain reinforcement learning has largely focused on maximizing aggregate returns, leaving the internal cross‑domain dynamics opaque.
When a model is trained sequentially on multiple tasks, updates for later tasks can overwrite parameters that were essential for earlier tasks, causing a sudden drop in performance on those earlier tasks.
Our empirical study builds a controlled four‑domain RL suite using the Qwen3‑4B‑Thinking‑2507 backbone, covering mathematics, code generation, question answering, and creative writing. Data are drawn from OpenR1‑math, KlearReasoner‑CodeSub‑15K, SuperGPQA, and the crownelius/Creative‑Writing series. Training follows the GRPO algorithm in VeRL with identical hyperparameters across domains, differing only in reward functions.
**Table.** Performance metrics across different domains (Code, Math, QA, CW). The table shows the performance of models trained on specific domains (rows) evaluated on others (columns).
**Table 1.** Performance comparison across different tasks (Math, Code, QA, CW) using various methods (Base, $Code_s$, $Math_s$, $QA_s$, $CW_s$, $Code_o$, $Math_o$, $QA_o$, $CW_o$, CGPO, JT, Re-Math).
**Table 6.** Performance of various neuron selection strategies for mitigating catastrophic forgetting. The table compares different selectors (`Math_o`, `QA_o`, Random, A×M, A×C, M×C, A×M×C, and Joint MLP+Attn) across metrics including Budget, Math Avg, $\Delta$ vs `QA_o`, Recovery, and $\Delta$ QA Avg.
Identifying the Conflict Subspace
We locate the precise places where cross‑domain interference manifests inside the model.
Cross‑domain interference is not a monolithic phenomenon; we first ask where it lives, then we trace how sparse updates travel through shared computation routes.
The conflict subspace is the low‑dimensional set of parameters that multiple domains edit in opposite directions, turning otherwise harmless tweaks into antagonistic interference.
As a toy illustration, consider a four‑parameter model in which both domains modify the same two parameters (1 and 3); the other two stay unchanged.
Form the update vectors $\mathbf{u}_A = (0.001, -0.002)$ and $\mathbf{u}_B = (-0.001, 0.002)$ on the shared subspace.
Compute cosine similarity $\cos(\mathbf{u}_A,\mathbf{u}_B) = -1$, indicating perfect antagonism.
The conflict subspace can arise even when each domain’s overall update is tiny, because interference is set by the opposing directions of the edits on the shared parameters rather than by their magnitude.
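The toy example above can be checked numerically. The sketch below is illustrative only: the four-parameter layout (parameters 1 and 3 stored at indices 0 and 2) and the names `delta_a`, `delta_b`, and `cosine` are assumptions made for this example, not the paper's code.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return float(np.dot(u, v) / (nu * nv))

# Hypothetical four-parameter model: both domains edit parameters 1 and 3
# (indices 0 and 2 here) and leave the other two untouched.
delta_a = np.array([0.001, 0.0, -0.002, 0.0])
delta_b = np.array([-0.001, 0.0, 0.002, 0.0])

# Coordinates edited by both domains form the shared subspace of this toy case.
shared = np.flatnonzero((delta_a != 0) & (delta_b != 0))
u_a, u_b = delta_a[shared], delta_b[shared]
conflict_cosine = cosine(u_a, u_b)
print(shared.tolist(), round(conflict_cosine, 6))
```

Scaling either update by any positive factor leaves the cosine unchanged, which is why a tiny update can still be perfectly antagonistic.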
How does “Conflict Subspace” differ from the generic “interference region” mentioned in the introduction?
“Interference region” is a vague notion of any area where tasks clash. “Conflict Subspace” is a precise, low‑dimensional linear span of shared active neurons whose updates are directionally opposed, which can be measured with cosine similarity.
Global‑gradient analysis computes the cosine $\cos(g_{d_i}, g_{d_j})$ between domain‑specific gradients $g_d$. Between Math and QA the cosine hovers around zero, suggesting near‑orthogonal full‑model updates.
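To see how a near-zero global cosine can coexist with strong local conflict, consider the following synthetic sketch. The gradients are random stand-ins (not measurements from the paper), and the dimensions and the 20-coordinate shared block are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, shared_dim = 10_000, 20

# Domain-specific parts live on disjoint coordinates, so they are exactly orthogonal.
g_math = np.zeros(dim)
g_qa = np.zeros(dim)
g_math[shared_dim:5_000] = rng.normal(size=5_000 - shared_dim)
g_qa[5_000:] = rng.normal(size=5_000)

# On the small shared block the two gradients point in opposite directions.
shared_dir = rng.normal(size=shared_dim)
g_math[:shared_dim] = shared_dir
g_qa[:shared_dim] = -shared_dir

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

global_cos = cosine(g_math, g_qa)                            # dominated by the orthogonal bulk
local_cos = cosine(g_math[:shared_dim], g_qa[:shared_dim])   # restricted to the shared block
print(f"global={global_cos:.4f} local={local_cos:.4f}")
```

The full-vector cosine is diluted by the many coordinates on which the domains do not interact, so it cannot by itself rule out antagonism on a small shared block.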
**Figure.** (a) Absolute parameter changes. (b) Relative parameter changes.
Parameter‑change histograms (Figure 2) confirm the updates are both sparse and of tiny magnitude, reinforcing the view of RL as a gentle perturbation rather than a wholesale rewrite.
To test whether interference could stem from co‑editing the same neurons, we rank MLP intermediate channels by the aggregate magnitude of their $\Delta W$ (gate, up, down projections) and compute Jaccard overlap of the top 10 % per domain.
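A minimal sketch of this overlap measurement follows. The weight-delta shapes, the use of summed absolute values as the "aggregate magnitude", and all function names are assumptions for illustration; the paper's exact aggregation may differ.

```python
import numpy as np

def neuron_change_scores(d_gate, d_up, d_down):
    """Aggregate |delta W| per MLP intermediate channel.

    Assumed shapes: d_gate and d_up are (intermediate, hidden) and d_down is
    (hidden, intermediate), so channel i owns row i of the gate/up deltas
    and column i of the down delta.
    """
    return (np.abs(d_gate).sum(axis=1)
            + np.abs(d_up).sum(axis=1)
            + np.abs(d_down).sum(axis=0))

def top_fraction(scores, fraction):
    """Set of indices of the highest-scoring channels (at least one)."""
    k = max(1, int(round(fraction * scores.size)))
    return set(np.argsort(scores)[-k:].tolist())

def jaccard(a, b):
    """Jaccard overlap |a & b| / |a | b| of two index sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Toy check with hypothetical score vectors over 20 channels.
scores_math = np.arange(20.0)
scores_qa = scores_math[::-1]
overlap = jaccard(top_fraction(scores_math, 0.10), top_fraction(scores_qa, 0.10))
print(overlap)
```

The same `top_fraction` and `jaccard` helpers apply to any per-neuron score, for example a mean absolute activation, by passing a different score vector and fraction.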
**Figure.** (a) Changed-neuron overlap. (b) Active-neuron overlap.
Despite weak edit overlap, inference‑time activations reveal a different picture: selecting the top 5 % most active neurons per domain yields substantially higher Jaccard scores for Math, Code, and QA, showing that these tasks share computation routes during forward passes.
Shared routes alone do not dictate interference; we therefore examine the directional alignment of updates on the intersecting neurons. For each domain pair we take the intersected top‑10 % neurons and compute cosine similarity of their update vectors.
**Figure 4.** Layer-wise average directional cosine on shared top-changed neurons across domain pairs
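The directional measurement described above can be sketched as follows. This is one plausible reading: it averages per-neuron cosines over the shared set, whereas the paper may instead take a single cosine over concatenated update vectors. The array layout and the function name `shared_neuron_cosine` are assumptions.

```python
import numpy as np

def shared_neuron_cosine(delta_a, delta_b, fraction=0.10):
    """Mean per-neuron cosine between two domains' updates on shared top-changed neurons.

    delta_a, delta_b: (num_neurons, dim) arrays holding each neuron's update
    vector for one layer (an assumed layout). Returns nan if no neuron is shared.
    """
    k = max(1, int(round(fraction * delta_a.shape[0])))
    top_a = set(np.argsort(np.linalg.norm(delta_a, axis=1))[-k:].tolist())
    top_b = set(np.argsort(np.linalg.norm(delta_b, axis=1))[-k:].tolist())
    shared = sorted(top_a & top_b)
    if not shared:
        return float("nan")
    ua, ub = delta_a[shared], delta_b[shared]
    cos = (ua * ub).sum(axis=1) / (np.linalg.norm(ua, axis=1) * np.linalg.norm(ub, axis=1))
    return float(cos.mean())

# Synthetic updates: 20 neurons, neurons 0 and 1 changed the most.
rng = np.random.default_rng(0)
delta_math = 1e-3 * rng.normal(size=(20, 8))
delta_math[:2] *= 100.0
antagonistic = shared_neuron_cosine(delta_math, -delta_math)
synergistic = shared_neuron_cosine(delta_math, delta_math)
print(antagonistic, synergistic)
```

Averaging this quantity per layer would give a layer-wise profile of the kind summarized in Figure 4.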
These observations motivate modeling cross‑domain interference as a localized conflict subspace: sparse edits propagate through shared active routes, and the sign of their projection determines whether the effect is synergistic or harmful.
Local Perturbation Theory
Local perturbation theory explains interference and fast recovery via short refreshes.
Three observations from the structural evidence drive the theory. First, full‑model gradient conflict cannot explain interference because gradients from different domains are often nearly orthogonal. Second, RL updates in each domain are sparse and have tiny magnitudes, touching only a few neurons. Third, even such sparse edits can clash when domains reuse the same active routes; the direction of the edit on those routes decides whether the interaction is synergistic or harmful.
Think of a model’s parameters as a tightrope. Training on a new domain nudges only a tiny segment of that rope; if the nudge pushes the rope where it’s already taut (high curvature), the rope sags elsewhere. A short “refresh” pulls back just the sagging segment, restoring balance without re‑tightening the whole rope.
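As a toy example in three dimensions, let the conflict subspace $S_{A,B}$ span the first two coordinates and take the later‑domain update $\delta_B = (0.20,\,-0.10,\,0.05)$.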
Projection onto $S_{A,B}$: $P_S\delta_B = (0.20,\,-0.10,\,0)$.
Norm of projected component: $\|P_S\delta_B\|_2 = \sqrt{0.20^2 + (-0.10)^2} \approx 0.224$.
After one refresh step with step size $\alpha=0.5$ and curvature $\mu_A=1$, the projected norm contracts to $(1-\alpha\mu_A)\times0.224 = 0.112$.
The third coordinate remains unchanged (0.05), illustrating that the refresh leaves non‑conflict directions untouched.
The example shows how a short refresh can halve the harmful component while preserving unrelated parameters, embodying the “pull‑only‑the‑sagging‑segment” intuition.
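The arithmetic of this example can be reproduced with a few lines of NumPy. The sketch assumes a purely quadratic local model for domain A with curvature $\mu_A$ inside $S_{A,B}$ (taken to be the first two coordinates, as the projection in the example implies) and zero curvature outside it; that model is an assumption made here for illustration.

```python
import numpy as np

delta_b = np.array([0.20, -0.10, 0.05])   # later-domain displacement
P_S = np.diag([1.0, 1.0, 0.0])            # projector onto S_{A,B}
mu_a, alpha = 1.0, 0.5

# Assumed local model: L_A(theta* + e) = 0.5 * e^T H_A e, with curvature mu_A
# inside S_{A,B} and no curvature outside it.
H_A = mu_a * P_S

harm_before = np.linalg.norm(P_S @ delta_b)
e_after = delta_b - alpha * (H_A @ delta_b)   # one gradient (refresh) step on L_A
harm_after = np.linalg.norm(P_S @ e_after)

print(round(harm_before, 3), round(harm_after, 3), e_after[2])
```

Because the gradient of the assumed quadratic vanishes outside $S_{A,B}$, the refresh step cannot move the third coordinate, while each step shrinks the in-subspace component by the factor $1-\alpha\mu_A$.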
The theory yields two testable predictions. (1) A brief refresh on the earlier domain should quickly restore its performance, because it contracts exactly the damaging subspace. (2) Even a sparse rollback that only removes a coordinate proxy for the projected component should partially heal the damage without retraining the whole model.
Task-Level Validation and Intervention
Short Math refresh recovers most loss while leaving other domains intact.
Re‑Math attains the best average score of 66.39, surpassing mixed‑training baselines.
Table 2 shows Re‑Math’s average 66.39 versus lower averages for baselines.
A brief “Refresh Iteration” on the degraded task undoes the harmful component of later updates, pulling its performance back up while other tasks stay essentially unchanged.
**Figure 9.** Validation dynamics during Re-Math refresh from $CW_o$ across the four domains.
**Figure 11.** Validation dynamics for the reverse QA $\rightarrow$ Math ordering.
Extended Structural Analysis
Deep analysis of gradient conflict, parameter sparsity, and neuron‑level edits across domains.
Appendix B provides deeper quantitative evidence that gradient conflict is localized and that each domain adds only a sparse set of parameter updates.
Module‑level gradient analysis (Figures 5–8) shows that conflict is concentrated in early MLP layers for the Math‑QA pair, while synergy appears in later layers for Code‑Math. Despite these localized conflicts, the raw cosine values stay near zero, indicating that the domains' gradients are nearly orthogonal over most of the model.
**Figure.** (a) Attention-module heatmap of pairwise gradient cosine. (b) MLP-module heatmap of pairwise gradient cosine.
**Figure.** (a) Top six attention and MLP modules with the strongest conflict (cosine over training steps). (b) Top six attention and MLP modules with the strongest synergy.
**Figure.** (Top) Relative cumulative parameter changes relative to the base model. (Bottom) Relative incremental parameter changes relative to the previous checkpoint.
**Figure.** (a) Absolute cumulative parameter changes relative to the base model.
**Figure.** (a) Absolute incremental parameter changes relative to the previous checkpoint.
Sequential multi‑domain training reduces the fraction of parameters with near‑zero change from 77.6 % after the first domain to 73.4 % after all four domains, as measured on the parameter‑change distributions shown in Figure 9.
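A statistic of this kind can be computed as below. The tolerance `tol` is a hypothetical choice (this section does not state the threshold used), and the arrays are synthetic stand-ins for flattened model parameters.

```python
import numpy as np

def near_zero_fraction(theta_new, theta_ref, tol=1e-5):
    """Fraction of parameters whose absolute change from theta_ref is at most tol."""
    return float(np.mean(np.abs(theta_new - theta_ref) <= tol))

# Synthetic example: 1000 parameters, only a quarter of which move noticeably.
theta_base = np.zeros(1000)
theta_rl = theta_base.copy()
theta_rl[:250] += 1e-3
frac = near_zero_fraction(theta_rl, theta_base)
print(frac)
```

Passing the base model as `theta_ref` gives a cumulative statistic, while passing the previous checkpoint gives an incremental one.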
Overall, gradient‑conflict heatmaps, sparse‑update statistics, and neuron‑overlap metrics demonstrate that cross‑domain interference is highly localized, justifying the paper’s selective rollback strategy.
Formal Structural Assumptions
Formal structural conditions that support the local perturbation analysis.
Assumption 1 requires each domain loss $L_d$ to be twice continuously differentiable in a neighbourhood $B(\bar\theta, r)$ around the training trajectory, with its Hessian norm uniformly bounded by $\beta_d$.
When a third‑order remainder is needed, Assumption 1b adds a Lipschitz condition on the Hessian.
Assumption 2 states that after training on domain A, the selected checkpoint $\theta^{*}$ is approximately stationary for its own objective.
Assumption 3 captures that training on a later domain B produces a perturbation $\delta_B$ that is both small (norm ≤ $r$) and effectively sparse, i.e. most of its energy lies in a low‑dimensional subspace $U_B$.
Assumption 4 introduces a low‑dimensional shared conflict subspace $S_{A,B}$ that captures the directions where domains A and B interfere.
Assumption 5 asserts that, despite local conflict inside $S_{A,B}$, the full gradients $g_A$ and $g_B$ are nearly orthogonal at the global scale.
Assumption 6 provides curvature guarantees on $S_{A,B}$ and ensures weak coupling in its orthogonal complement.
Finally, the refresh iteration chooses a step size $\alpha$ and a coupling bound $\xi_A$ that keep the update stable inside $S_{A,B}$.
Proposition Proofs
Derives Propositions 1 and 2, showing interference depends on curvature and the conflict subspace.
We now give the full derivations of Proposition 1 and Proposition 2, exposing how interference decomposes into curvature and conflict‑subspace components.
When the gradient at the A‑checkpoint is negligible, the quadratic curvature term dominates interference, so even tiny updates can cause forgetting if they lie in high‑sensitivity directions.
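A hedged reconstruction of the expansion behind this statement, writing $g_A = \nabla L_A(\theta^{*})$ and $H_A = \nabla^2 L_A(\theta^{*})$ (these symbols and the stationarity tolerance $\varepsilon_A$ are notation assumed here, not taken from the paper):

```latex
\begin{aligned}
L_A(\theta^{*} + \delta_B) - L_A(\theta^{*})
  &= g_A^{\top}\delta_B
   + \tfrac{1}{2}\,\delta_B^{\top} H_A\,\delta_B
   + O\!\left(\|\delta_B\|^{3}\right), \\
\left| g_A^{\top}\delta_B \right|
  &\le \|g_A\|\,\|\delta_B\|
   \le \varepsilon_A\,\|\delta_B\|
  \quad \text{whenever } \|g_A\| \le \varepsilon_A .
\end{aligned}
```

The cubic remainder relies on the Hessian-Lipschitz condition of Assumption 1b. Once the linear term is suppressed by approximate stationarity (Assumption 2), the change in $L_A$ is governed by the quadratic form, whose size depends on how much of $\delta_B$ lies along high-curvature directions of $H_A$.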
The first term captures damage inside the conflict subspace, the second measures how off‑subspace motion can leak back into sensitive directions, and the third reflects curvature effects that stay outside the subspace.
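One natural reading of this three-term split is the expansion of the quadratic form after writing $\delta_B = P_S\delta_B + (I-P_S)\delta_B$. The sketch below checks that identity numerically on random stand-ins; `H_A`, `P_S`, and the dimensions are hypothetical, and the paper's exact terms (which may be bounds rather than equalities) could differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 2

# Random symmetric positive semi-definite stand-in for the Hessian of L_A.
M = rng.normal(size=(n, n))
H_A = M @ M.T

# Orthogonal projector onto a k-dimensional stand-in for S_{A,B}.
Q, _ = np.linalg.qr(rng.normal(size=(n, k)))
P_S = Q @ Q.T

delta = 1e-2 * rng.normal(size=n)
d_s = P_S @ delta          # component inside the conflict subspace
d_perp = delta - d_s       # component outside it

in_subspace = 0.5 * d_s @ H_A @ d_s       # damage inside S_{A,B}
cross = d_s @ H_A @ d_perp                # off-subspace motion coupling back into S_{A,B}
outside = 0.5 * d_perp @ H_A @ d_perp     # curvature acting purely outside S_{A,B}
total = 0.5 * delta @ H_A @ delta

print(np.isclose(total, in_subspace + cross + outside))
```

The cross term appears once with coefficient one because $H_A$ is symmetric, so the two mixed products are equal.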
Consequently, even with globally near‑orthogonal gradients (Section 4), localized curvature within the small shared active subspace can dominate forgetting.
Theorem Proofs
Formal guarantees for refresh dynamics and their extensions.
Decomposing $e_t$ into its conflict component $e^{S}_t$ and orthogonal part $e^{\perp}_t = (I-P_S)e_t$ lets us bound the influence of off‑subspace directions via the coupling constant $\xi_A$.
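For a quadratic local model, this decomposition gives a one-step bound of the form $\|e^{S}_{t+1}\| \le (1-\alpha\mu_A)\|e^{S}_t\| + \alpha\xi_A\|e^{\perp}_t\|$. The sketch below verifies that bound numerically. The block Hessian, the numerical values, and the identification of $\xi_A$ with the spectral norm of the coupling block are assumptions for illustration; the paper's theorem may be stated differently.

```python
import numpy as np

rng = np.random.default_rng(1)
k, m = 2, 4                          # dimensions of S_{A,B} and of its complement
mu_a, xi_a, alpha = 1.0, 0.1, 0.5

# Block Hessian in a basis whose first k coordinates span S_{A,B}.
H_ss = np.diag([mu_a, 1.5])          # curvature at least mu_A inside S_{A,B}
C = rng.normal(size=(k, m))
C *= xi_a / np.linalg.norm(C, 2)     # coupling block with spectral norm xi_A
H_pp = 0.05 * np.eye(m)              # weak curvature outside S_{A,B}
H_A = np.block([[H_ss, C], [C.T, H_pp]])

e = 0.1 * rng.normal(size=k + m)     # displacement e_0 from the A-checkpoint
bound_ok, conflict_norms = [], [np.linalg.norm(e[:k])]
for _ in range(10):
    e_s, e_perp = e[:k], e[k:]
    bound = ((1 - alpha * mu_a) * np.linalg.norm(e_s)
             + alpha * xi_a * np.linalg.norm(e_perp))
    e = e - alpha * (H_A @ e)        # refresh step on the quadratic model
    bound_ok.append(bool(np.linalg.norm(e[:k]) <= bound + 1e-12))
    conflict_norms.append(np.linalg.norm(e[:k]))

print(all(bound_ok), conflict_norms[0], conflict_norms[-1])
```

The step size is chosen so that $\alpha$ times the largest eigenvalue of `H_ss` stays below one, which is what makes $1-\alpha\mu_A$ the contraction factor inside the subspace.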
Stability of other domains follows from the same smoothness assumptions: a refresh step on domain A perturbs the loss of any other domain C only through the inner product of their gradients.
These bounds show that as long as the total refresh displacement remains modest, other domains stay essentially unchanged.
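The single-step version of this statement follows from the standard smoothness inequality. In the hedged sketch below, $g_A = \nabla L_A(\theta)$ and $g_C = \nabla L_C(\theta)$ are notation assumed here, $\beta_C$ is the Hessian bound of Assumption 1 for domain C, and the segment between $\theta$ and $\theta - \alpha g_A$ is assumed to stay inside $B(\bar\theta, r)$:

```latex
\left| \, L_C(\theta - \alpha g_A) - L_C(\theta) + \alpha\,\langle g_C, g_A \rangle \, \right|
  \;\le\; \frac{\beta_C}{2}\,\alpha^{2}\,\|g_A\|^{2}
```

When $g_A$ and $g_C$ are nearly orthogonal, the first-order change in $L_C$ is close to zero and the remaining change is second order in the step size, so a short refresh with a small total displacement moves $L_C$ very little.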
Extension: alternating refresh across multiple domains can be interpreted as gradient descent on a weighted sum of local losses.
Thus, while a single refresh targets recovery of one degraded domain, alternating refresh provides a principled local mechanism for approaching a Pareto‑stationary compromise among all domains.
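A two-domain sketch of why alternating refresh resembles descent on a weighted sum is given below. The weights $w_A, w_C$ (for example, relative step sizes or step counts) are hypothetical notation introduced here, and the $O(\alpha^2)$ term uses the Hessian bound of Assumption 1:

```latex
\begin{aligned}
\theta'  &= \theta - \alpha\, w_A \nabla L_A(\theta), \\
\theta'' &= \theta' - \alpha\, w_C \nabla L_C(\theta')
          = \theta - \alpha\, \nabla\!\big( w_A L_A + w_C L_C \big)(\theta) + O(\alpha^{2}), \\
\text{Pareto-stationary at } \theta:\quad
  & \exists\, w_A, w_C \ge 0,\;\; w_A + w_C = 1,\;\;
    w_A \nabla L_A(\theta) + w_C \nabla L_C(\theta) = 0 .
\end{aligned}
```

The second line holds because $\|\nabla L_C(\theta') - \nabla L_C(\theta)\| \le \beta_C \|\theta' - \theta\|$, so evaluating the second gradient at $\theta'$ instead of $\theta$ changes the update only at order $\alpha^2$.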
Full Task-Level Results
Appendix D details benchmark scores, refresh behavior, interference asymmetry, and proxy rollback experiments.
D.1 reports the full benchmark‑level results for every checkpoint used in the main task‑level validation; the numbers are summarized in Table 5.
**Table 5.** Full benchmark-level results. Step denotes the selected checkpoint step for the corresponding run. AVG is the average over the four evaluation domains: Math, Code, QA, and CW.
D.2 examines refresh dynamics. Re‑Math behaves as a short local correction: Math scores rise sharply in the first few refresh steps and then plateau, whereas Code, QA, and CW change only marginally.
This pattern matches the contraction view of Theorem 1: the refresh update primarily removes the Math‑sensitive harmful displacement introduced by later‑domain training without overwriting the whole policy.
The Code → Math → Re‑Code trajectory further illustrates local recoverability. Early refresh improves Code while Math is barely affected; later Re‑Code training causes a noticeable Math decline, exactly as Proposition 1 predicts when the accumulated displacement grows.
D.3 compares the ordering QA → Math versus Math → QA. In the former, Math improves steadily and QA stays stable; in the latter, Math degrades sharply while QA is less affected, demonstrating directional interference.
The asymmetry aligns with the local perturbation analysis: damage to an earlier domain depends on whether the later update moves along that domain’s curvature‑sensitive directions.
D.4 probes the localization claim of Proposition 2 by intervening on a coordinate proxy for the Conflict Subspace.
Using the full three‑factor selector (A × M × C) we revert the QA‑induced displacement on the top‑scoring neurons; this recovers 20.4 % of the Math loss.
The two‑factor selector $M\times C$ is close, recovering 18.3 %; dropping either $A$ or $C$ cuts recovery roughly in half, confirming that all three signals contribute.
When we keep the full selector’s layer allocation but rank only by $A$ (activation‑only), recovery matches the full result; however, recomputing the layer budget from $A$ alone drops performance to 16.6 %, showing that the layer‑wise weighting supplied by the conflict term is essential.
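The selector-and-rollback procedure can be sketched as follows. This is a simplified illustration, not the paper's implementation: it scores neurons of a single weight matrix globally rather than with a per-layer budget, treats $A$ as activation and $C$ as the conflict term (as the text indicates), and assumes $M$ denotes update magnitude, which this section does not define. All function names are hypothetical.

```python
import numpy as np

def select_neurons(activation, magnitude, conflict, budget):
    """Indices of the top `budget` fraction of neurons by the product score A * M * C."""
    score = activation * magnitude * conflict
    k = max(1, int(round(budget * score.size)))
    return np.sort(np.argsort(score)[-k:])

def rollback(w_before, w_after, neuron_idx):
    """Revert the selected neurons (rows) to their pre-update values; keep the rest."""
    w = w_after.copy()
    w[neuron_idx] = w_before[neuron_idx]
    return w

# Synthetic stand-ins for one layer's weights before and after QA training.
rng = np.random.default_rng(0)
n, d = 100, 8
w_math = rng.normal(size=(n, d))
w_qa = w_math + 1e-3 * rng.normal(size=(n, d))

A = rng.random(n)                             # per-neuron activation score (synthetic)
M = np.linalg.norm(w_qa - w_math, axis=1)     # assumed meaning of M: update magnitude
C = rng.random(n)                             # per-neuron conflict score (synthetic)

idx = select_neurons(A, M, C, budget=0.02)
w_fixed = rollback(w_math, w_qa, idx)
print(idx.tolist())
```

Because only the selected rows are reverted, the intervention is training-free and leaves the later-domain update intact everywhere else.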
Table 6 (below) lists these ablations together with random baselines.
Figure 12 visualizes the layer‑wise budget distribution: including the conflict term yields a pronounced U‑shape (more neurons in shallow and deep layers), whereas omitting it flattens the profile.
Figure 13 shows a clear dose‑response curve: increasing $\beta$ from 1 % to 4 % raises recovery from 2 % to 29.4 %; beyond 4 % the gain saturates.
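The recovery percentages quoted here are presumably the share of the lost Math score that a rollback restores. A minimal sketch under that assumed definition follows; the scores in the example are made-up placeholders, not results from the paper.

```python
def recovery_fraction(score_before_damage, score_after_damage, score_after_rollback):
    """Share of the lost score that the intervention restores (assumed definition)."""
    lost = score_before_damage - score_after_damage
    if lost <= 0:
        raise ValueError("no degradation to recover from")
    return (score_after_rollback - score_after_damage) / lost

# Placeholder scores: 70 before the later domain, 60 after it, 62 after rollback.
example = recovery_fraction(70.0, 60.0, 62.0)
print(example)
```

Under this definition a value of 1.0 would mean the earlier-domain score is fully restored, and a value above 1.0 would mean the intervention exceeds the pre-damage score.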
Extending the intervention to attention layers yields a joint MLP + Attn selector. At $\beta\!=\!2\%$ it recovers only 12.7 % (worse than MLP‑only), but from $\beta\!=\!4\%$ upward it surpasses MLP‑only, reaching 73.6 % recovery at $\beta\!=\!32\%$.
Even with large budgets recovery does not reach 100 %, indicating residual interference outside the identified proxy coordinates or effects that are not captured by simple coordinate rollback.
Training Hyperparameters
Appendix details data construction, hyperparameters, and reward setups.
All prompts are truncated or filtered to stay within a 2,048‑token limit, and every domain contributes exactly 5,120 training examples. Validation sets are built from non‑training data: the Math domain uses the 30‑problem AIME25 set, while Code, QA, and Creative‑Writing each have 50 validation examples.
Creative‑Writing data are sampled from four curated subsets (Sonnet4.6‑800x, Gemini3Pro‑2700x, Reasoning‑KimiK2.5‑600x, Qwen3.5Plus‑2000x). For half of these (2,560 examples) we regenerate reference responses using the Qwen3‑235B‑A22B‑Instruct‑2507 model, then treat the new outputs as the ground‑truth references.
Evaluation draws from three external suites: LiveCodeBench‑v6 supplies 175 coding problems dated Jan–Apr 2025; QA testing samples 10 % of the remaining SuperGPQA pool, yielding 2,141 examples; MMLU‑Pro is similarly sampled to 1,203 test items.
Training follows a unified GRPO configuration whose full list appears in Table 4. Key settings include a train batch size of 256, PPO mini‑batch size of 256, max prompt length 2,048, max response length 16,384, learning rate $1\!\times\!10^{-6}$, and a KL‑loss coefficient of $0.0$ with type “`low_var_kl`”.
During single‑domain training we keep the checkpoint that achieves the highest validation score for that domain. In mixed‑domain training we instead pick the checkpoint whose average validation performance across all domains is maximal.
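The two selection rules can be written compactly as below. The `val_scores` structure (checkpoint step mapped to per-domain validation scores), the function names, and the numbers are hypothetical and used only for illustration.

```python
def best_single_domain(val_scores, domain):
    """Step with the highest validation score on `domain`."""
    return max(val_scores, key=lambda step: val_scores[step][domain])

def best_mixed(val_scores):
    """Step with the highest mean validation score across all domains."""
    def mean_score(step):
        scores = val_scores[step]
        return sum(scores.values()) / len(scores)
    return max(val_scores, key=mean_score)

# Made-up validation scores for three checkpoints and two domains.
val_scores = {
    100: {"math": 0.50, "code": 0.40},
    200: {"math": 0.60, "code": 0.30},
    300: {"math": 0.55, "code": 0.45},
}
print(best_single_domain(val_scores, "math"), best_mixed(val_scores))
```

In this toy case the two rules pick different checkpoints, which illustrates why the selection criterion matters when comparing single-domain and mixed-domain runs.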
Reward signals differ by task type: Math and QA receive a binary reward (1 for a correctly boxed answer, 0 otherwise). Code tasks are rewarded proportionally to the fraction of test cases that pass. Creative‑Writing uses an LLM‑as‑a‑judge preference reward: 1 if the model response is judged better than the reference, 0.5 if comparable, and 0 if worse.
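The three reward shapes can be sketched as plain functions. Everything below is an assumption for illustration: the regex handles only non-nested `\boxed{...}` contents, takes the last boxed answer, and compares by exact string match, none of which is specified in this section; the judge verdict labels are hypothetical as well.

```python
import re

def boxed_reward(response, gold):
    """1.0 if the last non-nested boxed answer equals the gold answer, else 0.0."""
    found = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return 1.0 if found and found[-1].strip() == gold.strip() else 0.0

def code_reward(passed, total):
    """Fraction of test cases passed (0.0 when there are no tests)."""
    return passed / total if total > 0 else 0.0

# Hypothetical verdict labels returned by an LLM judge comparing against the reference.
JUDGE_REWARD = {"better": 1.0, "comparable": 0.5, "worse": 0.0}

def writing_reward(verdict):
    """Preference reward for creative writing given the judge's verdict."""
    return JUDGE_REWARD[verdict]

print(boxed_reward("The answer is \\boxed{42}.", "42"),
      code_reward(3, 4),
      writing_reward("comparable"))
```

Math and QA rewards are sparse and binary, while the Code reward gives partial credit, so the scale of the learning signal differs across domains even with identical optimizer settings.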
For each WritingBench query, the model generates a single response, which is scored on multiple dimensions by the same Qwen3‑235B‑A22B‑Instruct‑2507 judge used for reference resampling.
**Table.** Hyperparameters used for training.