Decentralized Instruction Tuning: Conflict-Aware Splitting and Weight Merging
Minsik Choi, Geewook Kim
Decentralized instruction tuning via conflict-aware dataset splitting and one-shot weight merging.
How can we scale instruction tuning to heterogeneous data mixtures without the high communication costs of centralized synchronization or the performance degradation caused by gradient interference?
Centralized instruction tuning on heterogeneous mixtures forces conflicting datasets to share a single, synchronized training trajectory, which slows progress and requires high-bandwidth communication between GPUs. MERIT addresses this by partitioning the mixture into conflict-aware groups, training each branch independently on disjoint hardware, and reconciling them once via token-weighted parameter averaging. This approach improves performance over centralized joint training while eliminating the need for inter-partition communication during fine-tuning.
Paper Primer
MERIT (Merge-Ready Instruction Tuning) hinges on the insight that fine-tuned models starting from a shared "merge-ready" initialization remain within a connected low-loss region, allowing for effective one-shot merging. The core move is a conflict-aware split: the system estimates dataset-level gradient interference at the initialization, uses PCA to identify the dominant axes of disagreement, and partitions the mixture to maximize the curvature-weighted variance reduction achieved during merging.
MERIT consistently outperforms centralized joint training on multimodal instruction-tuning benchmarks.
On Qwen2.5-VL-3B with 136 Vision-FLAN tasks, MERIT improves the 8-benchmark average from 54.3 to 57.0, a gain of +2.7 points.
The method scales to 7B models and 1.6M-example mixtures, where it matches or exceeds centralized joint training while enabling parallel, communication-free training across fragmented compute environments.
Why use PCA-based splitting instead of random partitioning?
Random partitioning fails to systematically mitigate gradient interference. MERIT’s PCA-aligned splitting concentrates inter-group updates along high-curvature disagreement axes, which maximizes the merging gain—a benefit that grows with the Hessian spectral gap.
What is a "merge-ready" initialization?
It is a checkpoint from which independently fine-tuned models remain in a connected low-loss region, ensuring that one-shot parameter averaging does not lead to performance degradation. The authors verify this via linear mode connectivity and perturbation diagnostics.
Instruction tuning can be effectively decentralized: by optimizing the dataset split for conflict-awareness, practitioners can trade expensive, synchronized training for parallel, independent fine-tuning followed by a single, high-quality merge.
Introduction and Motivation
Instruction tuning struggles with gradient conflict and costly synchronization, prompting a decentralized solution.
Instruction tuning of large multimodal models relies on joint training over a heterogeneous mixture of tasks, but two intertwined bottlenecks arise: gradient interference that harms learning dynamics, and bandwidth‑heavy synchronization that forces tightly‑coupled clusters. MERIT tackles both by partitioning the mixture into conflict‑aware groups, fine‑tuning each group independently, and merging the resulting models once.
In centralized joint training, all tasks share a single training trajectory: at every step the gradients are summed across the entire mixture before the model parameters are updated.
When two datasets produce gradients that point away from each other, their summed update pushes the model toward a compromise that may be sub‑optimal for both.
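A small numerical illustration of this compromise, as a sketch only: the two gradient vectors below are made-up values, not measurements from the paper. It compares the first-order progress each dataset gets from the summed update with the progress it would get from following its own gradient.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two gradient vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical per-dataset gradients at the same parameters (illustrative values).
g_a = np.array([1.0, 0.2])
g_b = np.array([-0.8, 0.3])

summed = g_a + g_b  # update direction used by joint training

conflict = cosine(g_a, g_b)  # negative means the datasets disagree

# First-order loss decrease for each dataset along `summed`, relative to the
# decrease it would obtain by following its own gradient (1.0 = no loss).
progress_a = float(g_a @ summed) / float(g_a @ g_a)
progress_b = float(g_b @ summed) / float(g_b @ g_b)

print(f"cosine={conflict:.2f}, progress_a={progress_a:.2f}, progress_b={progress_b:.2f}")
```

With these values both relative-progress ratios fall below 1, which is the sense in which the summed update can be sub-optimal for both datasets.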
**Figure 1.** Centralized training vs. MERIT. Centralized tuning synchronizes conflicting tasks across a tightly-coupled cluster (top); MERIT partitions the mixture by conflict, fine-tunes each group independently, and merges once into $\bar{\theta}^*$ (bottom).
The core trade‑off is between preserving data diversity (which favors centralized joint training) and reducing synchronization cost (which favors decentralized partitioning).
Foundations of Model Merging
Prior work on model merging, federated optimization, and conflict‑aware instruction tuning.
Model merging exploits the geometry of loss landscapes: averaging checkpoints can improve generalization when the underlying minima are connected and flat.
When several checkpoints lie in a region where the loss surface is smooth, their parameters can be averaged directly, yielding a single model that inherits the strengths of each.
Two minima are “connected” if a low‑loss path exists between them, meaning one can move from one set of weights to the other without crossing a high‑error barrier.
Model‑soup literature (e.g., Model Stock) demonstrates that uniform averaging of independently fine‑tuned checkpoints improves performance on the same task, and later work adds curvature‑aware or sign‑alignment heuristics to make merging more robust.
Federated learning methods such as FedAvg and Local SGD repeatedly average locally trained models; recent extensions use gradient‑compression, seed‑based full‑parameter tuning, forward‑pass perturbation, low‑rank updates, or personalization to cut communication cost.
One‑shot federated learning removes the iterative sync loop by aggregating once after local training. This is structurally similar to MERIT’s single‑merge pipeline, yet FL assumes fixed, privacy‑driven partitions while MERIT actively designs conflict‑aware splits on centrally available data.
Gradient interference is a well‑known cause of negative transfer in multi‑task and instruction tuning; prior work mitigates it with gradient reweighting or projection, but these require synchronized access to per‑task gradients.
Vision‑FLAN reports conflicts between short‑answer and conversational instruction sets, and alignment datasets often encode competing objectives (e.g., helpfulness vs. harmlessness), motivating pipelines such as RLHF‑V that reconcile safety and performance.
Large‑scale instruction corpora are assembled by curating heterogeneous datasets and tuning mixture ratios; MERIT adds a reusable decomposition primitive that estimates dataset interactions once, enables communication‑free parallel fine‑tuning, and merges at the end.
Theoretical Framework
Explains why merging fine‑tuned checkpoints in a flat basin improves loss and generalization.
Naïve averaging of fine‑tuned checkpoints can increase loss when the models drift into different basins; the core difficulty is keeping all checkpoints inside a shared flat region.
A carefully prepared checkpoint $\theta^{(0)}$ sits in a wide, low‑loss basin so that any independently fine‑tuned branch stays within the same connected region.
How does “merge‑ready initialization” differ from simply picking a random checkpoint as the starting point?
A random checkpoint may lie near a sharp minimum; averaging branches that start there can create high‑curvature gaps. $\theta^{(0)}$ is deliberately placed in a flat region, guaranteeing that any convex combination of branch weights stays within the same basin, eliminating loss spikes.
The method partitions datasets along directions where their gradients disagree most strongly, which coincide with the model’s stiffest curvature axes; averaging the resulting branches then cancels high‑curvature error components.
Why does applying PCA to gradient conflicts work if we never compute the Hessian directly?
Gradient disagreement between datasets is proportional to $H\Delta_t$; thus the covariance of gradients reflects the Hessian’s eigenstructure. PCA on this covariance therefore uncovers the same high‑curvature directions that would be obtained from an explicit eigen‑decomposition of H, but at far lower cost.
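The claim can be checked on a toy problem. The sketch below is an illustration under stated assumptions, not the paper's procedure: it builds a synthetic symmetric Hessian `H` with a clear spectral gap, draws isotropic offsets $\Delta_t$ (an assumption), forms gradients $g_t = H\Delta_t$, and runs PCA on the gradients alone. `H` is used only to generate the data and to score the result.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Hessian with eigenvalues 10, 1, 0.5 and random orthonormal eigenvectors.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
eigvals = np.array([10.0, 1.0, 0.5])
H = Q @ np.diag(eigvals) @ Q.T

# Assumption: isotropic per-dataset offsets Delta_t, so that g_t = H @ Delta_t.
deltas = rng.normal(size=(2000, 3))
grads = deltas @ H  # H is symmetric, so row t equals H @ Delta_t

# PCA on the gradients via SVD of the centered gradient matrix.
centered = grads - grads.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
top_pc = vt[0]

# Compare with the true top-curvature direction (sign is arbitrary).
alignment = abs(float(top_pc @ Q[:, 0]))
print(f"|cos(top PC, top Hessian eigenvector)| = {alignment:.4f}")
```

Under these assumptions the gradient covariance is approximately $H^2$, which shares eigenvectors with $H$, so the top principal component aligns with the stiffest curvature direction.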
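Worked example: take Hessian eigenpairs $(\lambda_1 = 4, u_1)$ and $(\lambda_2 = 1, u_2)$ and two branch displacements $\delta_1$, $\delta_2$ from $\theta^{(0)}$. Project the displacements onto the eigenvectors: $u_1^\top\delta_1 = 2$, $u_1^\top\delta_2 = 0$; $u_2^\top\delta_1 = 0$, $u_2^\top\delta_2 = 1$.
Compute the weighted variance along each axis (uniform weights $w_1 = w_2 = 0.5$): $\mathrm{Var}_w(u_1^\top\delta) = 0.5\cdot(2-1)^2 + 0.5\cdot(0-1)^2 = 1$, $\mathrm{Var}_w(u_2^\top\delta) = 0.5\cdot(0-0.5)^2 + 0.5\cdot(1-0.5)^2 = 0.25$.
Apply the gain formula: $\mathrm{Gain} = \tfrac{1}{2}\left[\lambda_1\,\mathrm{Var}_w(u_1^\top\delta) + \lambda_2\,\mathrm{Var}_w(u_2^\top\delta)\right] = \tfrac{1}{2}\left[4\cdot 1 + 1\cdot 0.25\right] = 2.125$.
The merged checkpoint $\bar{\theta} = \theta^{(0)} + (\delta_1+\delta_2)/2 = \theta^{(0)} + (1, 0.5)$ has displacement norm $\sqrt{1.25} \approx 1.12$, below the weighted average of the branch norms ($0.5\cdot 2 + 0.5\cdot 1 = 1.5$), illustrating implicit norm regularization.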
The gain is dominated by variance along the high‑curvature direction u₁; aligning splits with that direction (as PCA does) yields the largest possible improvement.
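The gain formula can be cross-checked numerically. The sketch below assumes a local quadratic loss $\tfrac{1}{2}\delta^\top H \delta$ around $\theta^{(0)}$ with $H = \mathrm{diag}(\lambda_1, \lambda_2)$ in the eigenbasis, and interprets the gain as the gap between the weighted average branch loss and the loss of the merged point; both are assumptions made for illustration. It then compares that gap with the curvature-weighted variance expression.

```python
import numpy as np

lam = np.array([4.0, 1.0])        # Hessian eigenvalues (lambda_1, lambda_2)
deltas = np.array([[2.0, 0.0],    # delta_1 expressed in the eigenbasis (u_1, u_2)
                   [0.0, 1.0]])   # delta_2
w = np.array([0.5, 0.5])          # uniform merge weights

def quad_loss(d):
    """Assumed local quadratic model: 0.5 * d^T diag(lam) d."""
    return 0.5 * float(np.sum(lam * d * d))

merged = w @ deltas  # weighted mean displacement

# Gain measured directly: average branch loss minus loss at the merged point.
mean_branch_loss = float(sum(wk * quad_loss(d) for wk, d in zip(w, deltas)))
direct_gain = mean_branch_loss - quad_loss(merged)

# Gain from the formula: half the curvature-weighted variance per eigen-axis.
var_w = w @ (deltas - merged) ** 2
formula_gain = 0.5 * float(lam @ var_w)

print(direct_gain, formula_gain)
```

For a quadratic loss the two quantities agree, so the variance along each eigen-direction, scaled by its eigenvalue, fully accounts for the benefit of averaging.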
**Figure 2.** Local loss surfaces before and after basin preparation (e.g., LLaVA Stage 2). The merge-ready initialization $\theta^{(0)}$ resides in a flat, connected region (right), where independently fine-tuned checkpoints remain within the same basin. Our analysis operates within this flat-basin regime, yielding three key implications that directly motivate MERIT's algorithm design.
**Table 1.** Comparison of displacement and training loss between joint and merged training across different epochs.
The MERIT Pipeline
MERIT partitions heterogeneous data, fine‑tunes each group independently, then merges models via token‑weighted averaging.
Heterogeneous instruction data cause gradient interference, and centralized joint training demands costly synchronization. MERIT sidesteps both issues by splitting the data into conflict‑aware groups that are fine‑tuned separately and then merged.
MERIT treats a mixed dataset like a set of books that are first sorted into genre‑coherent shelves, each shelf is polished independently, and the final library is reassembled by weighting each shelf according to its size.
Compute normalized gradients $\tilde{g}_t$ for each dataset (Step 1).
Form the cosine‑similarity matrix $C$ and run PCA to obtain 2‑D embeddings $z_t$ (Step 2).
Recursively split along the first PCA axis at the sample‑weighted median, producing two halves (Step 3).
Split each half along the second PCA axis, yielding four groups $\mathcal{G}_1$–$\mathcal{G}_4$ with balanced sample totals (≈ 112 each).
Train a model $\theta_k$ on the union of datasets in each group (Step 4).
Merge the four checkpoints with weights $w_k = N_k / \sum_j N_j$ (Step 5), where $N_k$ are the token budgets of the groups.
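The merge in Step 5 can be sketched as follows. This is a minimal illustration, assuming checkpoints are dictionaries of NumPy arrays with identical keys; the function name `token_weighted_merge` and the toy tensors are hypothetical and not taken from the paper's code.

```python
import numpy as np

def token_weighted_merge(checkpoints, token_budgets):
    """Average parameter dicts with weights w_k = N_k / sum_j N_j."""
    budgets = np.asarray(token_budgets, dtype=float)
    weights = budgets / budgets.sum()
    merged = {}
    for name in checkpoints[0]:
        # Weighted sum of the same tensor across all branches.
        merged[name] = sum(w * ckpt[name] for w, ckpt in zip(weights, checkpoints))
    return merged, weights

# Hypothetical two-branch example with toy parameter tensors.
theta_1 = {"layer.weight": np.array([1.0, 2.0])}
theta_2 = {"layer.weight": np.array([3.0, 6.0])}
merged, weights = token_weighted_merge([theta_1, theta_2], token_budgets=[300, 100])

print(weights, merged["layer.weight"])
```

Because the weights are non-negative and sum to one, the merged parameters are a convex combination of the branch parameters.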
The recursive median splits keep groups size‑balanced while still separating datasets that point in opposite gradient directions, which a naïve K‑means clustering would not guarantee.
Imagine sorting a mixed pile of puzzle pieces: pieces that fit together (aligned gradients) stay in the same box, while pieces that pull in opposite directions are placed in different boxes.
How does Conflict‑Aware Partitioning differ from standard K‑means clustering on the same gradient embeddings?
K‑means optimizes Euclidean distance and can produce highly imbalanced clusters, whereas MERIT’s recursive median splits are explicitly sample‑balanced and operate on cosine‑derived axes, guaranteeing that each branch receives a comparable amount of training data while still separating opposing gradient directions.
**Algorithm 1** MERIT: Conflict-Aware Dataset Partitioning and Weight Merging
**Require:** Merge-ready initialization $\theta^{(0)}$; datasets $\{\mathcal{D}_t\}_{t=1}^T$ with sample counts $\{s_t\}_{t=1}^T$ and token budgets $\{n_t\}_{t=1}^T$; PCA dimension $r$.
**Ensure:** Merged model $\bar{\theta}$.
1: $\triangleright$ Step 1: Gradient conflict estimation at $\theta^{(0)}$.
2: **for** $t = 1, \dots, T$ **do**
3: Compute $g_t$ at $\theta^{(0)}$ under identical training settings (backbone, trainable-parameter subset, gradient-estimation budget); set $\tilde{g}_t \leftarrow g_t / \|g_t\|$.
4: **end for**
5: Form $C \in \mathbb{R}^{T \times T}$ with $C_{ij} = \langle \tilde{g}_i, \tilde{g}_j \rangle$.
6: $\triangleright$ Step 2: PCA-based conflict decomposition.
7: Apply (column-centered) PCA to $C$ and obtain the top-$r$ PCA embedding $z_t \in \mathbb{R}^r$ for each $t$.
8: $\triangleright$ Step 3: Balanced conflict-aware partitioning.
9: Recursively split $\{1, \dots, T\}$ along the $r$ PCA axes via sample-balanced medians (weights $s_t$) into $K = 2^r$ disjoint groups $\{\mathcal{G}_k\}_{k=1}^K$, balancing per-group sample counts $\sum_{t \in \mathcal{G}_k} s_t$; let $N_k = \sum_{t \in \mathcal{G}_k} n_t$ denote the per-group token budget (with $\sum_{k=1}^K N_k = \sum_{t=1}^T n_t$).
10: $\triangleright$ Step 4: Communication-free group-wise training.
11: **for** $k = 1, \dots, K$ **do**
12: Train $\theta_k$ from $\theta^{(0)}$ on $\bigcup_{t \in \mathcal{G}_k} \mathcal{D}_t$ using budgets $\{n_t\}_{t \in \mathcal{G}_k}$.
13: **end for**
14: $\triangleright$ Step 5: Token-weighted parameter-space merging.
15: **return** $\bar{\theta} = \sum_{k=1}^K w_k \theta_k$ with $w_k = N_k / \sum_{j=1}^K N_j$.
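Steps 1 to 3 of the algorithm can be sketched in a few lines. This is an illustrative reading, not the authors' implementation: the gradients are random stand-ins, the helper names `pca_embed` and `balanced_split` are hypothetical, and the weighted-median cut rule is one reasonable interpretation of "sample-balanced medians".

```python
import numpy as np

def pca_embed(grads, r):
    """Top-r PCA embedding of the column-centered cosine-similarity matrix."""
    g = grads / np.linalg.norm(grads, axis=1, keepdims=True)  # unit-normalize rows
    C = g @ g.T                                               # C_ij = <g_i, g_j>
    centered = C - C.mean(axis=0, keepdims=True)              # column centering
    U, S, _ = np.linalg.svd(centered, full_matrices=False)
    return U[:, :r] * S[:r]                                   # PCA scores z_t

def balanced_split(indices, z, sizes, axis=0):
    """Recursively bisect along successive PCA axes at the sample-weighted median."""
    if axis == z.shape[1] or len(indices) < 2:
        return [indices]  # degenerate halves are left unsplit
    order = indices[np.argsort(z[indices, axis])]
    cum = np.cumsum(sizes[order])
    # First position where the cumulative sample count reaches half the total.
    cut = int(np.searchsorted(cum, cum[-1] / 2.0)) + 1
    cut = min(cut, len(order) - 1)  # keep both halves non-empty
    return (balanced_split(order[:cut], z, sizes, axis + 1)
            + balanced_split(order[cut:], z, sizes, axis + 1))

# Synthetic stand-in: 8 datasets, 16-dimensional gradients, equal sample counts.
rng = np.random.default_rng(0)
grads = rng.normal(size=(8, 16))
sizes = np.ones(8)

z = pca_embed(grads, r=2)
groups = balanced_split(np.arange(8), z, sizes)
print([g.tolist() for g in groups])
```

With $r = 2$ and equal sample counts this produces $K = 4$ disjoint groups of two datasets each; unequal `sizes` shift each cut so that the halves carry comparable sample totals rather than equal dataset counts.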
**Figure 5.** Two-dimensional t-distributed stochastic neighbor embedding (t-SNE) (Van der Maaten & Hinton, 2008) of dataset-level gradients at $\theta^{(0)}$ (Qwen2.5-VL-3B, 136 Vision-FLAN tasks), overlaid with kernel density estimation (KDE) contours. Each marker is one task. Labeled examples are colored by task type: VQA (yellow), image classification (red), and description/captioning (green).
Empirical Evaluation
MERIT delivers consistent gains across multimodal benchmarks, surpassing joint training under equal budgets.
Instruction tuning on heterogeneous data mixes causes gradient interference, while centralized training is costly; MERIT partitions data into conflict‑aware groups and merges them.
MERIT‑3D improves the average benchmark score by +2.5 points over random partitioning under the same budget.
Table 2 compares MERIT‑3D with Random (8 groups) under the same budget; the +2.5‑point improvement isolates the effect of the conflict‑aware split.
**Table 2.** Performance comparison of different models across various benchmarks, including General MCQA, User Preference & Fluency, Text-Rich VQA, and Image Reasoning.
**Table 3.** Comparison over LLaVA-Series on diverse MLLM benchmarks under two 7B base models. We report each of the eight benchmarks together with their overall mean (Avg., computed only when all eight scores are available). For each base model, we compare further full fine-tuning via centralized Joint FFT against MERIT under a matched training budget; MERIT uses the 2D split with K=4 groups. Bold marks the better of Joint FFT vs. MERIT within the same base. For the 0.7M-base build we report the first seed, and its Joint FFT and MERIT are each validated over three independent seeds in Appendix C.2; the stronger 3.6M-base build is a single run.
**Table 4.** Text-only benchmark results on Qwen2.5-3B with 66 FLAN tasks. Bold indicates the best result in each column, and underline indicates the second-best result in each column.
The substantial gap between MERIT‑3D and Random (8 groups) demonstrates that aligning partitions with gradient‑conflict directions is more beneficial than arbitrary splitting.
Proposition 3.2 predicts a monotonic increase in benefit as the number of PCA dimensions grows, which the experiments confirm: MERIT‑3D (8 groups) outperforms lower‑dimensional splits.
Merge-Readiness Diagnostics
Four diagnostics confirm MERIT’s merge‑ready initialization holds.
MERIT partitions heterogeneous data into conflict‑aware groups, fine‑tunes each group independently, and finally merges the checkpoints. This section tests whether the merged model truly starts from a shared, flat region of the loss landscape.
All pairwise and branch‑to‑merged interpolation paths exhibit a zero loss barrier.
Table 5 reports a barrier of 0.0 for every path in the 2‑D split ($K=4$).
**Table 5.** Linear Mode Connectivity (2D split, $K=4$). All 10 barriers are exactly 0, confirming that branches remain in a shared flat basin with no loss barrier.
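A loss-barrier check of this kind can be sketched as follows. The barrier definition used here (maximum excess of the path loss over the linear interpolation of the endpoint losses) is a common convention and an assumption about the paper's exact metric; the two toy loss functions are illustrative only.

```python
import numpy as np

def loss_barrier(loss_fn, theta_a, theta_b, num_points=11):
    """Max excess of the loss on the linear path over the interpolated endpoint losses."""
    alphas = np.linspace(0.0, 1.0, num_points)
    path = [loss_fn((1 - a) * theta_a + a * theta_b) for a in alphas]
    baseline = [(1 - a) * path[0] + a * path[-1] for a in alphas]
    return max(0.0, max(p - b for p, b in zip(path, baseline)))

# Convex toy loss: both endpoints share one basin, so no barrier is expected.
convex = lambda th: float(th @ th)
barrier_convex = loss_barrier(convex, np.array([1.0, 0.0]), np.array([0.0, 1.0]))

# Double-well toy loss: endpoints sit in different minima separated by a bump.
double_well = lambda th: float((th[0] ** 2 - 1.0) ** 2)
barrier_wells = loss_barrier(double_well, np.array([-1.0]), np.array([1.0]))

print(barrier_convex, barrier_wells)
```

A zero barrier, as in the convex case, is what a shared flat basin predicts; the double-well case shows what a failed connectivity check would look like.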
Weight‑perturbation robustness shows the merged model is consistently less sensitive to isotropic Gaussian noise ($\sigma\in\{0.01,0.05,0.1\}$) than the jointly trained model across epochs 1–3, indicating a flatter region around the merged optimum.
Displacement contraction follows from convexity of the norm: averaging weights pulls the merged parameters $\bar{\theta}_w$ toward the shared initialization $\theta^{(0)}$, and empirically the merged model stays 2–3× closer to $\theta^{(0)}$ throughout training.
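The contraction property is a direct consequence of the triangle inequality and can be illustrated on synthetic data. In the sketch below the branch displacements are random vectors (an assumption; real branches need not be random), so it demonstrates the inequality but does not reproduce the 2–3× factor reported above.

```python
import numpy as np

rng = np.random.default_rng(0)

theta0 = rng.normal(size=50)                                # shared initialization
branches = [theta0 + rng.normal(size=50) for _ in range(4)]  # synthetic branches
weights = np.array([0.4, 0.3, 0.2, 0.1])                     # convex merge weights

merged = sum(w * b for w, b in zip(weights, branches))

merged_disp = float(np.linalg.norm(merged - theta0))
branch_disps = np.array([np.linalg.norm(b - theta0) for b in branches])
mean_branch_disp = float(weights @ branch_disps)

print(f"merged: {merged_disp:.3f}, weighted mean of branches: {mean_branch_disp:.3f}")
```

The merged displacement never exceeds the weighted mean of the branch displacements, and the gap grows as the branch displacements point in more dissimilar directions.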
The training‑loss gap between joint and merged models widens monotonically.
Table 7 shows the gap growing from +0.489 at epoch 0.5 to +1.266 at epoch 6.0.
**Table 7.** Training loss on the full mixture over epochs.
Additional Analyses
Additional ablations quantify MERIT’s robustness, clustering choices, and merging operators.
MERIT’s conflict‑aware split relies on the observation that datasets whose gradients point in opposite directions naturally fall into separable regions, enabling a PCA‑based partition that groups compatible tasks.
MERIT (3D split, 8 groups) consistently outperforms Joint training across five random seeds.
Table 8 shows per‑run averages of 57.1 vs 54.5, with MERIT higher in every run.
MERIT‑2D retains its advantage over Joint FFT at the 7B scale.
Table 9 reports averages of 55.4 vs 54.9 across three seeds.
MERIT’s PCA‑based partitioning yields higher average scores than K‑means clustering.
Table 10 shows MERIT‑3D achieving an average of 42.7 versus 37.7 for K‑means (k = 8).
Using 200 calibration samples per dataset yields a cosine similarity of 0.847 to the full‑sample reference.
Table 11 reports mean cosine 0.847 (std 0.106) at n = 200, with diminishing returns beyond.
**Figure 3.** Post-hoc merging baselines applied to MERIT's 7B 2D branches (K=4). Each operator replaces MERIT's token-weighted averaging as a drop-in merge step on the same four branches. Bars show the mean over 3 seeds on the 8-benchmark suite; circles mark per-seed scores and error bars denote ±1 std. Token-weighted averaging outperforms all four alternatives; we attribute this to MERIT's branches being complementary by construction, so trimming- or orthogonalization-based operators can discard branch-specific content (Section 6.4).
Qualitative Analysis
Qualitative comparison shows MERIT reduces short‑answer collapse and improves multimodal reasoning.
Joint training on the 3B LLaVA‑Wild setting drops performance from 53.2 to 41.9–42.8, a symptom known as short‑answer collapse where the model emits overly terse replies under heterogeneous data. The same pattern appears at 7B scale, with Joint FFT reducing the score from 67.1 to 50.2, while MERIT retains an open‑ended quality of 66.2. MERIT avoids the collapse by partitioning conflicting subsets before fine‑tuning, leading to more stable, detailed answers (see Table 13).
Across the qualitative examples, joint‑trained models tend to give brief or generic answers, whereas MERIT variants produce richer, context‑aware responses. Increasing MERIT’s decomposition dimensionality (1D → 2D → 3D) often deepens answer detail, especially for multi‑step reasoning or named‑entity recall, as illustrated in Examples 1 and 2. This suggests that higher‑dimensional decomposition enhances multimodal reasoning capability.
**Table 13.** Additional qualitative examples from LLaVA-Wild.
Efficiency Analysis
MERIT’s wall‑clock and similarity costs are quantified across hardware configurations.
MERIT’s overall wall‑clock time is only 0.8 % higher than 1‑epoch joint training at the 7B scale.
MERIT requires 43 h 39 m versus 43 h 18 m for joint training (7B model on the A100 system).
The similarity matrix is reusable: adding m new datasets to an existing T‑dataset collection requires only O(T m) cross‑similarity computations plus a negligible PCA update, avoiding a full O((T + m)²) recomputation.
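The incremental update can be sketched as a block extension of the existing matrix. This is an illustration under assumptions: the unit-normalized gradients of the old datasets are assumed to be stored, the function name `extend_similarity` is hypothetical, and the PCA update itself is not shown.

```python
import numpy as np

def extend_similarity(C_old, G_old, G_new):
    """Extend a T x T cosine matrix with m new datasets without recomputing C_old."""
    cross = G_old @ G_new.T      # T x m cross-similarities (the O(T m) part)
    new_block = G_new @ G_new.T  # m x m similarities among the new datasets
    return np.block([[C_old, cross], [cross.T, new_block]])

def unit_rows(x):
    """Normalize each row to unit length so dot products are cosines."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Synthetic stand-in: T = 5 existing datasets, m = 2 new ones, 8-dim gradients.
rng = np.random.default_rng(0)
G_old = unit_rows(rng.normal(size=(5, 8)))
G_new = unit_rows(rng.normal(size=(2, 8)))

C_old = G_old @ G_old.T
C_ext = extend_similarity(C_old, G_old, G_new)

# Reference: full recomputation over all T + m datasets.
G_all = np.vstack([G_old, G_new])
C_full = G_all @ G_all.T
print(np.allclose(C_ext, C_full))
```

The extended matrix matches a full recomputation while reusing the existing $T \times T$ block unchanged.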
**Table 15.** Wall-clock time breakdown of MERIT and joint training. Experiments are conducted with a 3B model on the V100 system (136 datasets, 3D split, $K=8$) and a 7B model on the A100 system (176 datasets, 3D split, $K=8$, measured with sequential branch execution). The 7B robustness results in Table 9 use the 2D split ($K=4$) configuration.
**Table 1.** Performance comparison of different training methods across various learning rates $\eta$.
The table compares three methods (PCGrad, GradNorm, and MERIT) across three criteria: "When resolved", "Per-step overhead", and "Communication".
Scaling and Conclusion
MERIT consistently outperforms joint training across multimodal and text‑only benchmarks while enabling decentralized training.
MERIT improves the 8‑benchmark average on the base 7B VLM from 54.9 to 55.4.
Table 3, base VLM comparison with centralized Joint FFT.
With a stronger base model, MERIT raises the average from 60.9 to 61.5.
Table 3, scaled‑base VLM comparison under identical mixture.
MERIT preserves open‑ended generation quality (66.2) while Joint FFT collapses to 50.2.
Table 3, LLaVA‑Wild open‑ended scores (Appendix D.1).
MERIT improves MathVista reasoning from 39.4 to 43.0.
Table 3, reasoning‑heavy benchmark results.
MERIT‑2D (4 groups) attains the highest average of 58.4 on 66 FLAN tasks, a +0.1 gain over 2‑epoch joint training.
Table 4, Qwen2.5‑3B text‑only results.
Across all multimodal and text‑only experiments, MERIT’s conflict‑aware partitioning consistently yields higher averages than centralized joint fine‑tuning, while remaining competitive on individual benchmarks.
MERIT is designed for bandwidth‑constrained, decentralized environments: each branch fine‑tunes independently, eliminating step‑level synchronization, and a single token‑weighted merge completes training.
In summary, MERIT’s split‑and‑merge pipeline scales to 7 B‑parameter vision‑language models, works with stronger initializations, and generalizes to pure text instruction tuning, delivering consistent average improvements while enabling communication‑free parallel training.
Experimental Details
Detailed configurations for datasets, benchmarks, baselines, and implementation of MERIT experiments.
This appendix records the experimental configuration that underlies every result in the paper. All settings are shared unless a subsection explicitly overrides them.
Training data are split into vision‑language and text‑only streams. Vision‑language tuning uses Vision‑FLAN; we select 136 of its 187 tasks for the 3B experiments. Text‑only tuning uses the FLAN dataset with 66 instruction tasks.
Evaluation covers eight multimodal benchmarks (SeedBench, MMBench, LLaVA‑Wild, MMVet, TextVQA, AI2D, MathVista, MMMU) and eight text‑only benchmarks (MMLU, HellaSwag, WinoGrande, ARC‑C, HumanEval, BoolQ, GPQA, XNLI). A broader 11‑benchmark protocol is used for the clustering‑strategy ablation.
Four baselines are compared against MERIT. Joint training runs the full mixture for 0.5, 1, or 2 epochs with per‑device batch size tuned by grid search. Random Split partitions the data into 2, 4, or 8 groups and averages weights after independent training. Uniform Soup averages 2–4 independently trained models. Conflict‑Induced Split greedily groups tasks to maximize inter‑group gradient disagreement before independent training and averaging.
Language‑only experiments fine‑tune Qwen2.5‑3B, updating only the language model parameters while keeping all other settings identical to the MERIT pipeline.
Vision‑language experiments fine‑tune Qwen2.5‑VL‑3B‑Instruct. The vision encoder and multimodal projector are frozen; only the language‑model decoder is updated. Images are resized to a maximum of 784 × 784 pixels.
The 7B‑scale study builds a base vision‑language model by pairing the Qwen2.5‑VL encoder with the Qwen2.5‑7B‑Instruct language model, then follows a LLaVA‑style two‑stage recipe. A scaled variant adds a high‑quality knowledge‑learning stage before instruction tuning.
Both base checkpoints then serve as initializations for the subsequent full fine‑tuning (FFT) on a 1.6 M‑example multimodal mixture that aggregates 176 task‑unit sources.
Data sources for the 1.6 M mixture are publicly hosted on Hugging Face and include LVIS‑Instruct4V, Vision‑FLAN, The Cauldron, LLaVA‑Human‑Preference‑10K, LrvInstruction, GPT‑4V, and VLGuard.
All datasets are converted to a unified multimodal instruction format (image + instruction + response). When a source provides multiple task‑specific subsets, each subset is kept as a separate unit, yielding exactly 176 units in the final mixture.
Datasets are released under their original licenses; full training configurations and scripts will be open‑sourced. Multi‑seed reproducibility for the 7B results is reported in Appendix C.2.
Theoretical Analysis I
The appendix formalizes MERIT’s variance‑reduction, PCA‑aligned splitting, and feasibility of decentralized training.
MERIT‑3D finishes training on a V100 in 5 h 24 m, which is 3 h 16 m faster than a 2‑epoch joint run (8 h 40 m) and only 1 h 2 m slower than a single‑epoch joint run (4 h 22 m).
At the 7 B scale on an A100, MERIT adds just 21 minutes (0.8 %) over the 1‑epoch joint baseline (43 h 39 m vs. 43 h 18 m), while its one‑time preprocessing eliminates the need for synchronous gradient communication, making it attractive for decentralized or bandwidth‑limited settings.
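The timing comparisons above reduce to simple arithmetic, reproduced here as a quick check. The durations are the ones quoted in the text; the helper `minutes` is introduced only for this illustration.

```python
def minutes(hours, mins):
    """Convert an 'H h M m' duration to minutes."""
    return 60 * hours + mins

# 7B scale on A100: MERIT vs. 1-epoch joint training.
merit_7b = minutes(43, 39)
joint_7b = minutes(43, 18)
overhead_min = merit_7b - joint_7b
overhead_pct = 100.0 * overhead_min / joint_7b

# 3B scale on V100: MERIT-3D vs. 1-epoch and 2-epoch joint training.
merit_3b = minutes(5, 24)
joint_1ep = minutes(4, 22)
joint_2ep = minutes(8, 40)
saved_vs_2ep = joint_2ep - merit_3b   # minutes saved relative to the 2-epoch run
extra_vs_1ep = merit_3b - joint_1ep   # minutes added relative to the 1-epoch run

print(overhead_min, round(overhead_pct, 1), saved_vs_2ep, extra_vs_1ep)
```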
Centralized per‑step gradient conflict methods become infeasible at our scale (136 tasks, 3 B parameters): PCGrad would store ~816 GB of gradients (exceeding both V100×8 and A100×8 memory) and would take over 17 days per epoch, while GradNorm requires all 136 loss terms each step and has only been demonstrated on far smaller models.
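The PCGrad memory estimate can be reproduced with a back-of-the-envelope calculation. The sketch assumes one stored gradient per task at 2 bytes per parameter (16-bit precision) and per-GPU capacities of 32 GB for V100 and 80 GB for A100; these three values are assumptions chosen for illustration and are not stated in the text.

```python
num_tasks = 136
num_params = 3e9          # 3B trainable parameters
bytes_per_value = 2       # assumption: 16-bit gradients

# One full gradient copy per task, held simultaneously.
total_gb = num_tasks * num_params * bytes_per_value / 1e9

# Assumed aggregate accelerator memory of the two 8-GPU systems.
v100_node_gb = 8 * 32     # assumption: 32 GB per V100
a100_node_gb = 8 * 80     # assumption: 80 GB per A100

print(total_gb, v100_node_gb, a100_node_gb)
```

Under these assumptions the per-task gradient storage exceeds the total memory of either system, which is consistent with the infeasibility argument above.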