SCAIL-2: Unifying Controlled Character Animation with End-to-End In-Context Conditioning

Wenhao Yan, Fengjia Guo, Zhuoyi Yang, Jie Tang

SCAIL-2 enables end-to-end character animation by unifying tasks through in-context conditioning and bias-aware preference optimization.

How can we animate characters by directly conditioning on reference images and driving videos, bypassing the brittle intermediate pose-extraction steps used in prior work?

Existing character animation methods rely on intermediate representations like pose skeletons, which lose spatial information and fail during complex interactions or non-human driving. SCAIL-2 bypasses these intermediates by directly concatenating driving videos to the sequence, using in-context mask conditioning and mode-specific positional shifts to unify animation and replacement tasks. This end-to-end approach outperforms state-of-the-art models in cross-identity motion following and environment integration, achieving performance near proprietary services like Kling 3.0.

Paper Primer

The core mechanism hinges on a "reverse driving" training paradigm: the model is trained on synthetic pairs where a real video serves as the ground-truth target, while a re-synthesized version of that same video acts as the driving input. This allows the model to learn motion transfer without the information loss inherent in skeleton-based extraction.

To handle diverse sub-tasks, the model uses In-Context Mask Conditioning: it stacks additional channels as "binding slots" that explicitly route motion from specific driving characters to their corresponding targets, while an environment switch channel dictates whether the background should be derived from the reference or the driving source.

SCAIL-2 achieves superior motion consistency and identity isolation in zero-shot multi-character animation.

Subjective human evaluation (GSB) shows the model consistently outperforming baselines like Wan-Animate and MultiAnimate in identity consistency and physical plausibility. In multi-character scenarios, it achieves win rates of up to 93.3% against competitors on identity isolation.

Bias-Aware Direct Preference Optimization (DPO) effectively corrects fine-grained motion errors.

By constructing preference pairs that isolate pose-estimation errors in hand regions, the model refines articulation without requiring additional manual labels. The post-training scheme improves fine-grained details like finger articulation and mouth movement, even though the loss is computed primarily on hand masks.

Why is an end-to-end approach better than using pose skeletons?

Pose skeletons suffer from ambiguity in complex interactions and discard essential spatial information; end-to-end conditioning preserves the full visual context, including occlusions and environment details, which are critical for realistic multi-character animation.

What is the role of the "binding slots" in the model?

Binding slots are input channels that explicitly map motion from a driving character to a specific target character, preventing identity leakage when multiple characters move or swap positions in the frame.

Introduction and Motivation

We expose the brittleness of pose‑based pipelines and motivate end‑to‑end conditioning for character animation.

Prior character‑animation pipelines insert brittle intermediate representations—pose skeletons or masked backgrounds—that discard visual detail. This loss becomes severe in complex interactions, multi‑character scenes, or when the driver is non‑human, because the pose estimator is ambiguous or fails. We therefore propose an end‑to‑end conditioning paradigm that feeds the raw driving video directly into the model, preserving all visual cues and unifying diverse animation tasks.

The core issue is that discarding the full visual context via pose or mask intermediates deprives the model of essential spatial and temporal cues, limiting fidelity and generalization.

**Figure 1.** SCAIL-2 unifies various character animation tasks within an end-to-end paradigm.

Moving from pose‑based intermediates to end‑to‑end conditioning eliminates information loss and enables robust animation across diverse tasks.

End-to-End Architecture

We replace brittle pose intermediates with a direct conditioning mechanism that unifies animation tasks.

The latent diffusion backbone encodes an input video $x$ into $z_0=E(x)$ and then corrupts it with Gaussian noise $q(z_t\mid z_{t-1})=\mathcal{N}\bigl(z_t;\sqrt{1-\beta_t}\,z_{t-1},\beta_t I\bigr)$. Pose‑driven pipelines insert explicit pose intermediates, which break when occlusions or environment changes occur. This brittleness motivates a fully end‑to‑end conditioning that skips the pose step entirely.
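The forward noising step above can be sketched in a few lines. This is a minimal illustration, assuming the latent is flattened into a plain list of floats; the function name `forward_noise_step` and the flat-list layout are illustrative choices, not part of the paper.

```python
import math
import random


def forward_noise_step(z_prev, beta_t, rng=None):
    """Sample z_t ~ N(sqrt(1 - beta_t) * z_prev, beta_t * I) for a flat latent."""
    if not 0.0 <= beta_t < 1.0:
        raise ValueError("beta_t must lie in [0, 1)")
    rng = rng or random.Random()
    scale = math.sqrt(1.0 - beta_t)  # shrinks the previous latent
    sigma = math.sqrt(beta_t)        # standard deviation of the added noise
    return [scale * z + sigma * rng.gauss(0.0, 1.0) for z in z_prev]
```

With `beta_t = 0` the step returns the latent unchanged, and larger values of `beta_t` move the sample closer to pure Gaussian noise.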

Instead of extracting a pose sequence, the model ingests the raw reference image and driving video together, letting the diffusion process learn a direct mapping to the target animation.

How does this differ from a traditional pose‑driven pipeline that also conditions on a reference image?

Traditional pipelines first run a pose estimator $P(y)$, discarding raw pixel information and forcing a fixed skeleton representation. End‑to‑End Conditioning keeps the full latent video $z_{\text{driv}}$, so the model can attend to texture, occlusion, and background cues that a pose vector cannot capture.

Mask channel $Ch_0$ is set to 1 for background pixels and 0 for character pixels in both latents.

Character channel $Ch_1$ is set to 1 on the reference character region and 0 elsewhere.

The three tensors are concatenated along the channel dimension, forming a $3\times4\times4$ conditioning tensor.

The denoising network receives this tensor at timestep $t=50$ and predicts the noise $\epsilon_\theta$.

After 100 diffusion steps the decoder $D$ produces a video where the reference character follows the motion of $y$ while preserving background details.

The example shows that no explicit pose is ever computed; the model learns to align motion to the character purely from latent similarity.

The mask tells the model which parts of the latent belong to the environment and which belong to each character, enabling separate control of background and actors.

Why not simply concatenate the raw latents without a mask?

Without a mask the model cannot distinguish which latent regions should stay static (background) and which should be animated (characters). The mask provides an explicit separation, preventing the diffusion process from inadvertently altering the environment.
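As a rough sketch of the mask convention used in the example above ($Ch_0$ is 1 on background, each character channel is 1 on its character region), the channels for a single frame could be assembled as follows. The nested-list mask format and both function names are assumptions made for illustration.

```python
def environment_channel(character_masks):
    """Ch_0: 1 where no character mask is set (background), 0 on any character."""
    height, width = len(character_masks[0]), len(character_masks[0][0])
    return [
        [0 if any(m[y][x] for m in character_masks) else 1 for x in range(width)]
        for y in range(height)
    ]


def stack_mask_channels(character_masks):
    """Return [Ch_0, Ch_1, ...]: environment channel, then one channel per character."""
    return [environment_channel(character_masks)] + [
        [list(row) for row in m] for m in character_masks
    ]
```

Because $Ch_0$ is computed as the complement of the union of the character masks, every position is claimed either by the environment channel or by at least one character channel.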

Extract motion from the driving video and route it exclusively to the target character specified by the binding map $\pi$.

What would happen if the binding map $\pi$ were omitted?

Without $\pi$, the model would have no way to know which target character should receive which motion, leading to blended or swapped motions across characters, especially in multi‑character scenarios.
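One way to picture the binding map is as a lookup that places each driving character and its target in the same slot index. The sketch below assumes $\pi$ is a dictionary from driving-character index to target-character index; `bind_slots` and its arguments are hypothetical names, and the paper's actual channel layout may differ.

```python
def bind_slots(driving_masks, reference_masks, pi, num_slots):
    """Place each bound (driving, target) pair into the same slot index.

    pi maps a driving-character index to its target-character index.
    Returns (driving_slots, reference_slots); an unused slot is None here,
    where a real model would use an all-zero channel.
    """
    if len(pi) > num_slots:
        raise ValueError("more bound pairs than available slots")
    if len(set(pi.values())) != len(pi):
        raise ValueError("each target character may be bound only once")
    driving_slots = [None] * num_slots
    reference_slots = [None] * num_slots
    for slot, (driver, target) in enumerate(sorted(pi.items())):
        driving_slots[slot] = driving_masks[driver]
        reference_slots[slot] = reference_masks[target]
    return driving_slots, reference_slots
```

Sharing a slot index is what tells the model which motion belongs to which target, so two characters that swap positions in the frame still keep their own bindings.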

Read the chosen environment source $E$ and seamlessly blend characters into that background.

Why is a separate environment mask needed instead of letting the model infer background implicitly?

The mask gives the model an explicit signal about which voxels belong to the static background, preventing it from accidentally altering background textures while focusing on character motion.

Separate pose information from identity so that any driving motion can be applied to any target character without leaking identity cues.

What failure mode does Universal Transfer address?

Without this objective, the generated video may retain subtle facial features or clothing textures from the driving character, causing identity leakage into the target character.

Using the three objectives we define two concrete sub‑tasks. In Character Image Animation the environment source is the reference ($E=E_I$) and each driving character transfers its motion to a bound target. In Character Replacement the environment source is the driving video ($E=E_y$), allowing the background to follow the driving scene while swapping characters.

**Figure 2.** SCAIL-2 adopts an end-to-end driving paradigm to bypass unreliable animation intermediates.

**Figure 4.** Overview of our model architecture and the context mask signal. $Ch_0$ means the mask channel for environment control, while $Ch_1$ to $Ch_K$ denote $K$ channels for character binding. The environment mask is the complement of the union of either the driving or the reference character masks.

The synthetic animation loop creates training pairs without manual annotation. Each iteration samples a driving video, selects a candidate reference image, weaves a prompt, generates a synthetic video with model $M$, and validates it with a quality checker before storing the pair.

Sample a driving video $y$ from the dataset.

Run the Candidate Selector to pick the most compatible reference image $I$.

Compose a textual prompt describing the desired character, pose, and background.

Feed the prompt and the first frame of $y$ into the multi‑reference generator $M$ to obtain a synthetic video $\tilde{y}$.

Apply the Quality Checker to filter out low‑fidelity outputs; if rejected, iterate the prompt.

Store the accepted pair $(I,\tilde{y})$ for end‑to‑end training.

Why are multiple editing turns required after the initial generation?

Because the generator $M$ may produce artifacts (e.g., mismatched lighting or stray background elements). Iterative prompt refinement lets the Quality Checker steer $M$ toward a cleaner composition that respects both character and environment constraints.
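The loop described above can be sketched as a single function. Every callable here (`select_reference`, `weave_prompt`, `generate`, `passes_quality_check`) is a hypothetical stand-in for the Candidate Selector, prompt composer, generator $M$, and Quality Checker, and the retry budget `max_turns` is an assumption rather than a value from the paper.

```python
def build_training_pair(driving_video, select_reference, weave_prompt,
                        generate, passes_quality_check, max_turns=3):
    """Run one iteration of the loop; return (reference, synthetic_video) or None."""
    reference = select_reference(driving_video)  # Candidate Selector picks I
    first_frame = driving_video[0]
    for turn in range(max_turns):
        # A rejected output triggers a new prompt on the next turn.
        prompt = weave_prompt(reference, driving_video, turn)
        synthetic = generate(prompt, first_frame)
        if passes_quality_check(synthetic):
            return reference, synthetic  # accepted pair (I, synthetic video)
    return None  # nothing passed the checker within the budget
```

Returning `None` after the budget is spent keeps low-fidelity outputs out of the training set instead of storing a best-effort sample.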

Animation Generation and Selection

We detail the Moon Pool selector and supporting in‑context mechanisms that enable end‑to‑end conditioning.

Pose‑driven pipelines rely on brittle intermediate pose representations, which hampers scalability and robustness.

The selector acts like a matchmaker that, given a reference posture frame, queries a vision‑language model to pick the most compatible character from a pool.

Compute cosine similarity: $s_1=0.68$, $s_2=0.99$, $s_3=0.36$.

Select $C_2$ because $s_2$ is maximal.

Return $C_2$ as the Moon Pool candidate for subsequent processing.

The selector prefers the character whose semantic description aligns best with the reference posture, even if raw pixel similarity would favor a different candidate.
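The selection step in the example reduces to a cosine similarity followed by an argmax. The sketch below assumes each candidate and the reference posture have already been mapped to embedding vectors by some vision-language model; how those embeddings are obtained is not specified here, and both function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def select_candidate(scores):
    """Return the zero-based index of the highest-scoring candidate."""
    return max(range(len(scores)), key=scores.__getitem__)
```

With the scores from the example, `select_candidate([0.68, 0.99, 0.36])` returns index 1, which corresponds to $C_2$.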

How does the Moon Pool Selector differ from a simple nearest‑neighbor search on raw pixels?

Nearest‑neighbor compares low‑level pixel values, which are sensitive to lighting and background. The Moon Pool Selector uses a vision‑language model that evaluates high‑level semantic similarity, making the match robust to visual variations.

Pose‑driven baselines generate animation by extracting pose parameters from the driving video and explicitly transferring them to the reference character.

Mask conditioning augments the visual tokens with extra channels that tell the model which regions belong to which sub‑task, acting like a layered stencil that guides attention.

Our model follows the In‑Context Driving design: the concatenated token stream $[z_{\text{ref}};z_t;z_{\text{driv}}]$ is fed to an Image‑to‑Video (I2V) backbone.
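To make the token order concrete, the sketch below enumerates the positions of the concatenated stream $[z_{\text{ref}};z_t;z_{\text{driv}}]$. It assumes the reference image contributes a single latent frame and that the driving latent has the same shape as the noisy video latent; it does not reproduce the RoPE coordinates of Table 1, and the function names are illustrative.

```python
def flatten_positions(num_frames, height, width, source):
    """Enumerate (source, t, h, w) positions of a latent in raster order."""
    return [
        (source, t, h, w)
        for t in range(num_frames)
        for h in range(height)
        for w in range(width)
    ]


def build_token_stream(video_shape):
    """Token order [z_ref; z_t; z_driv] for a video latent of shape (T, H, W)."""
    frames, height, width = video_shape
    return (
        flatten_positions(1, height, width, "ref")          # reference image
        + flatten_positions(frames, height, width, "noisy")  # latent being denoised
        + flatten_positions(frames, height, width, "driv")   # driving video
    )
```

Tagging each token with its source makes it straightforward to assign different positional offsets to the three segments before attention.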

**Figure 3.** The overview of our synthetic pipeline for curating diverse high-quality cross-identity motion pairs.

**Table 1.** 3D RoPE coordinates assigned to $z_{ref}$, $z_t$ and $z_{driv}$. The video latent has shape $T_v \times H_v \times W_v$.

Reverse Driving synthesizes a driving video $\tilde{y}$ from a real video $y$ by re‑creating characters via pose transfer or one‑by‑one replacement; the synthetic $\tilde{y}$ then drives the denoising model while $y$ serves as the clean target $x$.

Implementation and Training

Key quantitative gains of the 14B I2V model across video animation benchmarks.

We train a 14B Image-to-Video (I2V) backbone on 64 H100 GPUs for one week with in‑context conditioning using $K=6$ character‑binding channels, then apply Direct Preference Optimization (DPO) post‑training.

The model undergoes two‑stage training: large‑scale pre‑training followed by preference‑aligned fine‑tuning.

Performance is measured both objectively (pixel‑level similarity, perceptual distance, video fidelity) and subjectively (human judgment of identity preservation).

Our method attains an SSIM of 0.6453, exceeding the next‑best baseline by 0.0046.

Ours SSIM = 0.6453; SCAIL SSIM = 0.6407 (Table 1).

These results demonstrate that end‑to‑end conditioning delivers consistent gains across both perceptual similarity and video fidelity metrics.

Performance Evaluation

Human and automatic evaluations show SCAIL‑2 outperforms prior methods on single‑ and multi‑character animation.

Recall that SCAIL‑2 replaces brittle pose‑driven pipelines with an end‑to‑end conditioning framework that maps reference images and driving videos directly to motion.

SCAIL‑2 wins 68.3 % of the pairwise comparisons on Motion Consistency in single‑character human evaluation, beating the closest baseline by 23.3 %.

Figure 5 (human evaluation on Studio‑Bench) reports a 68.3 % win rate for SCAIL‑2 versus SCAIL and a 23.3 % advantage over the latter.

On X‑dance, SCAIL‑2 attains the highest Imaging Quality score (4.43), improving over the next best method by 0.16.

Table 3 (Video‑Bench automatic evaluation) lists 4.43 for “Ours” versus 4.27 for the runner‑up SCAIL.

**Figure 5.** Single-character human evaluation on Studio-Bench. Kling 3.0 denotes Kling 3.0 Motion Control (Team et al. 2026).

**Figure 6.** Multi-character human evaluation on Studio-Bench.

**Figure 8.** Qualitative comparison against baselines under cross-identity inputs.

Ablation Studies

Ablation studies reveal which modules and data are essential for realistic animation.

This section systematically removes each architectural component or data source and reports the resulting degradation, confirming that every element contributes to the overall visual fidelity.

**Figure 9.** Ablation studies on network modules and data.

Additional Quantitative Ablations

Quantitative ablations reveal each module’s impact on multi‑character animation quality.

We evaluate the contribution of Binding Slots and Replacement Data on the cross‑identity multi‑character animation benchmark of Studio‑Bench. Table 5 reports the results for three ablations.

**Table 5.** Quantitative ablations on multi-character animation.

Removing Binding Slots reduces Appearance Consistency.

Appearance Consistency drops from 4.13 to 4.10 when the slots are omitted.

Removing Replacement Data lowers Imaging Quality.

Imaging Quality falls from 4.63 to 4.17 when the replacement data is removed.

Bias-Aware Preference Optimization

Bias-Aware DPO trains the model to correct pose‑estimation errors by treating them as negative preferences.

Pose estimators often mis‑articulate fine hand movements, and those errors propagate through the synthetic pipeline, biasing the end‑to‑end training data. Bias‑Aware DPO injects these errors as explicit negative preferences so the model learns to correct them during post‑training.

We turn the systematic hand‑joint mistakes of pose estimators into a learning signal: the model is rewarded for preferring a clean synthetic video over one that has been corrupted by an extra error round.

Compute $r = G(p, R)$ → red video with hand angle $0.9$.

Compute $s = G(p, S)$ → blue video with identical hand angle.

Extract a pose from $r$: $P(r) = (0.85, 0.15)$ (slight degradation).

Regenerate $r^{-} = G(P(r), R)$ → red video where the hand angle is now $0.85$, i.e., a subtle drift.

Form the preference tuple $(s, R, r, r^{-})$; DPO will reward the model for producing $r$ over $r^{-}$.

The extra error manifests as a small shift in the hand joint, which the model learns to suppress, improving fine‑grained motion fidelity.
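The steps above can be written as a small toy program. Here a video is reduced to a character identity plus a single hand angle, and `generate` and `estimate_pose` are hypothetical stand-ins for $G$ and $P$; the real pipeline operates on full videos and pose sequences.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Video:
    reference: str     # identity of the rendered character, e.g. "R" or "S"
    hand_angle: float  # toy stand-in for fine-grained hand articulation


def build_preference_tuple(pose, driving_ref, target_ref, generate, estimate_pose):
    """Return (s, R, r, r_minus) following the steps above."""
    r = generate(pose, target_ref)                 # preferred sample
    s = generate(pose, driving_ref)                # driving video, same motion
    degraded_pose = estimate_pose(r)               # extra extraction round adds error
    r_minus = generate(degraded_pose, target_ref)  # rejected sample
    return s, target_ref, r, r_minus
```

The preferred and rejected samples share the same identity and global motion, so the only systematic difference between them is the drift introduced by the extra pose-estimation round.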

**Figure 10.** Qualitative comparison of our Bias-Aware DPO against the SFT variant and the base model.

How does Bias‑Aware DPO differ from standard DPO?

Standard DPO learns from preference pairs, typically labeled by humans, while a frozen reference model keeps the trained model from drifting. Bias‑Aware DPO keeps that objective but constructs the rejected sample automatically by re‑applying the pose pipeline, so the model learns to undo the systematic hand‑joint errors without additional manual labels.

Limitations and Discussion

We discuss the data dependence limits and broader implications of our end‑to‑end framework.

Our end‑to‑end design feeds the model complete visual information, but it fundamentally relies on large‑scale, high‑quality paired training data. The synthetic pipeline mitigates data scarcity, yet the fidelity of generated pairs still depends on the capability of the underlying generators.

We employ Bias‑Aware DPO to model preference against bias, but obtaining reliable positive samples for fine‑grained regions remains difficult. Improving generator quality or pipeline efficiency could enable higher‑fidelity data for tasks such as lip‑syncing and detailed facial expressions.

The first source of gain comes from end‑to‑end training itself: a DiT equipped with strong priors extracts and converts information from visual contexts, allowing the model to generalize to a broader range of zero‑shot inputs.

The second gain stems from the unified reverse‑driving pipeline. Reverse driving and concept decoupling let the model isolate distinct information types and develop strong compositional abilities. Because real videos provide authentic supervision, optimization is steered toward plausible composition, which lets the model surpass the synthetic generators used to build its training data.

Our framework is positioned to benefit from future advances in data synthesis and supervision strategies rather than being rendered obsolete by them.

**Figure 7.** Human evaluation on Studio-Bench for character replacement.

**Table 4.** Composition of **MotionPair-60K**, along with the additional pose-driven dataset, and their corresponding sampling ratios used during training.

**Figure 12.** Examples of complex cross-body-shape character image animation. Our method maintains decent character consistency under complex motions.

**Figure 13.** Examples requiring fine-grained HOI. Our method simultaneously preserves correct character identity and fine-grained objects (e.g., thin sticks) during interaction. Zoom-in for better details.

**Figure 14.** Examples of complex multi-character interactions. Our method accurately captures the interaction relationships among multiple characters with proper identity isolation.

Preference Dataset Details

Appendix B details the preference dataset and bias‑aware DPO fine‑tuning pipeline.

B.1 Details of Preference Dataset – we synthesize a positive sample $r$ using the high‑fidelity estimator SDPose and a negative sample $r^{-}$ by running the same extraction passes with the lower‑quality ViTPose, keeping the reference image $R$ and global motion identical.

B.2 Bias‑Aware DPO Implementation – the DPO objective compares flow‑matching errors of a trainable model $v_{\theta}$ against a frozen reference $v_{\text{ref}}$, but only inside hand regions defined by a mask $M$.
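A minimal sketch of such a masked objective follows, assuming a Diffusion-DPO-style formulation in which each error is a mean squared flow-matching error restricted to the hand mask $M$. The exact weighting and timestep handling used in the paper are not reproduced, and the function names are illustrative.

```python
import math


def masked_error(prediction, target, mask):
    """Mean squared error restricted to positions where mask is 1."""
    selected = [(p - t) ** 2 for p, t, m in zip(prediction, target, mask) if m]
    if not selected:
        raise ValueError("mask selects no positions")
    return sum(selected) / len(selected)


def bias_aware_dpo_loss(err_theta_win, err_ref_win, err_theta_lose, err_ref_lose, beta):
    """DPO-style loss computed from four masked flow-matching errors.

    The loss falls when the trainable model improves on the preferred sample
    more than on the rejected one, relative to the frozen reference.
    """
    margin = (err_theta_win - err_ref_win) - (err_theta_lose - err_ref_lose)
    x = beta * margin
    # -log(sigmoid(-x)) = log(1 + exp(x)), written in a numerically stable form
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

When the trainable model matches the reference on both samples the loss equals $\log 2$. The stable softplus form matters here because a temperature as large as $\beta=5000$ would overflow a naive `exp`.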

Training details – we freeze the backbone and fine‑tune only LoRA adapters (rank 128) with learning rate $1\!\times\!10^{-4}$, batch size 24, and DPO temperature $\beta=5000$.

Discussion – although the mask $M$ up‑weights hand errors, the gradient updates the entire model, which also improves other fine‑grained regions such as the mouth, outperforming a weighted‑SFT baseline that only fits positives.

**Figure 11.** Distribution of data source.

Evaluation Metrics

Evaluation metrics and protocols used for quantitative and human assessment.

We first report pose‑driven quantitative metrics that compare generated motion against ground‑truth poses. These metrics are standard in the field and provide an objective baseline for video quality.

Four widely used image‑video quality measures are employed: PSNR, SSIM, LPIPS, and FVD. For the mesh shown in Table 2 we use the standard MHR‑format Grey Mesh as the reference geometry.
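Of these four measures, PSNR is the simplest to state. The sketch below computes it for two equally sized frames given as flat lists of pixel values in $[0, 1]$; the input format and the `max_value` default are assumptions, and the other three metrics need dedicated implementations.

```python
import math


def psnr(reference, generated, max_value=1.0):
    """Peak signal-to-noise ratio in dB for two equally sized flat pixel lists."""
    if len(reference) != len(generated) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, generated)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A uniform error of 0.1 on a $[0, 1]$ scale gives a mean squared error of 0.01 and therefore a PSNR of about 20 dB.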

Human evaluation focuses on cross‑identity scenarios. Motion Accuracy quantifies frame‑by‑frame adherence to the driving signal, while Identity Consistency checks that the subject’s appearance matches the reference image.

For single‑character animation we assess Physical Plausibility, penalizing violations of gravity, support, or momentum, and disallowing artifacts such as hovering or body penetration.

In multi‑character scenes we measure Identity Isolation to ensure limbs and clothing of different characters remain distinct and do not merge unnaturally.

For replacement scenarios we evaluate Environment Integration, checking that newly inserted characters blend naturally with the original scene and preserve background and object interactions.

When the original video quality is a reliable proxy, we adopt Video‑Bench’s human‑aligned automatic protocol (as used in DreamActor‑M2). It scores four perceptual dimensions (Imaging Quality, Motion Smoothness, Temporal Consistency, and Appearance Consistency) on a 1–5 scale.
