Latent Reasoning with Normalizing Flows
NF-CoT replaces verbose textual reasoning with continuous latent states modeled by autoregressive normalizing flows.
Can we replace discrete, serial chain-of-thought text with continuous latent states generated by normalizing flows to improve reasoning efficiency and diversity?
Chain-of-thought reasoning forces models to verbalize every intermediate step as text, which is computationally expensive and ties reasoning to the limitations of natural language tokens. NF-CoT introduces a latent reasoning framework that models continuous thoughts using autoregressive normalizing flows within the language model's own causal stream. This allows the model to sample and score reasoning trajectories as compact continuous states before decoding the final answer. On code generation benchmarks, this approach improves pass rates over explicit text-based reasoning while significantly reducing the compute cost of intermediate reasoning steps.
Paper Primer
NF-CoT treats continuous thoughts as first-class citizens by placing a normalizing flow head directly inside the LLM backbone. The core move is to reparameterize latent thoughts into a space that supports exact likelihood estimation and left-to-right autoregressive sampling, allowing the model to generate reasoning and answers in a single causal pass.
NF-CoT significantly outperforms explicit chain-of-thought and prior latent-reasoning baselines on code generation tasks.
On the Qwen3-8B-Base backbone, NF-CoT (Unified) improved average pass@1 from 55.8% to 68.8%.
The method achieves substantial gains in computational efficiency compared to diffusion-based latent reasoners.
NF-CoT (Unified) is 2.70x faster at latent generation and 2.48x cheaper in per-sample compute than the LaDiR baseline.
Why use normalizing flows instead of the diffusion models used in other latent reasoning papers?
Diffusion models require iterative denoising steps, which are computationally expensive and lack a native left-to-right likelihood interface. Normalizing flows provide exact likelihoods and allow for efficient, single-pass autoregressive sampling that integrates seamlessly with the LLM's existing KV cache.
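The exact-likelihood property follows from the change-of-variables formula. As a generic illustration (the symbols $f_\theta$, $\mu_i$, and $\sigma_i$ are standard notation for an affine autoregressive flow and are assumed here, not taken from the paper; each $x_i$ is treated as a scalar for simplicity), a flow that maps a latent thought sequence $x$ to noise $z = f_\theta(x)$ gives:

$$
\log p_\theta(x) = \log p_Z\big(f_\theta(x)\big) + \log\left|\det\frac{\partial f_\theta(x)}{\partial x}\right|, \qquad z_i = \frac{x_i - \mu_i(x_{<i})}{\sigma_i(x_{<i})}.
$$

Because each $z_i$ depends only on $x_{\le i}$, the Jacobian is triangular and the log-determinant reduces to $-\sum_i \log \sigma_i(x_{<i})$, so the likelihood costs one causal pass. Sampling inverts the same map left to right as $x_i = \mu_i(x_{<i}) + \sigma_i(x_{<i})\, z_i$.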
Does this method still produce human-readable reasoning traces?
No. The continuous latent states are not human-readable; the authors treat decoded latent CoTs only as qualitative probes rather than faithful natural-language explanations of the model's internal reasoning.
The Bottleneck of Discrete Reasoning
Why textual chain‑of‑thought is a serial bottleneck and how NF‑CoT proposes a continuous alternative.
Explicit chain‑of‑thought forces every reasoning step to be verbalized as a token before the model can continue, creating a low‑bandwidth, serial bottleneck. This makes long, complex reasoning costly and ties intermediate computation to surface text.
Textual CoT forces reasoning through a sparse, sequential token stream, inflating cost and preventing non‑serial updates; a continuous latent stream could carry richer information per step.
**Figure 1.** Four paradigms for chain-of-thought reasoning. Explicit CoT: discrete text tokens. Coconut: deterministic hidden states. LaDiR: iteratively denoised latents using diffusion. NF-CoT (ours): AR-sampled continuous thoughts.
Discrete CoT limits reasoning to serial token generation.
Empirical Performance and Scaling
NF‑CoT delivers large pass@1 gains and faster latent generation.
NF‑CoT (Unified) achieves the best average pass@1, raising Qwen3‑8B‑Base from 55.8% to 68.8% (+13.0 percentage points).
Table 1 shows NF‑CoT (Unified) leading across all five code benchmarks, with the largest margin over the strongest open‑source baseline.
**Figure 3.** Pass@k scaling on MBPP+ and HumanEval+. NF-CoT outperforms the base model and LaDiR, and continues to improve with larger k.
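For reference, pass@k curves like these are commonly computed with the unbiased estimator $1 - \binom{n-c}{k}/\binom{n}{k}$, where $n$ samples are drawn per problem and $c$ of them pass. The sketch below assumes that estimator; this summary does not state which one the authors used, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c pass the tests."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    # Probability that a random size-k subset contains no passing sample.
    all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - all_fail
```

With $k=1$ the estimator reduces to the plain pass rate $c/n$, and it grows toward 1 as $k$ increases whenever at least one sample passes.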
**Figure 4.** Pass@k diversity before and after reinforcement learning on MBPP+ and HumanEval+. Top row: standard token-space GRPO improves the low-k region but saturates at larger k, failing to provide the same large-sample coverage as the base model. Bottom row: latent-space RL for NF-CoT improves pass@1 while preserving the upward pass@k scaling trend, indicating that policy-gradient refinement in continuous-CoT space does not collapse the latent trajectory distribution.
**Figure 5.** Effect of continuous-CoT perturbation strength on HumanEval generation. As $\sigma$ increases, the perturbed continuous-CoT trajectory becomes nearly orthogonal to the original trajectory, and exact-text match drops sharply. Nevertheless, pass@1 remains nearly flat, indicating that local perturbations change the form of the generated program much more than its functional correctness, even at the largest perturbation strengths tested.
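The near-orthogonality at large $\sigma$ is what high-dimensional geometry would predict. The sketch below assumes isotropic Gaussian noise added to a single $D=2560$ latent vector with unit-variance entries; the paper's actual perturbation scheme is not described in this summary, and `similarity_after_noise` is a hypothetical helper.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_after_noise(sigma, dim=2560, seed=0):
    """Cosine between a random latent and its Gaussian-perturbed copy."""
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # Assumption: perturbation is i.i.d. N(0, sigma^2) per coordinate.
    perturbed = [x + sigma * rng.gauss(0.0, 1.0) for x in latent]
    return cosine(latent, perturbed)


if __name__ == "__main__":
    for sigma in (0.0, 0.5, 2.0, 8.0):
        print(f"sigma={sigma}: cosine={similarity_after_noise(sigma):.3f}")
```

Under these assumptions the cosine concentrates around $1/\sqrt{1+\sigma^2}$, so it decays toward zero as $\sigma$ grows even though the perturbed vector still contains the original latent.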
**Figure 6.** Pairwise output similarity among passing HumanEval programs. Each panel contains 32 = 4 × 8 passing samples from four HumanEval problems, with intra-task blocks on the diagonal and cross-task entries set to zero to isolate within-prompt structure. Similarity is computed using a structure-aware AST metric; lower values indicate higher diversity. NF-CoT achieves lower mean intra-prompt similarity than the Qwen3-8B-Base teacher, indicating more structurally diverse passing solutions under the same answer temperature.
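To make the idea of a structure-aware similarity concrete, here is a minimal sketch that compares programs by their sequences of AST node types. This is an assumed stand-in, not the metric used for Figure 6, whose exact definition is not given in this summary; `node_types`, `ast_similarity`, and `mean_pairwise_similarity` are illustrative names.

```python
import ast
from difflib import SequenceMatcher
from itertools import combinations


def node_types(source):
    """Flatten a program into its sequence of AST node type names."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


def ast_similarity(a, b):
    """Similarity in [0, 1]; identifiers and literal values are ignored."""
    matcher = SequenceMatcher(None, node_types(a), node_types(b), autojunk=False)
    return matcher.ratio()


def mean_pairwise_similarity(programs):
    """Average similarity over all unordered pairs (needs >= 2 programs)."""
    pairs = list(combinations(programs, 2))
    return sum(ast_similarity(a, b) for a, b in pairs) / len(pairs)


loop_version = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
renamed_loop = "def add_all(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc\n"
builtin_version = "def total(xs):\n    return sum(xs)\n"
```

Two programs that differ only in variable names score 1.0 under this sketch, while a loop and a call to `sum` score lower, which is the kind of distinction a text-level metric would miss.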
NF‑CoT outperforms discrete CoT and LaDiR on code benchmarks.
Related Work and Reasoning Paradigms
NF‑CoT swaps discrete token steps for continuous latent states generated by a normalizing flow, easing the low‑bandwidth token bottleneck of standard chain‑of‑thought.
Standard chain‑of‑thought (CoT) forces the model to verbalize each reasoning step as text tokens emitted strictly left to right. NF‑CoT replaces that token sequence with continuous latent states sampled autoregressively from a normalizing flow, so each step can carry richer intermediate information than a single token.
Discrete CoT generates a textual reasoning chain, where each intermediate thought is a token that the model must process in order.
LaDiR is a diffusion‑based latent‑reasoning baseline that produces latent thoughts by iterative denoising, incurring extra computation from its denoising steps.
We now situate NF‑CoT among prior approaches to chain‑of‑thought and latent reasoning.
Chain‑of‑thought prompting: Prompting technique that elicits intermediate reasoning steps as text before producing the final answer.
Continuous latent reasoning: Approach that feeds continuous latent vectors back into the model instead of discrete tokens, aiming to capture richer intermediate representations.
Soft‑token reasoning: Uses the Gumbel‑Softmax trick to turn discrete token choices into differentiable soft embeddings, enabling stochastic latent reasoning.
VAE‑based latent reasoning: Variational auto‑encoder framework that learns a latent distribution; some works denoise this space to guide reasoning.
Autoregressive normalizing flows: Invertible flow models that map a simple base distribution to sequences, preserving the causal order of Transformers.
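As an illustration of the Gumbel‑Softmax idea mentioned in this list, the sketch below draws relaxed one-hot weights over a tiny vocabulary and mixes embedding rows with them. The vocabulary size, the embedding table, and the helper names are all hypothetical; this is the generic trick, not any specific paper's implementation.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Sample relaxed one-hot weights: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)  # guard against log(0)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def soft_embedding(weights, embedding_table):
    """Convex combination of embedding rows, one row per vocabulary item."""
    dim = len(embedding_table[0])
    return [
        sum(w * row[d] for w, row in zip(weights, embedding_table))
        for d in range(dim)
    ]
```

Lower values of `tau` push the weights toward a hard one-hot choice, while higher values spread mass across tokens and yield a smoother mixed embedding.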
Caveats and Failure Modes
Qualitative Reasoning Examples
Ablation study comparing dual‑path and unified‑path NF‑CoT variants.
We compare two concrete design choices for NF‑CoT: a dual‑path variant that runs the backbone twice per step, and a unified‑path variant that collapses both passes into a single causal sequence.
In the dual‑path variant, the model runs the LLM backbone twice per gradient step, keeping separate projectors and boundary tokens for the flow path and the cross‑entropy (CE) path.
Forward 1: backbone → hidden₁; `flow_projector(hidden₁)` → latent flow states; NF negative log‑likelihood (NLL) computed.
Forward 2: backbone (same weights) → hidden₂; `vae.latent_to_decoder(hidden₂)` → latent CE states; CE loss computed.
Gradients from both losses are summed and back‑propagated through the shared backbone.
The second forward doubles compute but isolates the two losses, allowing independent tuning of the flow and CE projectors.
Why not share a single projector for both paths?
Sharing would couple the NF density estimation and the CE prediction, preventing the model from learning distinct representations needed for accurate likelihood modeling versus token prediction. The dual‑path design keeps these objectives orthogonal while still reusing the backbone.
In the unified‑path variant, the model merges the two passes into one causal sequence, eliminating the second backbone forward.
Both losses are summed; gradients flow back through the single backbone forward.
By sharing the backbone forward the model cuts compute roughly in half while preserving the representational capacity of the flow.
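The compute difference between the two variants can be sketched with a toy backbone that counts its forward passes. Everything here is a hypothetical stand-in: `CountingBackbone`, the two placeholder losses, and the fake hidden states only illustrate the control flow (two forwards versus one, losses summed), not the paper's actual modules.

```python
class CountingBackbone:
    """Stand-in for the shared LLM backbone; records how often it is run."""

    def __init__(self):
        self.calls = 0

    def __call__(self, tokens):
        self.calls += 1
        return [float(t) for t in tokens]  # fake hidden states


def flow_nll(hidden):
    """Placeholder for the flow negative log-likelihood."""
    return 0.5 * sum(h * h for h in hidden) / len(hidden)


def ce_loss(hidden):
    """Placeholder for the cross-entropy loss."""
    return sum(abs(h) for h in hidden) / len(hidden)


def dual_path_loss(backbone, flow_tokens, ce_tokens):
    hidden_1 = backbone(flow_tokens)  # forward 1 feeds the flow loss
    hidden_2 = backbone(ce_tokens)    # forward 2 feeds the CE loss
    return flow_nll(hidden_1) + ce_loss(hidden_2)


def unified_path_loss(backbone, tokens):
    hidden = backbone(tokens)         # single forward shared by both losses
    return flow_nll(hidden) + ce_loss(hidden)
```

In both functions the two losses are summed before backpropagation; the only structural change is whether the backbone is evaluated once or twice.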
Does merging the paths hurt performance because the same hidden states serve two different losses?
The paper reports that the unified‑path variant matches the dual‑path accuracy while halving inference time, indicating that the invertible MetaBlocks successfully decouple the two objectives despite sharing hidden states.
Both variants use the same backbone (Qwen3‑8B‑Base), latent geometry ($N=64$, $D=2560$), and two‑stage curriculum, differing only in the conditioning sequence and which projectors are trainable in Stage 1.
Training hyperparameters include a frozen VAE encoder, dequantization noise $\sigma_{dq}=0.3$, and a flow‑density head built from five invertible MetaBlocks with channel width 2048 and head dimension 64.
Inference proceeds in two phases. Phase 1 samples latent noise $z\sim\mathcal{N}(0,T^{2})$ and runs the NF reverse pass with KV‑cache reuse to produce the latent thoughts. Phase 2 feeds the resulting latent prefix to the backbone via vLLM for answer decoding.
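Phase 1 can be illustrated with a toy affine autoregressive flow over scalar latents. The `conditioner` below is a made-up function standing in for the causal network, and it recomputes the prefix at every step where a real system would reuse a KV cache; the sketch only shows temperature-scaled noise being inverted left to right and then recovered exactly by the forward pass.

```python
import math
import random


def conditioner(prefix):
    """Toy stand-in: maps x_{<i} to (mu_i, log_sigma_i)."""
    if not prefix:
        return 0.0, 0.0
    mean = sum(prefix) / len(prefix)
    return 0.5 * mean, 0.1 * math.tanh(mean)


def sample(n_steps, temperature, rng):
    """Reverse pass: draw z_i ~ N(0, T^2), then emit x_i left to right."""
    xs, zs = [], []
    for _ in range(n_steps):
        mu, log_sigma = conditioner(xs)
        z = temperature * rng.gauss(0.0, 1.0)
        xs.append(mu + math.exp(log_sigma) * z)
        zs.append(z)
    return xs, zs


def encode(xs):
    """Forward pass: recover z from x and accumulate log|det J|."""
    zs, log_det = [], 0.0
    for i, x in enumerate(xs):
        mu, log_sigma = conditioner(xs[:i])
        zs.append((x - mu) * math.exp(-log_sigma))
        log_det -= log_sigma
    return zs, log_det


def log_likelihood(xs):
    """Exact log-density under a standard normal base distribution."""
    zs, log_det = encode(xs)
    log_base = sum(-0.5 * z * z - 0.5 * math.log(2.0 * math.pi) for z in zs)
    return log_base + log_det
```

Calling `sample(64, 0.8, random.Random(0))` yields 64 latent steps, matching the $N=64$ geometry quoted above only in length; a lower temperature shrinks the noise and concentrates samples near the flow's mode.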
Execution‑guided RL fine‑tunes only the shared backbone while keeping all flow components frozen, using a combined token‑level and latent‑level PPO objective over 150 RL steps.
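A combined objective of this kind can be sketched with the standard PPO clipped surrogate applied to both token steps and latent steps. The weighting `latent_weight`, the clip range, and the sample format `(logp_new, logp_old, advantage)` are assumptions for illustration; in a setup like the one described, the latent log-probabilities would presumably come from the flow's exact likelihood, but the summary does not spell out the details.

```python
import math


def clipped_term(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one step (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def combined_objective(token_samples, latent_samples, latent_weight=1.0):
    """Mean clipped surrogate over token steps plus a weighted latent mean.

    Each sample is a tuple (logp_new, logp_old, advantage).
    """
    token_obj = sum(clipped_term(*s) for s in token_samples) / len(token_samples)
    latent_obj = sum(clipped_term(*s) for s in latent_samples) / len(latent_samples)
    return token_obj + latent_weight * latent_obj
```

The clipping caps how much a single update can raise the probability of a rewarded step, and the same cap applies whether the step is a token or a continuous latent.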