Latent Reasoning with Normalizing Flows

NF-CoT replaces verbose textual reasoning with continuous latent states modeled by autoregressive normalizing flows.

Can we replace discrete, serial chain-of-thought text with continuous latent states generated by normalizing flows to improve reasoning efficiency and diversity?

Chain-of-thought reasoning forces models to verbalize every intermediate step as text, which is computationally expensive and ties reasoning to the limitations of natural language tokens. NF-CoT introduces a latent reasoning framework that models continuous thoughts using autoregressive normalizing flows within the language model's own causal stream. This allows the model to sample and score reasoning trajectories as compact continuous states before decoding the final answer. On code generation benchmarks, this approach improves pass rates over explicit text-based reasoning while significantly reducing the compute cost of intermediate reasoning steps.

Paper Primer

NF-CoT treats continuous thoughts as first-class citizens by placing a normalizing flow head directly inside the LLM backbone. The core move is to reparameterize latent thoughts into a space that supports exact likelihood estimation and left-to-right autoregressive sampling, allowing the model to generate reasoning and answers in a single causal pass.

NF-CoT significantly outperforms explicit chain-of-thought and prior latent-reasoning baselines on code generation tasks.

On the Qwen3-8B-Base backbone, NF-CoT (Unified) improved average pass@1 from 55.8% to 68.8%.

The method achieves substantial gains in computational efficiency compared to diffusion-based latent reasoners.

NF-CoT (Unified) is 2.70x faster at latent generation and 2.48x cheaper in per-sample compute than the LaDiR baseline.

Why use normalizing flows instead of the diffusion models used in other latent reasoning papers?

Diffusion models require iterative denoising steps, which are computationally expensive and lack a native left-to-right likelihood interface. Normalizing flows provide exact likelihoods and allow for efficient, single-pass autoregressive sampling that integrates seamlessly with the LLM's existing KV cache.
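The exact-likelihood property mentioned above follows from the change-of-variables formula. As a generic illustration, assuming an affine autoregressive flow (the symbols $x_t$, $z_t$, $\mu_\theta$, and $\sigma_\theta$ are introduced here and may not match the paper's notation), each latent position is mapped to noise conditioned only on earlier positions:

$$
\begin{aligned}
z_t &= \frac{x_t - \mu_\theta(x_{<t})}{\sigma_\theta(x_{<t})}, \qquad t = 1,\dots,N,\\
\log p_\theta(x_{1:N}) &= \sum_{t=1}^{N}\Big[\log \mathcal{N}\big(z_t;\,0,\,I\big) - \sum_{d=1}^{D}\log \sigma_{\theta,d}(x_{<t})\Big].
\end{aligned}
$$

Sampling inverts the same map left to right, $x_t = \mu_\theta(x_{<t}) + \sigma_\theta(x_{<t})\, z_t$, which is why a single causal pass with a KV cache can suffice, in contrast to repeated denoising iterations.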

Does this method still produce human-readable reasoning traces?

No. The continuous latent states are not human-readable; the authors treat decoded latent CoTs only as qualitative probes rather than faithful natural-language explanations of the model's internal reasoning.

The Bottleneck of Discrete Reasoning

Why textual chain‑of‑thought is a serial bottleneck and how NF‑CoT proposes a continuous alternative.

Explicit chain‑of‑thought forces every reasoning step to be verbalized as a token before the model can continue, creating a low‑bandwidth, serial bottleneck. This makes long, complex reasoning costly and ties intermediate computation to surface text.

Textual CoT forces reasoning through a sparse, sequential token stream, inflating cost and preventing non‑serial updates; a continuous latent stream could carry richer information per step.

**Figure 1.** Four paradigms for chain-of-thought reasoning. Explicit CoT: discrete text tokens. Coconut: deterministic hidden states. LaDiR: iteratively denoised latents using diffusion. NF-CoT (ours): AR-sampled continuous thoughts.

Discrete CoT limits reasoning to serial token generation.

Empirical Performance and Scaling

NF‑CoT delivers large pass@1 gains and faster latent generation.

NF‑CoT (Unified) achieves the best average pass@1, raising Qwen3‑8B‑Base from 55.8% to 68.8% (+13.0 percentage points).

Table 1 shows NF‑CoT (Unified) leading across all five code benchmarks, with the largest margin over the strongest open‑source baseline.

**Figure 3.** Pass@k scaling on MBPP+ and HumanEval+. NF-CoT outperforms the base model and LaDiR, and continues to improve with larger k.

**Figure 4.** Pass@k diversity before and after reinforcement learning on MBPP+ and HumanEval+. Top row: standard token-space GRPO improves the low-k region but saturates at larger k, failing to provide the same large-sample coverage as the base model. Bottom row: latent-space RL for NF-CoT improves pass@1 while preserving the upward pass@k scaling trend, indicating that policy-gradient refinement in continuous-CoT space does not collapse the latent trajectory distribution.

**Figure 5.** Effect of continuous-CoT perturbation strength on HumanEval generation. As $\sigma$ increases, the perturbed continuous-CoT trajectory becomes nearly orthogonal to the original trajectory, and exact-text match drops sharply. Nevertheless, pass@1 remains nearly flat, indicating that local perturbations change the form of the generated program much more than its functional correctness, even at the largest perturbation strengths tested.
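To see why a perturbed trajectory can become nearly orthogonal to the original, the sketch below adds isotropic Gaussian noise of strength $\sigma$ to one random latent vector and measures cosine similarity. This is a toy illustration, not the paper's protocol: the additive-noise scheme, the unit-variance latent, and the chosen $\sigma$ values are all assumptions; only the dimension $D=2560$ comes from the text.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def perturb(latent, sigma, rng):
    """Assumed perturbation: add N(0, sigma^2) noise to every coordinate."""
    return [x + sigma * rng.gauss(0.0, 1.0) for x in latent]

rng = random.Random(0)
dim = 2560  # D from the stated latent geometry; one latent position only
latent = [rng.gauss(0.0, 1.0) for _ in range(dim)]

sigmas = [0.0, 0.5, 1.0, 4.0, 16.0]  # illustrative strengths
similarities = [cosine(latent, perturb(latent, s, rng)) for s in sigmas]
for s, c in zip(sigmas, similarities):
    print(f"sigma={s:5.1f}  cosine={c:.3f}")
```

Under these assumptions the similarity falls roughly as $1/\sqrt{1+\sigma^2}$, so large $\sigma$ drives the two vectors toward orthogonality even though each coordinate is only shifted by noise.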

**Figure 6.** Pairwise output similarity among passing HumanEval programs. Each panel contains 32 = 4 × 8 passing samples from four HumanEval problems, with intra-task blocks on the diagonal and cross-task entries set to zero to isolate within-prompt structure. Similarity is computed using a structure-aware AST metric; lower values indicate higher diversity. NF-CoT achieves lower mean intra-prompt similarity than the Qwen3-8B-Base teacher, indicating more structurally diverse passing solutions under the same answer temperature.
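The caption does not give the exact similarity formula. As a rough stand-in, the sketch below compares two Python programs by the sequence of their AST node types, so renaming variables leaves the score unchanged while a structural change lowers it. The function names and the use of `difflib.SequenceMatcher` are illustrative assumptions, not the paper's metric.

```python
import ast
from difflib import SequenceMatcher

def node_types(source):
    """Flatten a program into the sequence of its AST node type names."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]

def ast_similarity(source_a, source_b):
    """Ratio in [0, 1]; 1.0 means identical node-type sequences."""
    matcher = SequenceMatcher(None, node_types(source_a), node_types(source_b))
    return matcher.ratio()

loop_version = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
renamed_loop = "def total(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc\n"
builtin_version = "def total(xs):\n    return sum(xs)\n"

# Same structure with different identifiers scores as identical.
print(ast_similarity(loop_version, renamed_loop))
# A structurally different solution scores lower.
print(ast_similarity(loop_version, builtin_version))
```

Averaging such a score over all pairs of passing samples for one prompt would give an intra-prompt similarity of the kind the figure reports, with lower values indicating more diverse solutions.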

NF‑CoT outperforms discrete CoT and LaDiR on code benchmarks.
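For readers unfamiliar with the pass@k metric used in Figures 3 and 4, the sketch below implements the standard unbiased estimator from Chen et al. (2021). Whether the NF-CoT evaluation uses exactly this estimator is an assumption, and the sample counts in the example are made up for illustration.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k from n samples of which c pass: 1 - C(n-c, k) / C(n, k)."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Hypothetical results for three problems, 8 samples each.
results = [(8, 2), (8, 0), (8, 8)]
print(mean_pass_at_k(results, 1))
print(mean_pass_at_k(results, 8))
```

Pass@1 rewards per-sample accuracy, while pass@k at large k rewards coverage, which is why a method can improve the former and still flatten the latter if its samples lose diversity.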

Related Work and Reasoning Paradigms

NF‑CoT swaps discrete token steps for continuous latent states generated by a normalizing flow, removing the serial bottleneck of standard chain‑of‑thought.

Standard chain‑of‑thought (CoT) forces the model to emit each reasoning step as a separate token, creating a strict left‑to‑right bottleneck. NF‑CoT replaces that sequence with continuous latent states produced by a normalizing flow, allowing richer, non‑serial intermediate computation.

Discrete CoT generates a textual reasoning chain, where each intermediate thought is a token that the model must process in order.

LaDiR is a latent‑reasoning baseline that generates latent thoughts by iterative diffusion denoising, which adds computation for every sample.

We now situate NF‑CoT among prior approaches to chain‑of‑thought and latent reasoning.

Prompting technique that elicits intermediate reasoning steps as text before producing the final answer.

Approach that feeds continuous latent vectors back into the model instead of discrete tokens, aiming to capture richer intermediate representations.

Uses the Gumbel‑Softmax trick to turn discrete token choices into differentiable soft embeddings, enabling stochastic latent reasoning.

Variational auto‑encoder framework that learns a latent distribution; some works denoise this space to guide reasoning.

Invertible flow models that map a simple base distribution to sequences, preserving the causal order of Transformers.

Caveats and Failure Modes

We enumerate prior works that inform the study’s limitations.

Key prior works include Jacob Austin et al. (2021) on program synthesis, Natasha Butt et al. (2025) on soft tokens, Mark Chen et al. (2021) on code evaluation, Jingcheng Deng et al. (2026) on latent reasoning, Laurent Dinh et al. (2014) on NICE, Laurent Dinh et al. (2016) on Real NVP, Tianyu Fu et al. (2025) on selective latent iterations, Jonas Geiping et al. (2026) on scaling test‑time compute, Shansan Gong et al. (2025) on DiffuCoder, Jiatao Gu et al. (2025) on StarFlow‑V, Jiatao Gu et al. (2026a) on high‑resolution image synthesis, Jiatao Gu et al. (2026b) on normalizing trajectory models, Daya Guo et al. (2025) on DeepSeek‑R1, Shibo Hao et al. (2024) on continuous latent reasoning, Jonathan Ho et al. (2020) on diffusion models, Siming Huang et al. (2025) on OpenCoder, Hugging Face (2025) on Open R1, Binyuan Hui et al. (2024) on Qwen2.5‑Coder, Naman Jain et al. (2025) on LiveCodeBench, Eric Jang et al. (2016) on Gumbel‑Softmax, Haoqiang Kang et al. (2025) on LaDiR, Haoqiang Kang et al. (2026) on diversity‑preserving RL, Durk P Kingma & Prafulla Dhariwal (2018) on Glow, Durk P Kingma et al. (2016) on IAF, Takeshi Kojima et al. (2022) on zero‑shot reasoning, Jiawei Liu et al. (2023) on code correctness, Xuezhe Ma et al. (2019) on FlowSeq, Chris J Maddison et al. (2016) on the Concrete distribution, Shen Nie et al. (2026) on large language diffusion, Maxwell Nye et al. (2021) on Scratchpads, George Papamakarios et al. (2017) on masked autoregressive flow, Danilo Rezende & Shakir Mohamed (2015) on variational inference, ByteDance Seed et al. (2025) on Seed‑Coder, Zhihong Shao et al. (2024) on DeepSeekMath, Ying Shen et al. (2026) on StarFlow2, Charlie Snell et al. (2024) on test‑time compute scaling, DiJia Su et al. (2025) on token assorted, Yao Tang et al. (2026) on multiplex thinking, Ling Team et al. (2025) on mixture‑of‑experts, Qixun Wang et al. (2026) on Monet, Xuezhi Wang et al. (2022) on self‑consistency, Jason Wei et al. (2022) on chain‑of‑thought prompting, Junhong Wu et al. (2025) on single‑threaded reasoning, Zhihui Xie et al. (2025) on Dream‑Coder, Zhangchen Xu et al. (2025) on KodCode, An Yang et al. (2025) on Qwen3, Shunyu Yao et al. (2023) on Tree of Thoughts, Jiacheng Ye et al. (2025) on Dream 7B, Eric Zelikman et al. (2022) on STaR, Huaye Zeng et al. (2025) on AceCoder, Shuangfei Zhai et al. (2024) on normalizing flows capability, Ruixiang Zhang et al. (2026a) on transformer‑based autoregressive flows, Zhen Zhang et al. (2026) on soft thinking, Siyan Zhao et al. (2026) on D1, Yuyan Zhou et al. (2026) on Lepo, Rui‑Jie Zhu et al. (2025) on scaling latent reasoning, and Zachary Ziegler & Alexander Rush (2019) on latent normalizing flows.

Qualitative Reasoning Examples

Ablation study comparing dual‑path and unified‑path NF‑CoT variants.

We compare two concrete design choices for NF‑CoT: a dual‑path variant that runs the backbone twice per step, and a unified‑path variant that collapses both passes into a single causal sequence.

The model runs the LLM twice per gradient step, keeping separate projectors and boundary tokens for the flow and the cross‑entropy (CE) paths.

Forward 1: backbone → hidden₁; `flow_projector(hidden₁)` → latent flow states; NF NLL computed.

Forward 2: backbone (same weights) → hidden₂; `vae.latent_to_decoder(hidden₂)` → latent CE states; CE loss computed.

Gradients from both losses are summed and back‑propagated through the shared backbone.

The second forward doubles compute but isolates the two losses, allowing independent tuning of the flow and CE projectors.

Why not share a single projector for both paths?

Sharing would couple the NF density estimation and the CE prediction, preventing the model from learning distinct representations needed for accurate likelihood modeling versus token prediction. The dual‑path design keeps these objectives orthogonal while still reusing the backbone.

The model merges the two passes into one causal sequence, eliminating the second backbone forward.

Positions

Both losses are summed; gradients flow back through the single backbone forward.

By sharing the backbone forward the model cuts compute roughly in half while preserving the representational capacity of the flow.

Does merging the paths hurt performance because the same hidden states serve two different losses?

The paper reports that the unified‑path variant matches the dual‑path accuracy while halving inference time, indicating that the invertible MetaBlocks successfully decouple the two objectives despite sharing hidden states.

Both variants use the same backbone (Qwen3‑8B‑Base), latent geometry ($N=64$, $D=2560$), and two‑stage curriculum, differing only in the conditioning sequence and which projectors are trainable in Stage 1.

Training hyperparameters include a frozen VAE encoder, dequantization noise $\sigma_{dq}=0.3$, and a flow‑density head built from five invertible MetaBlocks with channel width 2048 and head dimension 64.

Inference proceeds in two phases. Phase 1 samples latent noise $z\sim\mathcal{N}(0,T^{2})$ and runs the NF reverse pass with KV‑cache reuse. Phase 2 feeds the resulting latent prefix to the backbone via vLLM for answer decoding.
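The latent-sampling phase can be pictured with a toy affine autoregressive flow over scalar latents. This is a minimal sketch under stated assumptions, not the paper's flow head: `conditioner` is a hypothetical stand-in for the MetaBlock stack, each latent is a scalar rather than a $D$-dimensional vector, and the temperature value is illustrative.

```python
import math
import random

def conditioner(prefix):
    """Hypothetical stand-in for the flow head: prefix -> (shift, log_scale)."""
    context = sum(prefix) / len(prefix) if prefix else 0.0
    return 0.5 * context, 0.1 * math.tanh(context)

def generate(noise):
    """Reverse pass: map noise z_1..z_N to latents x_1..x_N, left to right."""
    latents = []
    for z in noise:
        # A real model would reuse its KV cache instead of re-reading the prefix.
        shift, log_scale = conditioner(latents)
        latents.append(shift + math.exp(log_scale) * z)
    return latents

def to_noise(latents):
    """Forward pass: invert every step and accumulate the log-scales."""
    noise, log_scale_sum = [], 0.0
    for t, x in enumerate(latents):
        shift, log_scale = conditioner(latents[:t])
        noise.append((x - shift) * math.exp(-log_scale))
        log_scale_sum += log_scale
    return noise, log_scale_sum

def log_likelihood(latents):
    """Exact log-density under a unit-variance Gaussian base distribution."""
    noise, log_scale_sum = to_noise(latents)
    base = sum(-0.5 * (z * z + math.log(2.0 * math.pi)) for z in noise)
    return base - log_scale_sum

rng = random.Random(0)
temperature = 0.7  # T in z ~ N(0, T^2); illustrative value
noise = [temperature * rng.gauss(0.0, 1.0) for _ in range(8)]
latents = generate(noise)
recovered, _ = to_noise(latents)
print(max(abs(a - b) for a, b in zip(noise, recovered)))  # round-trip error
print(log_likelihood(latents))
```

Because every step is invertible, the same network both samples a trajectory and scores it exactly, which is what allows sampled latents to be ranked or used in a likelihood-ratio objective.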

Execution‑guided RL fine‑tunes only the shared backbone while keeping all flow components frozen, using a combined token‑level and latent‑level PPO objective over 150 RL steps.
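The combined objective is not spelled out in this summary. The sketch below shows the standard PPO clipped surrogate and one plausible way to add a token-level and a latent-level term. The additive combination, the `latent_weight` parameter, and the clip range of 0.2 are assumptions for illustration; the latent log-probabilities would presumably come from the flow's exact likelihood.

```python
import math

def clipped_surrogate(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean PPO clipped objective (to be maximized) over a batch of actions."""
    terms = []
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)  # importance ratio pi_new / pi_old
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)

def combined_loss(token_batch, latent_batch, latent_weight=1.0):
    """Negative weighted sum of both surrogates; the weighting is hypothetical."""
    token_obj = clipped_surrogate(*token_batch)
    latent_obj = clipped_surrogate(*latent_batch)
    return -(token_obj + latent_weight * latent_obj)

# Each batch is (new_logps, old_logps, advantages); values are made up.
token_batch = ([-1.0, -0.5], [-1.1, -0.5], [1.0, -1.0])
latent_batch = ([-3.0], [-3.0], [0.5])
print(combined_loss(token_batch, latent_batch))
```

Clipping caps how much a single update can benefit from moving the ratio away from 1, which limits how far the fine-tuned backbone can drift from the sampling policy in one step.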

Experimental Configuration and Drift

Experimental details for curriculum variants and output‑diversity measurements.

We evaluate two training curricula—(i) the default two‑stage warm‑up and (ii) a stage‑2‑only run that begins joint training from random flow parameters. We also measure how NF‑CoT affects the structural diversity of generated programs.

Backbone drift quantifies how far the pretrained language‑model parameters move during joint finetuning with the NF branch.
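The drift formula itself is not given in this summary. One common choice, assumed here purely for illustration, is the relative L2 distance between fine-tuned and pretrained parameters; the function name and the dictionary-of-lists representation are hypothetical.

```python
import math

def relative_drift(pretrained, finetuned):
    """||theta - theta_0||_2 / ||theta_0||_2 over all named parameter lists."""
    diff_sq = 0.0
    base_sq = 0.0
    for name, base in pretrained.items():
        tuned = finetuned[name]
        diff_sq += sum((t - b) ** 2 for t, b in zip(tuned, base))
        base_sq += sum(b * b for b in base)
    return math.sqrt(diff_sq / base_sq)

# Toy parameters: one tensor flattened to a list of floats.
theta_0 = {"layer.weight": [3.0, 4.0]}
theta = {"layer.weight": [3.0, 9.0]}
print(relative_drift(theta_0, theta))
```

A value near zero would mean the backbone barely moved during joint fine-tuning, while larger values indicate that the NF objective pulled the pretrained weights further from their starting point.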

Early training dynamics also differ. The warm‑started run begins stage 2 with $\mathcal{L}_{\text{NF}}\approx-0.42$ and log‑determinant $\approx-0.92$, whereas the stage‑2‑only run starts with $\mathcal{L}_{\text{NF}}\approx0.47$ and log‑determinant near 0. The initial gradient norm is $1.96$ for the stage‑2‑only run versus $0.96$ for the warm‑started model.

Table 6 summarizes the hyperparameters used for the dual‑path and unified‑path NF models; most settings (optimizer, learning‑rate, precision, random seed) are shared, while a few (prompt dropout, prefix markers) differ between the two paths.

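For output diversity we compute a structure‑aware AST similarity between passing programs generated for the same prompt; lower values indicate higher diversity.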

Questions & answers

What is the main contribution of NF-CoT?

NF-CoT introduces a latent reasoning framework that places a normalizing flow head directly inside an LLM backbone, allowing the model to generate and score continuous latent reasoning trajectories in a single causal pass before decoding the final answer, rather than verbalizing every intermediate step as discrete tokens.

What problem does NF-CoT address?

NF-CoT addresses the low-bandwidth, serial bottleneck of explicit chain-of-thought reasoning, where every intermediate reasoning step must be verbalized as a token before the model can continue, making long and complex reasoning computationally expensive and tying intermediate computation to surface text.

Why does NF-CoT use normalizing flows instead of diffusion models?

Diffusion models require iterative denoising steps, which are computationally expensive and lack a native left-to-right likelihood interface. Normalizing flows provide exact likelihoods and allow efficient, single-pass autoregressive sampling that integrates seamlessly with the LLM's existing KV cache.

How does NF-CoT work at a technical level?

NF-CoT reparameterizes latent thoughts into a space that supports exact likelihood estimation and left-to-right autoregressive sampling using a flow-density head built from five invertible MetaBlocks. Inference proceeds in two phases: Phase 1 samples latent noise, runs the normalizing flow reverse pass with KV-cache reuse, and Phase 2 feeds the resulting latent prefix to the backbone via vLLM for answer decoding.

What backbone model and latent geometry does NF-CoT use?

Both the dual-path and unified-path variants use Qwen3-8B-Base as the backbone, with a latent geometry of N=64 and D=2560, and a flow-density head built from five invertible MetaBlocks with channel width 2048 and head dimension 64.

What are the dual-path and unified-path variants of NF-CoT?

The dual-path variant runs the backbone twice per step to keep normalizing flow density estimation and cross-entropy token prediction objectives orthogonal, while the unified-path variant collapses both passes into a single causal sequence using invertible MetaBlocks to decouple the two objectives despite sharing hidden states.

Does merging the dual-path into a unified-path hurt performance?

The paper reports that the unified-path variant matches the dual-path accuracy while halving inference time, indicating that the invertible MetaBlocks successfully decouple the two objectives despite sharing hidden states.

What training procedure does NF-CoT use?

NF-CoT uses a two-stage curriculum: Stage 1 involves warm-up training with a frozen VAE encoder and dequantization noise σ_dq=0.3, and Stage 2 involves joint training. Execution-guided reinforcement learning then fine-tunes only the shared backbone while keeping all flow components frozen, using a combined token-level and latent-level PPO objective over 150 RL steps.

What happens if the two-stage warm-up curriculum is skipped?

A stage-2-only run that begins joint training from random flow parameters starts with LNF≈0.47 and log-determinant near 0, yielding an initial gradient norm of 1.96, compared to LNF≈-0.42 and gradient norm of 0.96 for the warm-started model, indicating less stable early training dynamics without the warm-up.

What benchmarks and datasets are used to evaluate NF-CoT?

NF-CoT is evaluated on five code generation benchmarks, reported in Table 1, with pass@1 as the primary metric. The pass@k scaling and diversity analyses use MBPP+ and HumanEval+; the remaining benchmark names are not listed in the provided text.

What are the key empirical results of NF-CoT?

On the Qwen3-8B-Base backbone, NF-CoT (Unified) raises average pass@1 from 55.8% to 68.8% (+13.0 percentage points) and leads on all five code benchmarks. It is also 2.70x faster at latent generation and 2.48x cheaper in per-sample compute than the LaDiR baseline.

Does NF-CoT produce human-readable reasoning traces?

No. The continuous latent states are not human-readable; the authors treat decoded latent chain-of-thought outputs only as qualitative probes rather than faithful natural-language explanations of the model's internal reasoning.

How does NF-CoT compare to prior latent reasoning approaches such as LaDiR?

NF-CoT outperforms LaDiR on code generation benchmarks. Unlike diffusion-based latent reasoning approaches, NF-CoT uses normalizing flows that provide exact likelihoods and single-pass autoregressive sampling, avoiding the iterative denoising steps that make diffusion-based methods computationally expensive.

Why does NF-CoT use separate projectors for the two paths rather than a shared one?

Sharing a single projector would couple the normalizing flow density estimation and the cross-entropy prediction objectives, preventing the model from learning distinct representations needed for accurate likelihood modeling versus token prediction. The dual-path design keeps these objectives orthogonal while still reusing the backbone.

What are the limitations of NF-CoT?

The continuous latent states are not human-readable, so the method does not provide interpretable reasoning traces. The paper also identifies caveats and failure modes as a dedicated section topic, though specific failure cases are not detailed in the provided text.

What related work does NF-CoT build upon?

Key prior works cited include Jacob Austin et al. (2021) on program synthesis, Natasha Butt et al. (2025) on soft tokens, Mark Chen et al. (2021) on code evaluation, Jingcheng Deng et al. (2026) on latent reasoning, Laurent Dinh et al. (2014) on NICE and (2016) on Real NVP, Tianyu Fu et al. (2025) on selective latent iterations, and Jonas Geiping et al. (2026) on scaling test-time compute.

Who are the authors of NF-CoT and where was it published?

The paper does not explicitly state the author names or publication venue in the provided text; it is available at arxiv.org/abs/2606.06447.

Key terms

NF-CoT
The proposed method (Normalizing Flow Chain-of-Thought) that replaces discrete reasoning tokens with continuous latent states modeled by autoregressive normalizing flows inside an LLM backbone.
normalizing flow
A type of generative model that transforms a simple probability distribution (e.g., Gaussian) into a complex one through a sequence of invertible, differentiable mappings, enabling exact likelihood computation.
chain-of-thought (CoT)
A reasoning technique where a language model verbalizes each intermediate reasoning step as text tokens before producing a final answer.
latent reasoning
An approach to model reasoning where intermediate computation occurs in a continuous, non-token vector space rather than as discrete natural language tokens.
MetaBlock
An invertible building block used in NF-CoT's flow-density head that enables the normalizing flow transformations while allowing the unified-path variant to decouple density estimation and token prediction objectives.
dual-path variant
An NF-CoT architecture design that runs the backbone twice per reasoning step to keep the normalizing flow density estimation and cross-entropy token prediction objectives separate.
unified-path variant
An NF-CoT architecture design that collapses the dual-path's two backbone passes into a single causal sequence, halving inference time while maintaining accuracy.
KV cache
A mechanism in transformer-based language models that stores previously computed key and value attention matrices to avoid redundant computation during sequential generation.
LaDiR
A prior latent reasoning approach that uses diffusion models for intermediate reasoning, which NF-CoT is compared against and outperforms on code generation benchmarks.
execution-guided RL
A reinforcement learning fine-tuning stage in NF-CoT that uses program execution outcomes as reward signals to further train the backbone while keeping flow components frozen.
PPO (Proximal Policy Optimization)
A reinforcement learning algorithm used in NF-CoT's fine-tuning stage, applied at both the token level and latent level over 150 RL steps.
dequantization noise
A small amount of noise (σ_dq=0.3 in NF-CoT) added to discrete data during training to convert it into a continuous distribution suitable for normalizing flow modeling.
two-stage curriculum
NF-CoT's training schedule consisting of a warm-up Stage 1 with a frozen VAE encoder followed by a joint training Stage 2, designed to stabilize early training dynamics.
VAE encoder
A variational autoencoder encoder component used in NF-CoT's Stage 1 training that is kept frozen to provide stable latent representations during warm-up.
pass rate
A code generation evaluation metric measuring the fraction of generated programs that successfully pass a set of test cases.
Real NVP
A normalizing flow architecture introduced by Laurent Dinh et al. (2016) that uses coupling layers for efficient invertible transformations, cited as foundational prior work for NF-CoT.
NICE
An early normalizing flow model introduced by Laurent Dinh et al. (2014) based on additive coupling layers, cited as foundational prior work for NF-CoT.
