I keep running into the same limitation with autoregressive language models.

They commit too early.

They can revise only by adding more tokens. So “thinking” becomes a one-way street.

This post is a working theory.

The core idea is simple.

Let the model think in a space where revision is cheap, then speak in tokens.

/motivation

In image generation, diffusion is good at global structure.

It can explore. It can revise. It can converge.

Autoregressive refiners are good at local detail.

They sharpen. They lock in. They make things crisp.

My diffusion-AR hybrid paper was an attempt to separate these roles.

Not to claim a new SOTA. Just to test a separation principle.

Coarse structure first. Fine emission last.

I want the same separation for language.

/translation

For language, “global structure” is not pixels.

It is plan structure.

Examples of what I mean by “plan”:

  • constraints
  • subgoals
  • invariants
  • proof skeleton
  • tool call sequence
  • outline of steps with dependencies

The output tokens are the “local detail”.

So the mapping becomes:

  • diffusion stage: refine a plan state
  • AR stage: generate the final answer tokens conditioned on the plan

Think: revise. Speak: commit.

/why diffusion here

Diffusion is not just an image trick.

It is a procedure:

Start from noise. Iteratively denoise toward a coherent state. Make many small corrections instead of one big commitment.

Language reasoning wants that.

Because reasoning is mostly editing.

You try a structure. You find a contradiction. You patch it. You simplify.

AR decoding does not like patching. It likes momentum.

/design space

There are at least three ways to make “diffusion thinking” concrete.

1) latent diffusion over a continuous plan

Represent the plan as a continuous latent sequence z.

Train a diffusion model to denoise z_T -> z_0. Then decode text autoregressively conditioned on z_0.

Pros:

  • closest to real diffusion math
  • cheap to iterate in latent space

Cons:

  • you must learn a plan latent that is actually useful
  • alignment between z and text is non-trivial
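To make the loop concrete, here is a minimal sketch of the reverse process. The denoiser is a toy stand-in that pulls the latent toward a fixed target; a real system would use a trained network here, and would then decode text conditioned on z_0:

```python
import random

def reverse_diffusion(denoise_step, z_T, num_steps):
    """Iteratively refine a noisy plan latent z_T toward z_0.

    denoise_step(z, t) stands in for a learned denoiser: it returns a
    slightly less noisy estimate of the plan latent at each step.
    """
    z = z_T
    for t in range(num_steps, 0, -1):
        z = denoise_step(z, t)
    return z

# Toy denoiser: nudge each coordinate halfway toward a hypothetical
# "clean plan" latent. Many small corrections, not one big commitment.
target = [1.0] * 8

def toy_denoiser(z, t):
    return [zi + 0.5 * (ti - zi) for zi, ti in zip(z, target)]

random.seed(0)
z_T = [random.gauss(0, 1) for _ in range(8)]   # start from noise
z_0 = reverse_diffusion(toy_denoiser, z_T, num_steps=20)
# z_0 is now close to the target latent; AR decoding would condition on it.
```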

2) discrete diffusion over plan tokens

Represent the plan as a sequence of discrete “plan tokens”.

Corrupt it by masking or shuffling. Iteratively restore it (denoise). Then AR decode the final answer conditioned on the restored plan.

Pros:

  • stays in language-like space
  • “revision” becomes first-class

Cons:

  • plan tokenization matters a lot
  • it can collapse into fluffy scratchpads
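A minimal sketch of the mask-and-restore loop. The per-slot predictor is a hypothetical lookup standing in for a learned model; the restored plan would then condition the AR decoder:

```python
import random

MASK = "<mask>"

def corrupt(plan, mask_frac, rng):
    """Forward (noising) process: mask a fraction of plan tokens."""
    plan = list(plan)
    k = max(1, int(len(plan) * mask_frac))
    for i in rng.sample(range(len(plan)), k):
        plan[i] = MASK
    return plan

def restore(plan, predict):
    """Reverse (denoising) process: fill one masked slot per step.

    predict(plan, i) stands in for a learned model that proposes a
    token for position i given the current, partially masked plan.
    """
    plan = list(plan)
    while MASK in plan:
        i = plan.index(MASK)
        plan[i] = predict(plan, i)
    return plan

# Toy predictor: a hypothetical per-slot lookup of the best plan token.
reference = ["check", "inputs", "derive", "bound", "verify", "result"]
predict = lambda plan, i: reference[i]

rng = random.Random(0)
noisy = corrupt(reference, mask_frac=0.5, rng=rng)
clean = restore(noisy, predict)
```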

3) diffusion-style refinement without training a diffusion model

Use a plan state. Propose edits stochastically. Score them with a verifier signal. Accept or reject. Repeat.

This is not “true diffusion”. But it has the same spirit: iterative state refinement.

Pros:

  • can be done at inference time
  • can reuse existing LMs + verifiers

Cons:

  • the quality of the score signal becomes the bottleneck
  • can become expensive without smart proposals

This third option is the easiest to prototype.

It is also the most honest starting point.
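A sketch of that refinement loop. The proposer and the score are toy stand-ins: the "plan" is just an ordering of steps and the score rewards the correct order. A real version would use an LM to propose edits and a verifier to score them:

```python
import random

def refine(plan, propose, score, steps, rng):
    """Diffusion-style refinement at inference time: propose a small
    edit, score it, accept or reject, keep the best plan seen."""
    best = current = plan
    best_score = current_score = score(plan)
    for _ in range(steps):
        candidate = propose(current, rng)
        s = score(candidate)
        if s >= current_score:   # greedy accept; a softer rule also works
            current, current_score = candidate, s
            if s > best_score:
                best, best_score = candidate, s
    return best

def propose(plan, rng):
    """Toy edit move: swap two steps of the plan."""
    i, j = rng.sample(range(len(plan)), 2)
    out = list(plan)
    out[i], out[j] = out[j], out[i]
    return out

def score(plan):
    """Toy verifier signal: 0 when steps are in the right order."""
    return -sum(abs(v - i) for i, v in enumerate(plan))

rng = random.Random(0)
initial = [4, 2, 0, 3, 1]    # one pass from the LM, hypothetically
final = refine(initial, propose, score, steps=200, rng=rng)
```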

/the minimal prototype

If I were building a first version, I would keep it boring.

  1. Choose a plan format:
  • constraints + subgoals in bullet form
  • short, structured, no prose
  2. Initialize a plan:
  • one pass from the LM
  3. Run K refinement steps:
  • propose a small edit (replace one constraint, reorder steps, tighten an invariant)
  • score the new plan (verifier, PRM, self-check, tool-based check)
  • accept or reject
  • keep the best plan seen
  4. Autoregressively decode the final answer:
  • conditioned on the final plan
  • optionally enforce “faithfulness” constraints (answer must cite which plan items it satisfies)
This already creates a real separation.

The plan can converge. The answer can stay short.
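The plan state and the faithfulness check can be sketched directly. The decoder itself is a hypothetical LM call and is not shown; `answer_citations` stands in for constraint ids extracted from the decoded answer:

```python
from dataclasses import dataclass, field

@dataclass
class Plan:
    """Short, structured plan state: constraints + subgoals, no prose."""
    constraints: list = field(default_factory=list)
    subgoals: list = field(default_factory=list)

def faithfulness(answer_citations, plan):
    """Fraction of plan constraints the final answer claims to satisfy.

    Constraint ids are C0, C1, ... by position; answer_citations is a
    hypothetical extraction from the decoded answer (e.g. "satisfies C0").
    """
    ids = {f"C{i}" for i in range(len(plan.constraints))}
    cited = set(answer_citations) & ids
    return len(cited) / max(1, len(ids))

plan = Plan(
    constraints=["answer must be an integer", "use at most 3 steps"],
    subgoals=["reduce to base case", "apply formula"],
)
cov = faithfulness(["C0"], plan)   # answer cited only the first constraint
```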

/what would count as a win

This idea is not interesting unless it changes measurable behavior.

I would test:

  • fixed budget accuracy vs AR-only at the same token cost
  • contradiction rate in long solutions
  • plan faithfulness: does the final answer actually satisfy the plan constraints
  • stability: does refinement reduce variance across seeds

Target tasks:

  • long-horizon math
  • multi-step tool use
  • code generation with global constraints
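Compute matching is mostly token accounting. A sketch, assuming plan rewrites and the final answer are the only token costs (real accounting would also count proposal and scoring calls):

```python
def token_budget(plan_len, k_steps, answer_len):
    """Total tokens the refinement pipeline consumes: K full plan
    rewrites plus the final answer. The AR-only baseline gets the
    same total as its generation budget."""
    return plan_len * k_steps + answer_len

# Example sizing: a 40-token plan refined for 8 steps, 120-token answer.
budget = token_budget(plan_len=40, k_steps=8, answer_len=120)
```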

/failure modes

I expect the main failures to be boring.

plan becomes fluff

The plan reads nicely. It does not constrain anything.

plan leaks the answer

The plan just becomes chain-of-thought with extra steps.

Then the AR stage is pointless.

exposure mismatch

The AR decoder is conditioned on plans it never saw during training.

So it ignores them.

scoring signal is weak

If the verifier signal is noisy, refinement becomes a random walk.

Then the “diffusion thinking” stage does not converge.

/why I still like it

Reasoning is state estimation.

AR decoding is state emission.

Right now we conflate them.

This proposal separates them.

It makes “revision” a native operation. Then it makes “commitment” the final step.

That separation feels like leverage.

Even if the first prototypes are ugly.

/status

This is not a result.

It is a design sketch.

If it fails, it should fail in an interpretable way.

That is the point.

Next step is a small prototype with:

  • a structured plan state
  • a cheap verifier signal
  • a refinement loop
  • a compute-matched baseline

Then the idea can be kept or killed.