I keep running into the same limitation with autoregressive language models.
They commit too early.
They can revise only by adding more tokens. So “thinking” becomes a one-way street.
This post is a working theory.
The core idea is simple.
Let the model think in a space where revision is cheap, then speak in tokens.
/motivation
In image generation, diffusion is good at global structure.
It can explore. It can revise. It can converge.
Autoregressive refiners are good at local detail.
They sharpen. They lock in. They make things crisp.
My diffusion-AR hybrid paper was an attempt to separate these roles.
Not to claim a new SOTA. Just to test a separation principle.
Coarse structure first. Fine emission last.
I want the same separation for language.
/translation
For language, “global structure” is not pixels.
It is plan structure.
Examples of what I mean by “plan”:
- constraints
- subgoals
- invariants
- proof skeleton
- tool call sequence
- outline of steps with dependencies
The output tokens are the “local detail”.
So the mapping becomes:
- diffusion stage: refine a plan state
- AR stage: generate the final answer tokens conditioned on the plan
Think: revise. Speak: commit.
/why diffusion here
Diffusion is not just an image trick.
It is a procedure:
Start from noise. Iteratively denoise toward a coherent state. Make many small corrections instead of one big commitment.
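The shape of that procedure fits in a few lines. A toy numeric sketch, not language and not a learned model: the correction direction is assumed known here, which is exactly what real diffusion training would provide.

```python
import random

def toy_denoise(target, steps=50, step_size=0.2, seed=0):
    """Toy illustration of the diffusion procedure: start from noise,
    make many small corrections toward a coherent state (here, a known
    target), and anneal exploration noise away over time."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 5.0)                 # start from pure noise
    for t in range(steps):
        noise_scale = 1.0 - t / steps       # exploration shrinks over time
        correction = step_size * (target - x)  # small step, not one big commit
        x = x + correction + rng.gauss(0.0, 0.1 * noise_scale)
    return x
```

The point is the control flow: many small, partially noisy corrections, with exploration annealed away as the state converges.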
Language reasoning wants that.
Because reasoning is mostly editing.
You try a structure. You find a contradiction. You patch it. You simplify.
AR decoding does not like patching. It likes momentum.
/design space
There are at least three ways to make “diffusion thinking” concrete.
1) latent diffusion over a continuous plan
Represent the plan as a continuous latent sequence z.
Train a diffusion model to denoise z_T -> z_0.
Then decode text autoregressively conditioned on z_0.
Pros:
- closest to real diffusion math
- cheap to iterate in latent space
Cons:
- you must learn a plan latent that is actually useful
- alignment between z and text is non-trivial

2) discrete diffusion over plan tokens
Represent the plan as a sequence of discrete “plan tokens”.
Corrupt it by masking or shuffling. Iteratively restore it (denoise). Then AR decode the final answer conditioned on the restored plan.
Pros:
- stays in language-like space
- “revision” becomes first-class
Cons:
- plan tokenization matters a lot
- it can collapse into fluffy scratchpads
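To make the corrupt-and-restore loop concrete, here is a minimal sketch. Everything model-shaped is stubbed: `fill_masks` stands in for a trained predictor (the toy below just reads tokens back from a reference plan), and a real system would unmask the model's most confident positions rather than random ones.

```python
import random

MASK = "<mask>"

def corrupt(plan, mask_frac, rng):
    """Forward process: mask a fraction of plan tokens."""
    out = list(plan)
    for i in rng.sample(range(len(out)), int(mask_frac * len(out))):
        out[i] = MASK
    return out

def restore(corrupted, fill_masks, steps, rng):
    """Reverse process: unmask a few positions per step.
    `fill_masks(plan, positions)` stands in for a trained model
    that predicts tokens at the given positions."""
    plan = list(corrupted)
    for _ in range(steps):
        masked = [i for i, t in enumerate(plan) if t == MASK]
        if not masked:
            break
        # a real system would pick the model's most confident positions
        chosen = rng.sample(masked, max(1, len(masked) // 2))
        for i, tok in zip(chosen, fill_masks(plan, chosen)):
            plan[i] = tok
    return plan

rng = random.Random(0)
plan = ["goal:sort", "constraint:stable", "step:partition", "step:recurse"]
noisy = corrupt(plan, 0.5, rng)
# stub predictor: restores from the reference plan
restored = restore(noisy, lambda p, idx: [plan[i] for i in idx], steps=4, rng=rng)
```

The restore loop is where "revision becomes first-class": every step re-reads the whole partial plan before filling in more of it.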
3) diffusion-style refinement without training a diffusion model
Use a plan state. Propose edits stochastically. Score them with a verifier signal. Accept or reject. Repeat.
This is not “true diffusion”. But it has the same spirit: iterative state refinement.
Pros:
- can be done at inference time
- can reuse existing LMs + verifiers
Cons:
- the quality of the score signal becomes the bottleneck
- can become expensive without smart proposals
This third option is the easiest to prototype.
It is also the most honest starting point.
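One concrete choice for the accept-or-reject step is a simulated-annealing-style rule (a sketch, not the only option): always take improvements, occasionally take regressions so the search can escape plans that are locally coherent but globally wrong.

```python
import math
import random

def accept(old_score, new_score, temperature, rng):
    """Metropolis-style accept rule for plan edits: improvements always
    pass; regressions pass with probability exp(delta / temperature)."""
    if new_score >= old_score:
        return True
    return rng.random() < math.exp((new_score - old_score) / temperature)
```

At temperature near zero this degenerates to greedy hill climbing; an annealing schedule would tighten acceptance as the plan converges.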
/the minimal prototype
If I were building a first version, I would keep it boring.
- Choose a plan format:
  - constraints + subgoals in bullet form
  - short, structured, no prose
- Initialize a plan:
  - one pass from the LM
- Run K refinement steps:
  - propose a small edit (replace one constraint, reorder steps, tighten an invariant)
  - score the new plan (verifier, PRM, self-check, tool-based check)
  - accept or reject
  - keep the best plan seen
- Autoregressively decode the final answer:
  - conditioned on the final plan
  - optionally enforce "faithfulness" constraints (answer must cite which plan items it satisfies)
This already creates a real separation.
The plan can converge. The answer can stay short.
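The refinement loop above can be sketched end-to-end. The stubs are the point: `propose` and `score` below are toy stand-ins (random single-item edits, a constraint-counting verifier); in a real prototype they would be LM and verifier calls.

```python
import random

def refine_plan(init_plan, propose, score, k, rng):
    """Minimal refinement loop: propose a small edit, score it,
    accept if not worse, and always keep the best plan seen."""
    plan, s = list(init_plan), score(init_plan)
    best, best_s = list(plan), s
    for _ in range(k):
        cand = propose(plan, rng)
        cs = score(cand)
        if cs >= s:                      # greedy accept; could be stochastic
            plan, s = cand, cs
        if cs > best_s:
            best, best_s = list(cand), cs
    return best, best_s

# toy instantiation: plan items are strings; the "verifier" counts
# how many required constraints the plan currently contains
REQUIRED = {"c1", "c2", "c3"}
POOL = ["c1", "c2", "c3", "fluff1", "fluff2"]

def propose(plan, rng):
    cand = list(plan)
    cand[rng.randrange(len(cand))] = rng.choice(POOL)  # edit one item
    return cand

def score(plan):
    return len(REQUIRED & set(plan))

rng = random.Random(0)
best, best_s = refine_plan(["fluff1", "fluff2", "fluff1"], propose, score,
                           k=500, rng=rng)
```

Even this toy exhibits the intended behavior: the plan state converges toward satisfying the constraint set, while the answer-decoding stage stays entirely separate.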
/what would count as a win
This idea is not interesting unless it changes measurable behavior.
I would test:
- fixed-budget accuracy vs AR-only at the same token cost
- contradiction rate in long solutions
- plan faithfulness:
  - does the final answer actually satisfy the plan constraints
- stability:
  - does refinement reduce variance across seeds
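For the fixed-budget comparison, the accounting has to be explicit. A minimal sketch of one way to compute-match, as an assumption rather than a standard: it charges the refinement stage one plan-length's worth of tokens per step and ignores verifier-call cost, which a stricter comparison would also count.

```python
def two_stage_token_cost(plan_tokens, k_steps, answer_tokens):
    """Total token budget of plan refinement + AR decoding.
    The AR-only baseline gets this same number of tokens to
    spend on chain-of-thought plus its final answer."""
    return plan_tokens * k_steps + answer_tokens
```

For example, a 50-token plan refined for 8 steps plus a 200-token answer should be compared against an AR-only run allowed 600 tokens in total.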
Target tasks:
- long-horizon math
- multi-step tool use
- code generation with global constraints
/failure modes
I expect the main failures to be boring.
plan becomes fluff
The plan reads nice. It does not constrain anything.
plan leaks the answer
The plan just becomes chain-of-thought with extra steps.
Then the AR stage is pointless.
exposure mismatch
The AR decoder is conditioned on plans it never saw during training.
So it ignores them.
scoring signal is weak
If the verifier signal is noisy, refinement becomes a random walk.
Then the “diffusion thinking” stage does not converge.
/why I still like it
Reasoning is state estimation.
AR decoding is state emission.
Right now we conflate them.
This proposal separates them.
It makes “revision” a native operation and “commitment” a final step.
That separation feels like leverage.
Even if the first prototypes are ugly.
/status
This is not a result.
It is a design sketch.
If it fails, it should fail in an interpretable way.
That is the point.
Next step is a small prototype with:
- a structured plan state
- a cheap verifier signal
- a refinement loop
- a compute-matched baseline
Then the idea can be kept or killed.