I keep running into the same limitation with autoregressive language models.

They commit too early.

They can revise only by adding more tokens. So “thinking” becomes a one-way street.

This post is a working theory.

The core idea is simple.

Let the model think in a space where revision is cheap, then speak in tokens.

/motivation

In image generation, diffusion is good at global structure.

It can explore. It can revise. It can converge.

Autoregressive refiners are good at local detail.

They sharpen. They lock in. They make things crisp.

My diffusion-AR hybrid paper was an attempt to separate these roles.

Not to claim a new SOTA. Just to test a separation principle.

Coarse structure first. Fine emission last.

I want the same separation for language.

/translation

For language, “global structure” is not pixels.

It is plan structure.

Examples of what I mean by “plan”:

  • constraints
  • subgoals
  • invariants
  • proof skeleton
  • tool call sequence
  • outline of steps with dependencies

The output tokens are the “local detail”.

So the mapping becomes:

  • diffusion stage: refine a plan state
  • AR stage: generate the final answer tokens conditioned on the plan

Think: revise. Speak: commit.

/why diffusion here

Diffusion is not just an image trick.

It is a procedure:

Start from noise. Iteratively denoise toward a coherent state. Make many small corrections instead of one big commitment.

Language reasoning wants that.

Because reasoning is mostly editing.

You try a structure. You find a contradiction. You patch it. You simplify.

AR decoding does not like patching. It likes momentum.

/design space

There are at least three ways to make “diffusion thinking” concrete.

1) latent diffusion over a continuous plan

Represent the plan as a continuous latent sequence z.

Train a diffusion model to denoise z_T -> z_0. Then decode text autoregressively conditioned on z_0.

Pros:

  • closest to real diffusion math
  • cheap to iterate in latent space

Cons:

  • you must learn a plan latent that is actually useful
  • alignment between z and text is non-trivial
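To make the loop concrete, here is a minimal sketch of the reverse process. The denoiser is a toy stand-in that pulls the latent toward a fixed target; a real system would use a trained network here, and would then decode text conditioned on z_0:

```python
import random

def reverse_diffusion(denoise_step, z_T, num_steps):
    """Iteratively refine a noisy plan latent z_T toward z_0.

    denoise_step(z, t) stands in for a learned denoiser: it returns a
    slightly less noisy estimate of the plan latent at each step.
    """
    z = z_T
    for t in range(num_steps, 0, -1):
        z = denoise_step(z, t)
    return z

# Toy denoiser: nudge each coordinate halfway toward a hypothetical
# "clean plan" latent. Many small corrections, not one big commitment.
target = [1.0] * 8

def toy_denoiser(z, t):
    return [zi + 0.5 * (ti - zi) for zi, ti in zip(z, target)]

random.seed(0)
z_T = [random.gauss(0, 1) for _ in range(8)]   # start from noise
z_0 = reverse_diffusion(toy_denoiser, z_T, num_steps=20)
# z_0 is now close to the target latent; AR decoding would condition on it.
```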

2) discrete diffusion over plan tokens

Represent the plan as a sequence of discrete “plan tokens”.

Corrupt it by masking or shuffling. Iteratively restore it (denoise). Then AR decode the final answer conditioned on the restored plan.

Pros:

  • stays in language-like space
  • “revision” becomes first-class

Cons:

  • plan tokenization matters a lot
  • it can collapse into fluffy scratchpads
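A minimal sketch of the mask-and-restore loop. The per-slot predictor is a hypothetical lookup standing in for a learned model; the restored plan would then condition the AR decoder:

```python
import random

MASK = "<mask>"

def corrupt(plan, mask_frac, rng):
    """Forward (noising) process: mask a fraction of plan tokens."""
    plan = list(plan)
    k = max(1, int(len(plan) * mask_frac))
    for i in rng.sample(range(len(plan)), k):
        plan[i] = MASK
    return plan

def restore(plan, predict):
    """Reverse (denoising) process: fill one masked slot per step.

    predict(plan, i) stands in for a learned model that proposes a
    token for position i given the current, partially masked plan.
    """
    plan = list(plan)
    while MASK in plan:
        i = plan.index(MASK)
        plan[i] = predict(plan, i)
    return plan

# Toy predictor: a hypothetical per-slot lookup of the best plan token.
reference = ["check", "inputs", "derive", "bound", "verify", "result"]
predict = lambda plan, i: reference[i]

rng = random.Random(0)
noisy = corrupt(reference, mask_frac=0.5, rng=rng)
clean = restore(noisy, predict)
```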

3) diffusion-style refinement without training a diffusion model

Use a plan state. Propose edits stochastically. Score them with a verifier signal. Accept or reject. Repeat.

This is not “true diffusion”. But it has the same spirit: iterative state refinement.

Pros:

  • can be done at inference time
  • can reuse existing LMs + verifiers

Cons:

  • the quality of the score signal becomes the bottleneck
  • can become expensive without smart proposals

This third option is the easiest to prototype.

It is also the most honest starting point.
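A sketch of that refinement loop. The proposer and the score are toy stand-ins: the "plan" is just an ordering of steps and the score rewards the correct order. A real version would use an LM to propose edits and a verifier to score them:

```python
import random

def refine(plan, propose, score, steps, rng):
    """Diffusion-style refinement at inference time: propose a small
    edit, score it, accept or reject, keep the best plan seen."""
    best = current = plan
    best_score = current_score = score(plan)
    for _ in range(steps):
        candidate = propose(current, rng)
        s = score(candidate)
        if s >= current_score:   # greedy accept; a softer rule also works
            current, current_score = candidate, s
            if s > best_score:
                best, best_score = candidate, s
    return best

def propose(plan, rng):
    """Toy edit move: swap two steps of the plan."""
    i, j = rng.sample(range(len(plan)), 2)
    out = list(plan)
    out[i], out[j] = out[j], out[i]
    return out

def score(plan):
    """Toy verifier signal: 0 when steps are in the right order."""
    return -sum(abs(v - i) for i, v in enumerate(plan))

rng = random.Random(0)
initial = [4, 2, 0, 3, 1]    # one pass from the LM, hypothetically
final = refine(initial, propose, score, steps=200, rng=rng)
```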

/the minimal prototype

If I were building a first version, I would keep it boring.

  1. Choose a plan format:
  • constraints + subgoals in bullet form
  • short, structured, no prose
  2. Initialize a plan:
  • one pass from the LM
  3. Run K refinement steps:
  • propose a small edit (replace one constraint, reorder steps, tighten an invariant)
  • score the new plan (verifier, PRM, self-check, tool-based check)
  • accept or reject
  • keep the best plan seen
  4. Autoregressively decode the final answer:
  • conditioned on the final plan
  • optionally enforce “faithfulness” constraints (answer must cite which plan items it satisfies)
This already creates a real separation.

The plan can converge. The answer can stay short.
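The plan state and the faithfulness check can be sketched directly. The decoder itself is a hypothetical LM call and is not shown; `answer_citations` stands in for constraint ids extracted from the decoded answer:

```python
from dataclasses import dataclass, field

@dataclass
class Plan:
    """Short, structured plan state: constraints + subgoals, no prose."""
    constraints: list = field(default_factory=list)
    subgoals: list = field(default_factory=list)

def faithfulness(answer_citations, plan):
    """Fraction of plan constraints the final answer claims to satisfy.

    Constraint ids are C0, C1, ... by position; answer_citations is a
    hypothetical extraction from the decoded answer (e.g. "satisfies C0").
    """
    ids = {f"C{i}" for i in range(len(plan.constraints))}
    cited = set(answer_citations) & ids
    return len(cited) / max(1, len(ids))

plan = Plan(
    constraints=["answer must be an integer", "use at most 3 steps"],
    subgoals=["reduce to base case", "apply formula"],
)
cov = faithfulness(["C0"], plan)   # answer cited only the first constraint
```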

/what would count as a win

This idea is not interesting unless it changes measurable behavior.

I would test:

  • fixed budget accuracy vs AR-only at the same token cost
  • contradiction rate in long solutions
  • plan faithfulness: does the final answer actually satisfy the plan constraints
  • stability: does refinement reduce variance across seeds

Target tasks:

  • long-horizon math
  • multi-step tool use
  • code generation with global constraints
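Compute matching is mostly token accounting. A sketch, assuming plan rewrites and the final answer are the only token costs (real accounting would also count proposal and scoring calls):

```python
def token_budget(plan_len, k_steps, answer_len):
    """Total tokens the refinement pipeline consumes: K full plan
    rewrites plus the final answer. The AR-only baseline gets the
    same total as its generation budget."""
    return plan_len * k_steps + answer_len

# Example sizing: a 40-token plan refined for 8 steps, 120-token answer.
budget = token_budget(plan_len=40, k_steps=8, answer_len=120)
```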

/failure modes

I expect the main failures to be boring.

plan becomes fluff

The plan reads nicely. It does not constrain anything.

plan leaks the answer

The plan just becomes chain-of-thought with extra steps.

Then the AR stage is pointless.

exposure mismatch

The AR decoder is conditioned on plans it never saw during training.

So it ignores them.

scoring signal is weak

If the verifier signal is noisy, refinement becomes a random walk.

Then the “diffusion thinking” stage does not converge.

/why I still like it

Reasoning is state estimation.

AR decoding is state emission.

Right now we conflate them.

This proposal separates them.

It makes “revision” a native operation. Then it makes “commitment” the final step.

That separation feels like leverage.

Even if the first prototypes are ugly.

/status

This is not a result.

It is a design sketch.

If it fails, it should fail in an interpretable way.

That is the point.

Next step is a small prototype with:

  • a structured plan state
  • a cheap verifier signal
  • a refinement loop
  • a compute-matched baseline

Then the idea can be kept or killed.