TL;DR

PRISM uses step-level verification (a PRM signal) to guide population refinement and solution aggregation in Deep Think systems, treating candidates like particles in an energy landscape.

Five-line strategy summary

  • Problem: refinement often amplifies errors without a correctness signal (population-enhancement bottleneck).
  • Idea: inject step-level correctness into inference, not training.
  • Mechanism: score-guided stochastic refinement with accept/reject; resample the population when diversity collapses.
  • Aggregation: PRM-score voting instead of frequency-only voting.
  • Outcome: strong results on AIME25/HMMT25/GPQA Diamond with gpt-oss-20b.
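
The strategy above can be sketched as a simple loop. This is a minimal illustration, not the paper's implementation: the names `refine` and `prm_score` are hypothetical stand-ins for the refinement model and the PRM, and the accept/reject rule (keep a proposal only if its PRM score does not drop) and the score-proportional resampling trigger are assumptions for the sketch.

```python
import random
from collections import defaultdict

def prism_step(population, refine, prm_score, diversity_threshold=0.5, rng=random):
    """One population-refinement step with PRM-based accept/reject."""
    new_population = []
    for cand in population:
        proposal = refine(cand)  # stochastic refinement of a candidate
        if prm_score(proposal) >= prm_score(cand):
            new_population.append(proposal)  # accept: score did not drop
        else:
            new_population.append(cand)      # reject: keep the original
    # Resample when diversity collapses: if too few distinct candidates
    # remain, redraw with probability proportional to PRM score.
    if len(set(new_population)) / len(new_population) < diversity_threshold:
        weights = [prm_score(c) for c in new_population]
        new_population = rng.choices(new_population, weights=weights,
                                     k=len(new_population))
    return new_population

def prm_weighted_vote(candidates, prm_score):
    """Aggregate by summing PRM scores per answer, not by raw frequency."""
    totals = defaultdict(float)
    for cand in candidates:
        totals[cand] += prm_score(cand)
    return max(totals, key=totals.get)
```

Note how `prm_weighted_vote` can pick a minority answer: an answer appearing once with a high PRM score outweighs a frequent answer with low scores, which is the point of replacing frequency-only voting.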

Results (headline)

AIME25: 90.0% • HMMT25: 75.4% • GPQA Diamond: 71.4% (gpt-oss-20b + PRISM).

Citation (BibTeX)

@misc{sharma2026prismpushingfrontierdeep,
  title={PRISM: Pushing the Frontier of Deep Think via Process Reward Model-Guided Inference},
  author={Rituraj Sharma and Weiyuan Chen and Noah Provenzano and Tu Vu},
  year={2026},
  eprint={2603.02479},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2603.02479},
}