TL;DR
PRISM uses step-level verification (a PRM signal) to guide population refinement and solution aggregation in Deep Think systems, treating candidates like particles in an energy landscape.
Five-line strategy summary
- Problem: refinement often amplifies errors without a correctness signal (population-enhancement bottleneck).
- Idea: inject step-level correctness into inference, not training.
- Mechanism: score → resample when diversity collapses → stochastic refinement with accept/reject.
- Aggregation: PRM-score voting instead of frequency-only voting.
- Outcome: strong results on AIME25/HMMT25/GPQA Diamond with gpt-oss-20b as the base model.
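The mechanism and aggregation steps above can be sketched as a single loop. This is a minimal illustration, not the paper's implementation: `prm_score` and `refine` are hypothetical stand-ins for the process reward model and the LLM refinement call, and the softmax resampling, accept/reject rule, and diversity threshold are illustrative choices.

```python
import math
import random
from collections import defaultdict

def prism_sketch(candidates, prm_score, refine,
                 steps=8, temperature=0.5, diversity_floor=0.3):
    """Illustrative PRISM-style loop: PRM scoring, resampling on diversity
    collapse, stochastic refinement with accept/reject, then PRM-score voting.
    `prm_score` and `refine` are hypothetical callables, not the paper's API."""
    for _ in range(steps):
        scores = [prm_score(c) for c in candidates]
        # Resample (softmax-weighted by PRM score) when the population
        # collapses to too few distinct candidates.
        if len(set(candidates)) / len(candidates) < diversity_floor:
            weights = [math.exp(s / temperature) for s in scores]
            candidates = random.choices(candidates, weights=weights,
                                        k=len(candidates))
            scores = [prm_score(c) for c in candidates]
        # Stochastic refinement: keep a proposal if it scores at least as
        # well; otherwise accept it with probability exp(delta / T).
        next_pop = []
        for cand, s_old in zip(candidates, scores):
            proposal = refine(cand)
            delta = prm_score(proposal) - s_old
            if delta >= 0 or random.random() < math.exp(delta / temperature):
                next_pop.append(proposal)
            else:
                next_pop.append(cand)
        candidates = next_pop
    # Aggregation: PRM-score voting rather than frequency-only voting —
    # each candidate's vote is weighted by its PRM score.
    votes = defaultdict(float)
    for cand in candidates:
        votes[cand] += prm_score(cand)
    return max(votes, key=votes.get)
```

With an identity `refine`, the loop reduces to pure PRM-weighted voting, which makes the aggregation step easy to test in isolation.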
Results (headline)
AIME25: 90.0% • HMMT25: 75.4% • GPQA Diamond: 71.4% (gpt-oss-20b + PRISM).
Citation (BibTeX)
@misc{sharma2026prismpushingfrontierdeep,
  title={PRISM: Pushing the Frontier of Deep Think via Process Reward Model-Guided Inference},
  author={Rituraj Sharma and Weiyuan Chen and Noah Provenzano and Tu Vu},
  year={2026},
  eprint={2603.02479},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2603.02479},
}