TR2026-094
Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment
Wang, Y.; Liu, J.; Koike-Akino, T., "Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment", International Conference on Machine Learning (ICML) Workshop on Agents in the Wild: Safety, Security, and Beyond, July 2026.
@inproceedings{Wang2026jul,
  author = {Wang, Ye and Liu, Jing and Koike-Akino, Toshiaki},
  title = {{Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment}},
  booktitle = {International Conference on Machine Learning (ICML) Workshop on Agents in the Wild: Safety, Security, and Beyond},
  year = 2026,
  month = jul,
  url = {https://www.merl.com/publications/TR2026-094}
}
Abstract:
Inference-time alignment techniques offer a lightweight alternative or complement to costly reinforcement learning, while enabling continual adaptation as alignment objectives and reward targets evolve. Existing theoretical analyses justify these methods as approximations to sampling from distributions optimally tilted toward a given reward model. We extend these techniques by introducing reference-model temperature adjustment, which leads to a further generalization of inference-time alignment to ensembles of generative and reward models combined as a sharpened logarithmic opinion pool (SLOP). To address reward hacking, we propose an algorithm for calibrating the SLOP weight parameters and experimentally demonstrate that it improves robustness while preserving alignment performance.
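
As an illustration only, the combination described in the abstract can be sketched over a small, finite set of candidate responses. The abstract gives no formulas, so everything below is an assumption: a pool of the form p(k) ∝ ∏_i p_i(k)^{a_i} · exp(Σ_j b_j r_j(k)), where the exponent a_i on a generative model plays the role of an inverse temperature and b_j scales the tilt toward reward model j. The function name `slop_distribution`, its parameters, and the toy numbers are hypothetical and are not taken from the paper; the paper's weight-calibration algorithm is not reproduced here.

```python
import math


def slop_distribution(model_logprobs, rewards, model_weights, reward_weights):
    """Hypothetical sketch of a weighted log-linear pool over K candidates.

    model_logprobs[i][k] : log-probability of candidate k under generative model i
    rewards[j][k]        : score of candidate k under reward model j
    model_weights[i]     : exponent a_i on model i (assumed inverse temperature)
    reward_weights[j]    : tilt strength b_j for reward model j (assumed)

    Returns probabilities proportional to
        prod_i p_i(k)^{a_i} * exp(sum_j b_j * r_j(k)).
    """
    num_candidates = len(model_logprobs[0])
    scores = []
    for k in range(num_candidates):
        score = sum(a * lp[k] for a, lp in zip(model_weights, model_logprobs))
        score += sum(b * r[k] for b, r in zip(reward_weights, rewards))
        scores.append(score)
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores)
    unnormalized = [math.exp(s - top) for s in scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Toy inputs (made up for illustration): one reference model, one reward model.
ref_logprobs = [math.log(p) for p in (0.5, 0.3, 0.2)]
toy_rewards = [0.0, 1.0, 2.0]

# Exponent 1 and no tilt recovers the reference distribution.
plain = slop_distribution([ref_logprobs], [toy_rewards], [1.0], [0.0])
# Exponent 1 with a unit tilt shifts mass toward high-reward candidates.
tilted = slop_distribution([ref_logprobs], [toy_rewards], [1.0], [1.0])
# Exponent 2 (temperature 0.5) with no tilt sharpens the reference model.
sharpened = slop_distribution([ref_logprobs], [toy_rewards], [2.0], [0.0])
```

In this toy setting, raising the reward weight moves probability toward the highest-scoring candidate, while raising the model exponent concentrates probability on the reference model's preferred candidate. Under the assumed form, keeping the reward weights small relative to the model exponents limits how far a miscalibrated reward model can pull the pooled distribution, which is one way to read the reward hacking concern in the abstract.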


