06
2025/12
Search-R1 Reproduction Keeps Failing? The Real Culprit Behind Unstable GRPO Training, and How to Counter It
Paper title: On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death S
...