Large Models (154)

Google DeepMind & MIT Release a Scaling Law for Agents

Paper title: Towards a Science of Scaling Agent Systems Paper link: https://arxiv.org/pdf ...

Native Parallel Reasoner: A Native Parallel Reasoning Framework Based on Self-Distilled Reinforcement Learning

Paper title: Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled R ...

The Interplay of Pre-Training, Mid-Training, and Reinforcement Learning in Reasoning Models

Paper title: On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Lan ...

Differential Smoothing: Mitigating Distribution Collapse in RL Fine-Tuning and Improving LLM Reasoning

Paper title: Differential Smoothing Mitigates Sharpening and Improves LLM Reasoning ...

Natural Language Actor-Critic (NLAC): Scalable Off-Policy Learning in Language Space

Paper title: Natural Language Actor-Critic: Scalable Off-Policy Learning in Language ...

AAAI 2026: DeltaEdit Enables Sequential Knowledge Editing for LLMs

Paper title: On the Superimposed Noise Accumulation Problem in Sequential Knowledge ...

Search-R1 Reproduction Keeps Failing? The Real Culprit Behind GRPO Training Instability and How to Counter It

Paper title: On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death S ...

PretrainZero: An Active Learning Framework That Moves Reinforcement Learning Into the Pretraining Stage

Paper title: PretrainZero: Reinforcement Active Pretraining Paper link: https://arxiv.org ...

Bias Correction and Confidence Interval Construction in LLM-as-a-Judge Evaluation

Paper title: How to Correctly Report LLM-as-a-Judge Evaluations Paper link: https://arxiv ...

Qwen Releases MiniRL: Research and Practice on the Stability of Large-Scale RL Training

Paper title: Stabilizing Reinforcement Learning with LLMs: Formulation and Practices ...