Why Do Reward Models Keep Failing to Learn Reasoning? Uncovering the Overlooked "Distance Bias" in the BT Loss

Paper title: When Distance Distracts: Representation Distance Bias in BT-Loss for R ...

QwenLong-L1.5: A Post-Training Recipe for Long-Context Reasoning and Memory Management

Paper title: QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memo ...

Motif-2-12.7B-Reasoning: A Practitioner's Guide to RL Training Recipes and Full-Stack Optimization

Paper title: Motif-2-12.7B-Reasoning: A Practitioner’s Guide to RL Training Recipes ...

NVIDIA Proposes End-to-End RL Orchestration, with an 8B Model Surpassing GPT-5 on the HLE Leaderboard

Paper title: ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orch ...

Is LayerNorm Really Indispensable? Understanding Derf, Which Goes Beyond Normalization Layers

Paper title: Stronger Normalization-Free Transformers  Paper link: https://arxiv.org/pdf/2 ...

UBC & DeepMind 揭示“短上下文主导”现象:80%的生成任务只需最后96个Token

论文标题:SHORT-CONTEXT DOMINANCE: HOW MUCH LOCAL CONTEXT NATURAL LANGUAGE ACTUAL ...

谷歌 DeepMind & MIT 发布智能体 Scaling Law

论文标题:Towards a Science of Scaling Agent Systems 论文链接:https://arxiv.org/pdf ...

Native Parallel Reasoner: A Native Parallel Reasoning Framework Built on Self-Distilled Reinforcement Learning

Paper title: Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled R ...

The Interplay of Pre-Training, Mid-Training, and Reinforcement Learning in Reasoning Models

Paper title: On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Lan ...

Differential Smoothing: Mitigating Distribution Collapse During RL Fine-Tuning and Improving LLM Reasoning

Paper title: Differential Smoothing Mitigates Sharpening and Improves LLM Reasoning ...