Meituan proposes DynaMO: a dynamic rollout allocation and advantage modulation strategy for RLVR
Paper title: How to Allocate, How to Learn? Dynamic Rollout Allocation and Advantage
...
Your reasoning model actually knows when to stop: addressing "overthinking" in Long CoT
Paper title: Does Your Reasoning Model Implicitly Know When to Stop Thinking?
Paper link
...
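Xiaohongshu proposes VESPO, a variational sequence-level soft policy optimization that recasts importance sampling from a change-of-measure perspective
Paper title: VESPO: Variational Sequence-Level Soft Policy Optimization for Stable O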
...
New work from the Qwen team: a unified view of attention and residual sinks in Transformers
Paper title: A Unified View of Attention and Residual Sinks: Outlier-Driven Rescalin
...
Less is Enough: synthesizing diverse data in the feature space of large models
Paper title: Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs
Paper
...
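Tencent Hunyuan proposes G-OPD: generalized on-policy distillation and reward extrapolation that go beyond the teacher model
Paper title: Learning beyond Teacher: Generalized On-Policy Distillation with Reward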
...
Tencent Hunyuan proposes Composition-RL: improving the efficiency of LLM reinforcement learning by composing verifiable prompts
Paper title: Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learn
...
Shanghai Jiao Tong University & Qwen propose OPUS: aligning data selection with optimizer geometry in LLM pretraining
Paper title: OPUS: Towards Efficient and Principled Data Selection in Large Language
...
An analysis of entropy dynamics in reinforcement fine-tuning of large language models
Paper title: On the Entropy Dynamics in Reinforcement Fine-Tuning of Large Language
...
Sea AI Lab proposes DPPO: revisiting the trust region in the PPO algorithm
Paper title: Rethinking the Trust Region in LLM Reinforcement Learning
Paper link: https:
...
