Safe Alignment of LLMs under Reward Over-Optimization: Mitigating Catastrophic Goodhart Effect via Rényi Regularization - School of Mathematics and Statistics, Northeast Normal University

Posted: June 15, 2026, 09:30

Speaker: Huiming Zhang (张慧铭)

Venue: Lecture Hall 523, Weizhen Building (惟真楼), Renmin Street Campus

Talk time: Monday, June 15, 2026, 15:30 to 16:30

Host: School of Mathematics and Statistics

Abstract:

This talk addresses the challenge of machine learning generalization under heavy-tailed distributions, providing a theoretical foundation for the stable improvement of models trained with methods such as Reinforcement Learning from Human Feedback (RLHF). We propose a novel heavy-tailed information-theoretic framework for generalization bounds. Traditional information-theoretic bounds typically control the generalization error using KL divergence, which relies on the existence of the moment generating function (MGF) of the loss or reward, or on sub-Gaussian tail assumptions. However, in modern machine learning paradigms such as RLHF and stochastic optimization, losses and rewards are often heavy-tailed, so the MGF may not exist and KL-based tools become ineffective. Even when the KL mutual information is small, rare extreme events can dominate generalization performance. We show that under heavy-tailed rewards, KL divergence-based trust regions in RLHF can suffer from the "Catastrophic Goodhart effect", in which the proxy reward can blow up without bound even when the KL constraint is small. In contrast, RLHF with Rényi regularization effectively controls reward inflation under heavy-tailed sub-Weibull rewards. Furthermore, for the Best-of-N strategy, we show that Rényi-regularized alignment provides finite reward guarantees and ensures that Best-of-N policies remain well-controlled. Finally, we successfully apply Rényi regularization to Direct Preference Optimization (DPO) for LLMs.
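As a rough illustration of why a small KL budget may fail to cap a heavy-tailed proxy reward, the sketch below compares Best-of-N selection under two reward distributions. It is not the speaker's analysis and does not implement Rényi regularization. It assumes rewards are drawn i.i.d. from a continuous distribution, in which case the KL divergence of the Best-of-N reward distribution from the base distribution is log N - (N - 1)/N. The Pareto distribution (tail index `alpha`) stands in for a heavy-tailed reward and the unit-rate exponential for a light-tailed one; both choices, and all function names, are illustrative assumptions.

```python
import math


def best_of_n_kl(n: int) -> float:
    """KL(best-of-n || base) for i.i.d. continuous rewards: log n - (n - 1)/n."""
    return math.log(n) - (n - 1) / n


def expected_max_pareto(n: int, alpha: float) -> float:
    """Exact E[max of n] for Pareto(alpha) rewards with minimum value 1.

    Closed form: Gamma(1 - 1/alpha) * Gamma(n + 1) / Gamma(n + 1 - 1/alpha),
    which requires alpha > 1 so that the mean is finite.
    """
    if alpha <= 1:
        raise ValueError("alpha must exceed 1 for a finite mean")
    log_value = (
        math.lgamma(1 - 1 / alpha)
        + math.lgamma(n + 1)
        - math.lgamma(n + 1 - 1 / alpha)
    )
    return math.exp(log_value)


def expected_max_exponential(n: int) -> float:
    """Exact E[max of n] for Exponential(1) rewards: the n-th harmonic number."""
    return sum(1.0 / k for k in range(1, n + 1))


if __name__ == "__main__":
    alpha = 1.5  # hypothetical tail index chosen only for illustration
    print(f"{'n':>6} {'KL':>8} {'E[max] exp':>12} {'E[max] Pareto':>14}")
    for n in (1, 10, 100, 1000, 10000):
        print(
            f"{n:>6} {best_of_n_kl(n):>8.3f} "
            f"{expected_max_exponential(n):>12.3f} "
            f"{expected_max_pareto(n, alpha):>14.3f}"
        )
```

Under these assumptions the KL divergence grows only logarithmically in N. The expected exponential reward grows at the same logarithmic rate, while the expected Pareto reward grows like N^(1/alpha), that is, exponentially in the KL budget. This is the kind of reward inflation that a KL constraint alone does not prevent.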

About the speaker:

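Huiming Zhang (张慧铭) is a tenure-track Associate Professor and master's supervisor at the Institute of Artificial Intelligence, Beihang University, and an adjunct doctoral supervisor at Beihang's School of Mathematical Sciences. He was a postdoctoral researcher at the University of Macau under the 濠江学者 (Haojiang Scholar) program (2020-2022), and studied at Peking University (2016-2020), where he received his Ph.D. in statistics under Academician Song Xi Chen (陈松蹊). His research interests include robust machine learning, the mathematical theory of AI (generalization error, non-asymptotic and small-sample theory), high-dimensional probability and statistics (penalized estimation, concentration inequalities), subsampling algorithms, and Lévy processes. He has published 30+ SCI-indexed papers, including in the top AI and automation journals JMLR and IEEE-TAC, the top statistics journals JASA and Biometrika, the top actuarial journal IME, and the Nature Portfolio journal Scientific Reports. His work has 1000+ citations on Google Scholar, and five of his papers are or have been highly cited papers. He has served as a reviewer for Mathematical Reviews (USA) and as a referee for top journals in probability and statistics, AI, and machine learning (AOS, AOAP, JASA, JMLR, IEEE-TSP).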