Safe Alignment of LLMs under Reward Over-Optimization: Mitigating Catastrophic Goodhart Effect via Rényi Regularization
Posted: June 15, 2026, 09:30

Speaker: 张慧铭 (Zhang Huiming)

Venue: Lecture Hall 523, 惟真楼 (Weizhen Building), 人民大街校区 (Renmin Street Campus)

Talk time: Monday, June 15, 2026, 15:30-16:30

Host: School of Mathematics and Statistics

Abstract:

This talk addresses generalization in machine learning under heavy-tailed distributions and provides a theoretical foundation for stably improving models trained with methods such as Reinforcement Learning from Human Feedback (RLHF). We propose a new information-theoretic framework for generalization bounds under heavy tails. Traditional information-theoretic bounds typically control the generalization error through the KL divergence, which requires either that the moment generating function (MGF) of the loss or reward exists or that the tails are sub-Gaussian. In modern machine learning settings such as RLHF and stochastic optimization, however, losses and rewards are often heavy-tailed, so the MGF may not exist and KL-based tools become ineffective. Even when the KL mutual information is small, rare extreme events can dominate generalization performance. We show that under heavy-tailed rewards, KL-divergence-based trust regions in RLHF can suffer from the "Catastrophic Goodhart effect": the proxy reward can grow without bound even when the KL constraint is small. In contrast, RLHF with Rényi regularization effectively controls reward inflation under heavy-tailed sub-Weibull rewards. Furthermore, for the Best-of-n strategy, we show that Rényi-regularized alignment provides finite reward guarantees and ensures that Best-of-n policies remain well controlled. Finally, we apply Rényi regularization to Direct Preference Optimization (DPO) for LLMs.
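
The contrast between KL and Rényi penalties described in the abstract can be illustrated with a small toy calculation. The sketch below is not the speaker's construction; it is a hypothetical example that assumes a Pareto-distributed proxy reward (shape `a`, minimum 1) under the reference policy and a policy that, with probability `eps`, resamples the reward from the tail {X > t}. The function name `tail_shift_metrics`, the schedule `eps = 1 / log(t)^2`, and the default parameters are all assumptions chosen for illustration. It uses the standard definitions KL(π‖π_ref) = E_π[log(dπ/dπ_ref)] and D_α(π‖π_ref) = (α − 1)⁻¹ log E_π_ref[(dπ/dπ_ref)^α] for α > 1.

```python
import math

def tail_shift_metrics(t, a=2.0, alpha=2.0):
    """Toy policy: with prob. eps, resample the reward from the tail {X > t}.

    Reference reward X ~ Pareto(shape=a, minimum=1), so P(X > t) = t**(-a).
    Requires t > e (so that eps < 1), a > 1 (finite mean), and alpha > 1.
    Returns (KL, Renyi divergence of order alpha, expected proxy reward).
    """
    p = t ** (-a)                     # tail probability under the reference
    eps = 1.0 / math.log(t) ** 2      # assumed mixing weight, shrinks slowly in t
    hi = (1.0 - eps) + eps / p        # density ratio d(pi)/d(pi_ref) on {X > t}
    lo = 1.0 - eps                    # density ratio on {X <= t}

    # KL(pi || pi_ref) = E_pi[log ratio]; policy mass on each region is p_ref * ratio.
    kl = p * hi * math.log(hi) + (1.0 - p) * lo * math.log(lo)

    # Renyi divergence of order alpha: log E_ref[ratio**alpha] / (alpha - 1).
    renyi = math.log(p * hi ** alpha + (1.0 - p) * lo ** alpha) / (alpha - 1.0)

    # E[X] = a / (a - 1) and E[X | X > t] = a * t / (a - 1) for a Pareto reward.
    reward = (1.0 - eps) * a / (a - 1.0) + eps * a * t / (a - 1.0)
    return kl, renyi, reward

for t in (1e2, 1e4, 1e8):
    kl, renyi, reward = tail_shift_metrics(t)
    print(f"t={t:.0e}  KL={kl:.4f}  Renyi(alpha=2)={renyi:.2f}  E[reward]={reward:.1f}")
```

Along this particular family, pushing the threshold `t` outward makes the KL divergence shrink while the expected proxy reward grows, whereas the order-2 Rényi divergence grows with `t`. This only shows that a Rényi penalty reacts to the rare-tail shift that a KL penalty barely registers; it does not reproduce the guarantees for sub-Weibull rewards stated in the abstract.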

About the speaker:

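张慧铭 (Zhang Huiming) is an Associate Professor (tenure-track) and master's supervisor at the Institute of Artificial Intelligence, Beihang University (北航), and an adjunct doctoral supervisor at Beihang's School of Mathematical Sciences. Zhang was a 濠江学者 postdoctoral researcher at the University of Macau (2020-2022) and earned a PhD in statistics from Peking University (2016-2020), supervised by Academician 陈松蹊 (Chen Songxi). Research interests include robust machine learning; the mathematical theory of AI (generalization error, non-asymptotic and small-sample theory); high-dimensional probability and statistics (penalized estimation, concentration inequalities); subsampling algorithms; and Lévy processes. Zhang has published 30+ SCI-indexed papers, including in the top AI and automation journals JMLR and IEEE-TAC, the top statistics journals JASA and Biometrika, the top actuarial journal IME, and the Nature-portfolio journal Scientific Reports, with 1000+ Google Scholar citations (five of the papers are or have been highly cited papers). Zhang has served as a reviewer for Mathematical Reviews (USA) and as a referee for top journals in probability and statistics, AI, and machine learning (AOS, AOAP, JASA, JMLR, IEEE-TSP).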

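©2019 School of Mathematics and Statistics, 东北师范大学 (Northeast Normal University). All rights reserved.

Address: 5268 Renmin Street, Changchun, Jilin Province  Postal code: 130024  Tel: 0431-85099589  Fax: 0431-85098237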
