Safe Alignment of LLMs under Reward Over-Optimization: Mitigating Catastrophic Goodhart Effect via Rényi Regularization
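Speaker: Huiming Zhang (张慧铭)
Venue: Lecture Hall 523, Weizhen Building, Renmin Street Campus
Time: Monday, June 15, 2026, 15:30-16:30
Host: School of Mathematics and Statistics
Abstract: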
This talk addresses the challenge of machine learning generalization under heavy-tailed distributions, providing a theoretical foundation for the stable improvement of methods such as Reinforcement Learning from Human Feedback (RLHF). We propose a novel heavy-tailed information-theoretic framework for generalization bounds. Traditional information-theoretic bounds typically control the generalization error using KL divergence, which relies on the existence of the moment generating function (MGF) of the loss or reward, or on sub-Gaussian tail assumptions. However, in modern machine learning paradigms such as RLHF and stochastic optimization, losses and rewards are often heavy-tailed, so the MGF may not exist and KL-based tools become ineffective. Even when the KL mutual information is small, rare extreme events can dominate generalization performance. We show that under heavy-tailed rewards, KL divergence-based trust regions in RLHF can suffer from the "Catastrophic Goodhart effect", in which the proxy reward can grow without bound even when the KL constraint is small. In contrast, RLHF with Rényi regularization effectively controls reward inflation under heavy-tailed sub-Weibull rewards. Furthermore, for the Best-of-N strategy, we show that Rényi-regularized alignment provides finite reward guarantees and ensures that Best-of-N policies remain well-controlled. Finally, we apply Rényi regularization to Direct Preference Optimization (DPO) for LLMs.
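As a rough illustration of why a Rényi penalty can be stricter than a KL penalty, the sketch below compares the two divergences on a toy discrete example. It uses the standard textbook definition of the Rényi divergence of order alpha, which is not necessarily the exact formulation used in the talk. The distributions `reference` and `policy` and the function names are hypothetical, chosen only so that the policy shifts mass onto an outcome that is rare under the reference.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def renyi_divergence(p, q, alpha):
    """Renyi divergence of order alpha (alpha > 0, alpha != 1), standard definition:
    D_alpha(P || Q) = log(sum_i p_i^alpha * q_i^(1 - alpha)) / (alpha - 1).
    Assumes q[i] > 0 wherever p[i] > 0."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    total = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q) if pi > 0)
    return math.log(total) / (alpha - 1)

# Hypothetical toy example: the third outcome is rare under the reference
# distribution, and the policy moves noticeably more mass onto it.
reference = [0.989, 0.010, 0.001]
policy = [0.900, 0.050, 0.050]

kl = kl_divergence(policy, reference)
renyi_2 = renyi_divergence(policy, reference, alpha=2.0)
renyi_near_1 = renyi_divergence(policy, reference, alpha=1.0001)

print(f"KL divergence:          {kl:.4f}")
print(f"Renyi divergence (a=2): {renyi_2:.4f}")
print(f"Renyi divergence (a~1): {renyi_near_1:.4f}")
```

The Rényi divergence is nondecreasing in its order and tends to the KL divergence as the order approaches 1, so for orders above 1 it weights the rare, over-sampled outcome more heavily than KL does. This is only a numerical intuition on a three-point distribution, not a reproduction of the talk's results.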
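Speaker Bio:
Huiming Zhang is an Associate Professor (tenure-track) and master's supervisor at the Institute of Artificial Intelligence, Beihang University, and an adjunct doctoral supervisor at the School of Mathematical Sciences, Beihang University. Zhang was previously a Haojiang Scholar (濠江学者) postdoctoral fellow at the University of Macau (2020-2022) and studied at Peking University (2016-2020), receiving a PhD in statistics under the supervision of Academician Song Xi Chen (陈松蹊). Research interests: robust machine learning, mathematical theory of AI (generalization error, non-asymptotic/small-sample theory), high-dimensional probability and statistics (penalized estimation, concentration inequalities), subsampling algorithms, and Lévy processes. Zhang has published more than 30 SCI-indexed papers (including in the leading AI and automation journals JMLR and IEEE-TAC; the leading statistics journals JASA and Biometrika; the leading actuarial journal IME; and the Nature Portfolio journal Scientific Reports), with more than 1,000 Google Scholar citations (five of these papers are or have been highly cited papers). Zhang has served as a reviewer for Mathematical Reviews (USA) and as a referee for leading journals in probability and statistics, AI, and machine learning (AOS, AOAP, JASA, JMLR, IEEE-TSP).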