PhD Student Seminar 2026 [02]
Posted: 2026-03-13
  • Speaker: 郜源善
  • Time: March 17, 2026, 14:00
  • Venue: Lecture Hall 1417, Administration Building, Zijingang Campus, Zhejiang University
  • Organizer: Center for Data Science, Zhejiang University

标题:Beyond PPO: Architecture Reduction, Loss Reconstruction, and Statistical Robustness in LLM Alignment

Abstract:

To guarantee generation quality and safety, Large Language Models (LLMs) must undergo human preference alignment following pre-training and supervised fine-tuning (SFT). Reinforcement Learning from Human Feedback (RLHF), standardized by the PPO algorithm in InstructGPT, remains the predominant paradigm. However, standard PPO suffers from significant memory overhead and training instability due to its intricate Actor-Critic architecture.
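To make the PPO mechanics referenced above concrete, the sketch below shows the clipped surrogate objective at the heart of PPO for a single sample. The function name and scalar formulation are illustrative (real implementations operate on tensors of ratios and advantages); the clipping rule itself follows the standard PPO objective.

```python
def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective from PPO for one sample.

    `ratio` is pi_theta(a|s) / pi_theta_old(a|s); clipping it to
    [1 - eps, 1 + eps] prevents a single update from moving the
    policy too far from the old policy.
    """
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    # The pessimistic min() means the clipped bound only ever
    # reduces the objective, never inflates it.
    return min(ratio * advantage, clipped * advantage)

# A large ratio with positive advantage is capped at (1 + eps) * A:
print(ppo_clipped_objective(1.5, 2.0))  # -> 2.4
```

Note that PPO additionally trains a Critic (value) network to supply the advantage estimates, which is the main source of the memory overhead discussed above.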

This seminar systematically reviews the recent algorithmic evolution of LLM alignment, tracing a clear trajectory from engineering simplification to fundamental statistical reconstruction. We start with the core principles of PPO and explore how GRPO achieves lightweight online alignment by eliminating the Critic network via group-relative advantage estimation. Transitioning to the offline optimization paradigm, we examine how DPO (Direct Preference Optimization) elegantly recasts the complex RL objective as a standard maximum likelihood estimation problem, and how variants such as SimPO address its empirical biases.
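The two ideas in this paragraph can each be sketched in a few lines. Below, `grpo_advantages` normalizes rewards within a group of responses sampled for the same prompt (replacing the learned Critic baseline), and `dpo_loss` is the per-pair DPO loss, where the implicit reward is the log-ratio between the policy and the reference model. Function names and the scalar, single-pair formulation are illustrative simplifications.

```python
import math
import statistics

def grpo_advantages(rewards):
    """Group-relative advantage estimation (GRPO-style): the baseline
    is the group's mean reward, so no Critic network is needed."""
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mu) / sd for r in rewards]

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair (y_w preferred over y_l):
    -log sigmoid(beta * margin), where the margin is the difference
    of policy-vs-reference log-ratios for the two responses."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

Because `dpo_loss` depends only on log-probabilities of stored responses, it can be minimized offline by ordinary gradient descent, which is exactly the maximum-likelihood reframing described above.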

Finally, the talk will explore the frontier of statistical robustness in alignment techniques. We will discuss how frameworks like IPO (within the $\Psi$-PO family) modify the preference mapping to mitigate overfitting to noisy data. Turning to theoretical guarantees, we will highlight how DRPO adapts the "doubly robust" concept from causal inference, ensuring convergence even when the reference policy or the preference model is misspecified. This review aims to provide a unified perspective on the theoretical and practical advances driving modern LLM alignment.
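As a pointer to the kind of reconstruction the last part of the talk concerns, the IPO objective (a sketch in the $\Psi$-PO notation, with $\pi_{\mathrm{ref}}$ the reference policy and $\tau$ a regularization strength) replaces DPO's log-sigmoid with a squared regression target, which bounds the implicit reward and thereby curbs overfitting:

$$\min_{\pi}\; \mathbb{E}_{(x,\,y_w,\,y_l)}\left[\left(\log\frac{\pi(y_w\mid x)\,\pi_{\mathrm{ref}}(y_l\mid x)}{\pi(y_l\mid x)\,\pi_{\mathrm{ref}}(y_w\mid x)} \;-\; \frac{1}{2\tau}\right)^{2}\right].$$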