Speaker: Fang Kong is an Assistant Professor and Ph.D. advisor at the Southern University of Science and Technology. She received her Ph.D. from Shanghai Jiao Tong University and her bachelor's degree from Shandong University. Her research interests include multi-armed bandits, reinforcement learning theory, and their applications to large language models. In recent years, she has published more than 20 papers in top-tier conferences, including SODA, COLT, ICML, and NeurIPS. She has served as an Area Chair for NeurIPS and as a reviewer for COLT, ICML, TPAMI, and other leading venues. She has received several awards, including the CCF Outstanding Doctoral Dissertation Award in Agents and Multi-Agent Systems and the Baidu Scholarship.
Date: June 17, 2026
Time: 10:00-11:00 am
Location: Room 310, Office Building, Software Campus, Shandong University
Sponsor: School of Software, Shandong University
Abstract:
Online learning improves decisions through interaction, where optimism in the face of uncertainty, commonly implemented by upper confidence bound (UCB) methods, drives exploration toward actions with high potential. Such optimistic exploration is essential for achieving favorable long-term regret, but it can be costly when only a limited number of online rounds are available. Offline learning, in contrast, learns from historical data without further interaction. Due to limited data coverage, offline methods often adopt pessimism, commonly implemented by lower confidence bound (LCB) principles, to avoid unsupported actions and obtain decisions with certified performance. Offline-to-online learning seeks to combine the strengths of these two paradigms by starting from offline data and continuing to improve through online interaction. This setting gives rise to a fundamental horizon-dependent trade-off: short horizons favor the LCB solution supported by offline data, whereas long horizons require UCB-style exploration to improve beyond the offline solution. The key challenge is therefore to adaptively decide when to explore optimistically and when to fall back to a pessimistic baseline, so that performance remains competitive across all horizons. We first study this trade-off in stochastic bandits and propose Conservative Optimism with Pessimistic Baseline (COPB), an anytime algorithm that permits UCB exploration only when it is certified to be safe relative to an LCB baseline. This conservative-optimistic design preserves short-horizon reliability while enabling long-horizon improvement. We prove that, at every horizon, COPB competes with the better of LCB and UCB up to constant terms. We further validate its effectiveness against baselines and extend the theoretical framework to reinforcement learning.
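To make the contrast between the two principles concrete, the following is a minimal sketch of how UCB and LCB indices can disagree when offline data covers the arms unevenly. It is not the COPB algorithm, whose decision rule is not specified in this abstract. The Hoeffding-style radius, the function names, and the example data are all illustrative assumptions.

```python
import math


def confidence_radius(n_samples, t, scale=2.0):
    """Hoeffding-style radius for rewards in [0, 1] (assumed form, not from the talk).

    Returns infinity for an arm with no data, so it is maximally
    uncertain under both indices.
    """
    if n_samples == 0:
        return math.inf
    return math.sqrt(scale * math.log(max(t, 2)) / n_samples)


def select_arms(means, counts, t):
    """Return (optimistic_arm, pessimistic_arm) for the given statistics.

    means[a]  : empirical mean reward of arm a (offline plus online data)
    counts[a] : number of samples observed for arm a
    t         : total number of samples seen so far
    """
    arms = range(len(means))
    ucb = [means[a] + confidence_radius(counts[a], t) for a in arms]
    lcb = [means[a] - confidence_radius(counts[a], t) for a in arms]
    optimistic = max(arms, key=lambda a: ucb[a])   # explore high-potential arms
    pessimistic = max(arms, key=lambda a: lcb[a])  # stay with well-supported arms
    return optimistic, pessimistic


# Hypothetical offline dataset: arm 0 is well covered, arm 1 barely covered.
means = [0.6, 0.5]
counts = [100, 2]
optimistic, pessimistic = select_arms(means, counts, t=sum(counts))
print(optimistic, pessimistic)  # UCB prefers arm 1, LCB prefers arm 0
```

With these numbers, the poorly covered arm has a wide confidence interval, so UCB selects it for exploration while LCB stays with the well-covered arm. This is the disagreement that an offline-to-online method has to resolve as the horizon grows.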
For more information, please visit:
https://www.view.sdu.edu.cn/info/1020/212781.htm