Publications | Hanyang Zhao

2025

Diffusion RLHF

DiFFPO: Training diffusion LLMs to reason fast and furious via reinforcement learning

Hanyang Zhao, Dawen Liang, Wenpin Tang, and 2 more authors

arXiv preprint arXiv:2510.02212, 2025

Diffusion RLHF

Fine-Tuning Diffusion Generative Models via Rich Preference Optimization

Hanyang Zhao*, Haoxian Chen*, Yucheng Guo*, and 5 more authors

arXiv preprint arXiv:2503.11720, 2025

We propose a pipeline for curating synthetic rich preference pairs that complements existing preference learning pipelines.

ICML 2025

Score as Action: Fine-Tuning Diffusion Generative Models by Continuous-time Reinforcement Learning

Hanyang Zhao, Haoxian Chen, Ji Zhang, and 2 more authors

Forty-Second International Conference on Machine Learning, 2025

Short version in DeLTa Workshop at ICLR 2025.

We propose a continuous-time reinforcement learning framework for aligning (score-based) diffusion models.

ICLR 2025

RainbowPO: A Unified Framework for Combining Improvements in Preference Optimization

Hanyang Zhao*, Genta Indra Winata*, Anirban Das*, and 4 more authors

International Conference on Learning Representations, 2025

We identify the components that are actually effective across the numerous xPO (preference optimization) methods and combine them in a unified framework.

ICLR 2025

MallowsPO: Fine-Tune Your LLM with Preference Dispersions

Haoxian Chen*, Hanyang Zhao*, Henry Lam, and 2 more authors

International Conference on Learning Representations, 2025

Short version in Pluralistic Alignment Workshop at NeurIPS 2024.

We incorporate the concept of dispersion to generalize the Bradley-Terry model and improve the performance of DPO.

2024

NAACL 2025

WorldCuisines: A massive-scale benchmark for multilingual and multicultural visual question answering on global cuisines

Genta Indra Winata, Frederikus Hudi, Patrick Amadeus Irawan, and 8 more authors

2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, 2024

Best Theme Paper Award.

JAIR

Preference tuning with human feedback on language, speech, and vision tasks: A survey

Genta Indra Winata*, Hanyang Zhao*, Anirban Das*, and 4 more authors

Journal of Artificial Intelligence Research, 2024

We wrote a comprehensive survey on preference tuning for aligning generative models, covering models, datasets, and methods.

Diffusion Models

Score-based Diffusion Models via Stochastic Differential Equations–a Technical Tutorial

Wenpin Tang and Hanyang Zhao

Statistical Surveys, 2024

Diffusion Models

Contractive diffusion probabilistic models

Wenpin Tang and Hanyang Zhao

arXiv preprint arXiv:2401.13115, 2024

2023

NeurIPS 2023

Policy optimization for continuous reinforcement learning

Hanyang Zhao, Wenpin Tang, and David Yao

Advances in Neural Information Processing Systems, 2023

@article{zhao2024policy,
  title = {Policy optimization for continuous reinforcement learning},
  author = {Zhao, Hanyang and Tang, Wenpin and Yao, David},
  journal = {Advances in Neural Information Processing Systems},
  volume = {36},
  year = {2023},
}