WPO: Enhancing RLHF with Weighted Preference Optimization

https://devfeature-collection.sl.nsw.gov.au/record/TN_cdi_proquest_journals_3069346944

About this item

Author / Creator

Zhou, Wenxuan; Agrawal, Ravi; Zhang, Shujian; Sathish Reddy Indurthi; Zhao, Sanqiang; Song, Kaiqiang; Xu, Silei; Zhu, Chenguang

Publisher

Ithaca: Cornell University Library, arXiv.org

Journal title

arXiv.org, 2024-10

Language

English

Formats

Articles

More information

Scope and Contents

Contents

Reinforcement learning from human feedback (RLHF) is a promising solution to align large language models (LLMs) more closely with human values. Off-policy preference optimization, where the preference data is obtained from other models, is widely adopted due to its cost efficiency and scalability. However, off-policy preference optimization often suffers from a distributional gap between the policy used for data collection and the target policy, leading to suboptimal optimization. In this paper, we propose a novel strategy to mitigate this problem by simulating on-policy learning with off-policy preference data. Our Weighted Preference Optimization (WPO) method adapts off-policy data to resemble on-policy data more closely by reweighting preference pairs according to their probability under the current policy. This method not only addresses the distributional gap problem but also enhances the optimization process without incurring additional costs. We validate our method on instruction following benchmarks including Alpaca Eval 2 and MT-bench. WPO not only outperforms Direct Preference Optimization (DPO) by up to 5.6% on Alpaca Eval 2 but also establishes a remarkable length-controlled winning rate against GPT-4-turbo of 76.7% based on Gemma-2-9b-it. We release the code and models at https://github.com/wzhouad/WPO....
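The reweighting idea described in the abstract can be illustrated with a minimal sketch. It assumes a DPO-style pairwise loss and weights each preference pair by the length-normalized probability of its two responses under the current policy. The abstract does not state the exact weighting formula, so the weight form, the `PreferencePair` container, the function names, and the `beta` value are all illustrative assumptions, not the authors' released implementation (the repository linked in the abstract is the reference for that).

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """Hypothetical container: summed token log-probs of one preference pair."""
    policy_logp_chosen: float    # log pi_theta(y_w | x) under the current policy
    policy_logp_rejected: float  # log pi_theta(y_l | x) under the current policy
    ref_logp_chosen: float       # log pi_ref(y_w | x) under the frozen reference
    ref_logp_rejected: float     # log pi_ref(y_l | x) under the frozen reference
    len_chosen: int              # number of tokens in y_w
    len_rejected: int            # number of tokens in y_l


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_pair_loss(pair: PreferencePair, beta: float = 0.1) -> float:
    """Standard DPO loss for one pair: -log sigmoid(beta * reward margin)."""
    chosen_reward = pair.policy_logp_chosen - pair.ref_logp_chosen
    rejected_reward = pair.policy_logp_rejected - pair.ref_logp_rejected
    return -log_sigmoid(beta * (chosen_reward - rejected_reward))


def pair_weight(pair: PreferencePair) -> float:
    """Assumed weight: product of the per-token (length-normalized)
    probabilities of both responses under the current policy.
    Pairs the current policy could plausibly have generated get weights
    near 1; pairs that are far off-policy get weights near 0."""
    avg_chosen = pair.policy_logp_chosen / pair.len_chosen
    avg_rejected = pair.policy_logp_rejected / pair.len_rejected
    return math.exp(avg_chosen + avg_rejected)


def weighted_preference_loss(batch: list[PreferencePair], beta: float = 0.1) -> float:
    """Mean of weight * DPO loss over the batch (weights treated as constants)."""
    total = sum(pair_weight(p) * dpo_pair_loss(p, beta) for p in batch)
    return total / len(batch)


if __name__ == "__main__":
    likely = PreferencePair(-5.0, -6.0, -5.0, -6.0, 10, 10)
    unlikely = PreferencePair(-50.0, -60.0, -50.0, -60.0, 10, 10)
    print("weight (likely pair):  ", pair_weight(likely))
    print("weight (unlikely pair):", pair_weight(unlikely))
    print("weighted batch loss:   ", weighted_preference_loss([likely, unlikely]))
```

In a gradient-based trainer the weight would typically be computed without gradient tracking so that it only rescales each pair's contribution; that detail is likewise an assumption of this sketch.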

Alternative Titles

Full title

WPO: Enhancing RLHF with Weighted Preference Optimization

Authors, Artists and Contributors

Author / Creator

Zhou, Wenxuan
Agrawal, Ravi
Zhang, Shujian
Sathish Reddy Indurthi
Zhao, Sanqiang
Song, Kaiqiang
Xu, Silei
Zhu, Chenguang

Identifiers

Primary Identifiers

Record Identifier

TN_cdi_proquest_journals_3069346944

Permalink

https://devfeature-collection.sl.nsw.gov.au/record/TN_cdi_proquest_journals_3069346944

Other Identifiers

E-ISSN

2331-8422

How to access this item

Full text available
