Towards Understanding Sycophancy in Language Models
About this item
Author / Creator
Sharma, Mrinank; Tong, Meg; Korbak, Tomasz; Duvenaud, David; Askell, Amanda; Bowman, Samuel R; Cheng, Newton; Durmus, Esin; Hatfield-Dodds, Zac; Johnston, Scott R; Kravec, Shauna; Maxwell, Timothy; McCandlish, Sam; Ndousse, Kamal; Rausch, Oliver; Schiefer, Nicholas; Yan, Da; Zhang, Miranda; Perez, Ethan
Publisher
Ithaca: Cornell University Library, arXiv.org
Language
English
More information
Scope and Contents
Contents
Human feedback is commonly utilized to finetune AI assistants. But human feedback may also encourage model responses that match user beliefs over truthful ones, a behaviour known as sycophancy. We investigate the prevalence of sycophancy in models whose finetuning procedure made use of human feedback, and the potential role of human preference judg...
Alternative Titles
Full title
Towards Understanding Sycophancy in Language Models
Identifiers
Primary Identifiers
Record Identifier
TN_cdi_proquest_journals_2880585773
Permalink
https://devfeature-collection.sl.nsw.gov.au/record/TN_cdi_proquest_journals_2880585773
Other Identifiers
E-ISSN
2331-8422