Discovering Language Model Behaviors with Model-Written Evaluations
About this item
Author / Creator
Perez, Ethan; Ringer, Sam; Lukošiūtė, Kamilė; Nguyen, Karina; Chen, Edwin; Scott, Heiner; Pettit, Craig; Olsson, Catherine; Kundu, Sandipan; Kadavath, Saurav; Jones, Andy; Chen, Anna; Mann, Ben; Israel, Brian; Seethor, Bryan; McKinnon, Cameron; Olah, Christopher; Yan, Da; Amodei, Daniela; Amodei, Dario; Drain, Dawn; Li, Dustin; Tran-Johnson, Eli; Khundadze, Guro; Jackson Kernion; Landis, James; Kerr, Jamie; Mueller, Jared; Jeeyoon Hyun; Landau, Joshua; Ndousse, Kamal; Goldberg, Landon; Lovitt, Liane; Lucas, Martin; Sellitto, Michael; Zhang, Miranda; Kingsland, Neerav; Nelson Elhage; Nicholas, Joseph; Mercado, Noemí; DasSarma, Nova; Rausch, Oliver; Larson, Robin; McCandlish, Sam; Johnston, Scott; Kravec, Shauna; Sheer El Showk; Lanham, Tamera; Telleen-Lawton, Timothy; Brown, Tom; Henighan, Tom; Hume, Tristan; Bai, Yuntao; Hatfield-Dodds, Zac; Clark, Jack; Bowman, Samuel R; Askell, Amanda; Grosse, Roger; Hernandez, Danny; Ganguli, Deep; Hubinger, Evan; Schiefer, Nicholas; Kaplan, Jared
Publisher
Ithaca: Cornell University Library, arXiv.org
Language
English
More information
Scope and Contents
Contents
As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approach...
Identifiers
Primary Identifiers
Record Identifier
TN_cdi_proquest_journals_2755992596
Permalink
https://devfeature-collection.sl.nsw.gov.au/record/TN_cdi_proquest_journals_2755992596
Other Identifiers
E-ISSN
2331-8422