Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

https://devfeature-collection.sl.nsw.gov.au/record/TN_cdi_proquest_journals_2715606750

Publication information

Publisher

Ithaca: Cornell University Library, arXiv.org

More information

Scope and Contents

Contents

We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM)...

Alternative Titles

Full title

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Identifiers

Primary Identifiers

Record Identifier

TN_cdi_proquest_journals_2715606750

Permalink

https://devfeature-collection.sl.nsw.gov.au/record/TN_cdi_proquest_journals_2715606750

Other Identifiers

E-ISSN

2331-8422
