Mitigating adversarial manipulation in LLMs: a prompt-based approach to counter Jailbreak attacks (P...

https://devfeature-collection.sl.nsw.gov.au/record/TN_cdi_doaj_primary_oai_doaj_org_article_4e09a8cf4a654b9fb37117a1ad5498e5

Mitigating adversarial manipulation in LLMs: a prompt-based approach to counter Jailbreak attacks (Prompt-G)

About this item

Full title

Mitigating adversarial manipulation in LLMs: a prompt-based approach to counter Jailbreak attacks (Prompt-G)

Author / Creator

Pingua, Bhagyajit , Murmu, Deepak , Kandpal, Meenakshi , Rautaray, Jyotirmayee , Mishra, Pranati , Barik, Rabindra Kumar and Saikia, Manob Jyoti

Publisher

United States: PeerJ. Ltd

Journal title

PeerJ. Computer science, 2024-10, Vol.10, p.e2374, Article e2374

Language

English

Formats

Articles

Publication information

Publisher

United States: PeerJ. Ltd

Subjects

Subjects and topics

More information

Scope and Contents

Contents

Large language models (LLMs) have become transformative tools in areas like text generation, natural language processing, and conversational AI. However, their widespread use introduces security risks, such as jailbreak attacks, which exploit LLM’s vulnerabilities to manipulate outputs or extract sensitive information. Malicious actors can use LLMs to spread misinformation, manipulate public opinion, and promote harmful ideologies, raising ethical concerns. Balancing safety and accuracy require carefully weighing potential risks against benefits. Prompt Guarding (Prompt-G) addresses these challenges by using vector databases and embedding techniques to assess the credibility of generated text, enabling real-time detection and filtering of malicious content. We collected and analyzed a dataset of Self Reminder attacks to identify and mitigate jailbreak attacks, ensuring that the LLM generates safe and accurate responses. In various attack scenarios, Prompt-G significantly reduced jailbreak success rates and effectively identified prompts that caused confusion or distraction in the LLM. Integrating our model with Llama 2 13B chat reduced the attack success rate (ASR) to 2.08%. The source code is available at:
https://doi.org/10.5281/zenodo.13501821
....

Alternative Titles

Full title

Mitigating adversarial manipulation in LLMs: a prompt-based approach to counter Jailbreak attacks (Prompt-G)

Authors, Artists and Contributors

Author / Creator

Pingua, Bhagyajit
Murmu, Deepak
Kandpal, Meenakshi
Rautaray, Jyotirmayee
Mishra, Pranati
Barik, Rabindra Kumar
Saikia, Manob Jyoti

Identifiers

Primary Identifiers

Record Identifier

TN_cdi_doaj_primary_oai_doaj_org_article_4e09a8cf4a654b9fb37117a1ad5498e5

Permalink

https://devfeature-collection.sl.nsw.gov.au/record/TN_cdi_doaj_primary_oai_doaj_org_article_4e09a8cf4a654b9fb37117a1ad5498e5

Other Identifiers

ISSN

2376-5992

E-ISSN

2376-5992

DOI

10.7717/peerj-cs.2374

How to access this item

View record in Gale

About resource

View in old catalogue

Mitigating adversarial manipulation in LLMs: a prompt-based approach to counter Jailbreak attacks (P...

Mitigating adversarial manipulation in LLMs: a prompt-based approach to counter Jailbreak attacks (P...

Mitigating adversarial manipulation in LLMs: a prompt-based approach to counter Jailbreak attacks (Prompt-G)

About this item

Publication information

Subjects

More information

Scope and Contents

Alternative Titles

Authors, Artists and Contributors

Identifiers

Primary Identifiers

Other Identifiers

How to access this item

Connecting people and collections

Indigenous engagement

Learning

Stories