
Publication

ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming

Simone Tedeschi; Felix Friedrich; Patrick Schramowski; Kristian Kersting; Roberto Navigli; Huu Nguyen; Bo Li
In: Computing Research Repository (CoRR), Vol. abs/2404.08676, pages 1-17, arXiv, 2024.

Abstract

When building Large Language Models (LLMs), it is paramount to bear safety in mind and protect them with guardrails. Indeed, LLMs should never generate content promoting or normalizing harmful, illegal, or unethical behavior that may contribute to harm to individuals or society. This principle applies to both normal and adversarial use. In response, we introduce ALERT, a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy. It is designed to evaluate the safety of LLMs through red teaming methodologies and consists of more than 45k instructions categorized using our novel taxonomy. By subjecting LLMs to adversarial testing scenarios, ALERT aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models. Furthermore, the fine-grained taxonomy enables researchers to perform an in-depth evaluation that also helps assess alignment with various policies. In our experiments, we extensively evaluate 10 popular open- and closed-source LLMs and demonstrate that many of them still struggle to attain reasonable levels of safety. Warning: this paper contains content that might be offensive or upsetting in nature.
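To illustrate the kind of per-category evaluation the abstract describes, the sketch below shows a minimal red-teaming loop over a set of categorized instructions. This is not the paper's official evaluation code: the JSONL file name, the "prompt"/"category" field names, and the generate and is_safe stand-ins are assumptions introduced purely for illustration; in a real run they would call the LLM under test and an auxiliary safety judge.

```python
import json
from collections import defaultdict

# Hypothetical stand-ins for the LLM under test and a safety classifier.
def generate(prompt: str) -> str:
    return "I cannot help with that."          # placeholder model response

def is_safe(prompt: str, response: str) -> bool:
    return "cannot help" in response.lower()   # placeholder safety judgment

def evaluate(path: str) -> dict:
    """Compute a per-category safety score over a JSONL file whose records
    carry a red-teaming 'prompt' and its taxonomy 'category' (assumed schema)."""
    safe, total = defaultdict(int), defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            category = record["category"]
            response = generate(record["prompt"])
            total[category] += 1
            safe[category] += int(is_safe(record["prompt"], response))
    # Fraction of responses judged safe, broken down by taxonomy category.
    return {c: safe[c] / total[c] for c in total}

if __name__ == "__main__":
    scores = evaluate("alert_prompts.jsonl")   # hypothetical file name
    for category, score in sorted(scores.items()):
        print(f"{category}: {score:.2%} safe responses")
```

Reporting scores per taxonomy category, rather than a single aggregate number, is what allows the fine-grained analysis and policy-specific alignment checks mentioned in the abstract.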

More Links