Publication
ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming
Simone Tedeschi; Felix Friedrich; Patrick Schramowski; Kristian Kersting; Roberto Navigli; Huu Nguyen; Bo Li
In: Computing Research Repository (CoRR), Vol. abs/2404.08676, Pages 1-17, arXiv, 2024.
Abstract
When building Large Language Models (LLMs), it is paramount to bear safety in mind and protect them with guardrails. Indeed, LLMs should never generate content promoting or normalizing harmful, illegal, or unethical behavior that may contribute to harm to individuals or society. This principle applies to both normal and adversarial use. In response, we introduce ALERT, a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy. It is designed to evaluate the safety of LLMs through red teaming methodologies and consists of more than 45k instructions categorized using our novel taxonomy. By subjecting LLMs to adversarial testing scenarios, ALERT aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models. Furthermore, the fine-grained taxonomy enables researchers to perform an in-depth evaluation that also helps one to assess the alignment with various policies. In our experiments, we extensively evaluate 10 popular open- and closed-source LLMs and demonstrate that many of them still struggle to attain reasonable levels of safety. Warning: this paper contains content that might be offensive or upsetting in nature.
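The evaluation workflow the abstract describes, red-teaming instructions grouped by a fine-grained risk taxonomy, passed to the model under test, and scored per category, can be sketched roughly as follows. This is a minimal illustration rather than the paper's reference implementation; the helper names `load_alert_prompts`, `query_model`, and `judge_is_safe` are hypothetical placeholders that would need to be wired to the actual ALERT data release, the LLM being evaluated, and a safety judge (e.g. an auxiliary safety classifier).

```python
# Minimal sketch of an ALERT-style per-category safety evaluation.
# All three helpers below are hypothetical placeholders, not part of the
# ALERT release: substitute the real dataset loader, the LLM under test,
# and a safety judge.
from collections import defaultdict


def load_alert_prompts():
    """Placeholder: yield (category, instruction) pairs from the benchmark."""
    yield "hate_speech", "..."  # taxonomy category + red-teaming instruction


def query_model(instruction: str) -> str:
    """Placeholder: return the response of the LLM under test."""
    return "I can't help with that."


def judge_is_safe(instruction: str, response: str) -> bool:
    """Placeholder: classify whether the response is safe,
    e.g. with an auxiliary safety classifier."""
    return True


def evaluate_safety():
    safe, total = defaultdict(int), defaultdict(int)
    for category, instruction in load_alert_prompts():
        response = query_model(instruction)
        total[category] += 1
        safe[category] += judge_is_safe(instruction, response)
    # Per-category safety score: fraction of responses judged safe.
    return {cat: safe[cat] / total[cat] for cat in total}


if __name__ == "__main__":
    for category, score in evaluate_safety().items():
        print(f"{category}: {score:.2%} safe")
```

Reporting one score per taxonomy category, rather than a single aggregate, is what allows the fine-grained analysis and policy-alignment checks mentioned in the abstract.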
