Publication

Automatic Security-Flaw Detection — Towards a Fair Evaluation and Comparison

Bernhard Berger; Christina Plump

In: Software and Systems Modeling (SoSyM), Springer Science and Business Media LLC, 2025.

Abstract

Threat modeling is an essential step in secure software system development. It is a (so far) manual, attacker-centric approach for identifying architecture-level security flaws during the planning phase of software systems. In recent years, academia has presented ideas for automating threat detection that do not focus on a particular class of security flaws but instead offer means for pattern-based descriptions of security flaws. However, comparisons of the presented ideas (tools) for automated threat detection are prone to unintended bias or restricted information content. In this work, we investigate the process of comparing automatic security flaw detection tools, clarify common pitfalls in this process, and propose a fair, reproducible, and informative comparison approach to be used as a community standard. We additionally discuss the steps the community needs to take to implement this approach effectively and to support improved comparisons and evaluations in the future. We use a previously published case study to identify problems with current comparison techniques and, as our main contribution, classify different levels of comparison for future reference. As a consequence, we propose using a model-based approach for specifying security flaws and apply an existing natural-language-based catalogue to this model-based approach. Furthermore, we introduce an inspection process model (providing a standard for specifying the findings of a threat detection process) to streamline the evaluation and comparison of automatic security flaw detection tools. We provide an exemplary evaluation of this detection guideline and inspection process model using both automatic approaches from the original case study. All artefacts of this work are publicly available to support the research community and to create a common baseline for future tool comparisons.