Publication

PyDSMC: Statistical Model Checking for Neural Agents Using the Gymnasium Interface

Timo P. Gros; Arnd Hartmanns; Ivo Hoese; Joshua Meyer; Nicola Müller; Verena Wolf
In: Pavithra Prabhakar; Andrea Vandin (Eds.). Quantitative Evaluation of Systems and Formal Modeling and Analysis of Timed Systems: Second International Joint Conference, QEST+FORMATS 2025, Aarhus, Denmark, August 26–28, 2025, Proceedings, Vol. 16143, Pages 134–156, Springer Nature, Switzerland, 10/2025.

Abstract

Artificial intelligence (AI) has achieved remarkable success in sequential decision-making. However, evaluating its neural agents remains challenging, as current methods often rely on interpreting training curves only, overlooking key statistical factors. Existing tools that allow a formal evaluation also require white-box formal models, making them impractical for most AI benchmarks based on the black-box Gymnasium interface. We introduce PyDSMC, a lightweight and easy-to-use Python tool for statistical model checking of neural agents on arbitrary Gymnasium environments. PyDSMC automates the selection of statistical methods to compute confidence intervals, supporting both convergence-based and resource-limited evaluation settings. We empirically demonstrate the importance of rigorous agent evaluation and showcase PyDSMC's capabilities to more reliably judge and report an AI agent's performance.
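
To make the idea of a confidence interval over black-box episodes concrete, the following is a minimal sketch and not PyDSMC's API. It assumes a hypothetical toy environment (`CoinWalkEnv`) that only mimics the Gymnasium `reset`/`step` return convention, a bounded episode return, and a Hoeffding bound chosen here purely for illustration; the abstract does not say which statistical methods PyDSMC selects.

```python
import math
import random


class CoinWalkEnv:
    """Hypothetical toy environment mimicking Gymnasium's reset/step convention."""

    def __init__(self, horizon=10, seed=0):
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return 0, {}  # (observation, info)

    def step(self, action):
        self.t += 1
        # Reward 1 with probability 0.7 for action 1, otherwise 0.3.
        p = 0.7 if action == 1 else 0.3
        reward = 1.0 if self.rng.random() < p else 0.0
        terminated = False
        truncated = self.t >= self.horizon  # episode ends at the horizon
        return self.t, reward, terminated, truncated, {}


def episode_return(env, policy):
    """Run one episode, treating env and policy as black boxes."""
    obs, _ = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, terminated, truncated, _ = env.step(policy(obs))
        total += reward
        done = terminated or truncated
    return total


def hoeffding_episodes(eps, delta, lo, hi):
    """Episodes needed so that |mean - true mean| <= eps holds
    with probability >= 1 - delta, for returns bounded in [lo, hi]."""
    return math.ceil((hi - lo) ** 2 * math.log(2 / delta) / (2 * eps ** 2))


def estimate_return(env, policy, eps, delta, lo, hi):
    """Return (sample mean, confidence interval, number of episodes)."""
    n = hoeffding_episodes(eps, delta, lo, hi)
    mean = sum(episode_return(env, policy) for _ in range(n)) / n
    return mean, (max(lo, mean - eps), min(hi, mean + eps)), n


# Assumed setup: a constant policy and returns known to lie in [0, 10].
env = CoinWalkEnv()
mean, ci, n = estimate_return(env, lambda obs: 1, eps=0.5, delta=0.05, lo=0.0, hi=10.0)
print(n, mean, ci)
```

In this sketch the number of episodes is fixed in advance from the requested half-width and confidence level. A resource-limited setting would instead fix the episode budget and report the half-width that budget supports.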