Publication

PyDSMC: Statistical Model Checking for Neural Agents Using the Gymnasium Interface

Timo P. Gros; Arnd Hartmanns; Ivo Hoese; Joshua Meyer; Nicola Müller; Verena Wolf

In: Pavithra Prabhakar; Andrea Vandin (Hrsg.). Quantitative Evaluation of Systems and Formal Modeling and Analysis of Timed Systems - Second International Joint Conference, Proceedings. International Joint Conference on Quantitative Evaluation of Systems and Formal Modeling and Analysis of Timed Systems (QEST+FORMATS-2025), August 26-28, Aarhus, Denmark, Pages 134-156, Lecture Notes in Computer Science (LNCS), Vol. 16143, ISBN 978-3-032-05791-4, Springer Nature, Switzerland, 10/2025.

Abstract

Artificial intelligence (AI) has achieved remarkable success in sequential decision-making. However, evaluating its neural agents remains challenging, as current methods often rely on interpreting training curves only, overlooking key statistical factors. Existing tools that allow a formal evaluation also require white-box formal models, making them impractical for most AI benchmarks based on the black-box Gymnasium interface. We introduce PyDSMC, a lightweight and easy-to-use Python tool for statistical model checking of neural agents on arbitrary Gymnasium environments. PyDSMC automates the selection of statistical methods to compute confidence intervals, supporting both convergence-based and resource-limited evaluation settings. We empirically demon-strate the importance of rigorous agent evaluation and showcase PyDSMC's capabilities to more reliably judge and report an AI agent's performance.