The importance of sharing patient-generated clinical speech and language data

Kathleen C. Fraser, Nicklas Linz, Hali Lindsay, Alexandra König

In: Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology: Reconciling Outcomes. Computational Linguistics and Clinical Psychology Workshop (CLPsych-2019), 6th, located at the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), June 6, Minneapolis, MN, United States, 2019.


Increased access to large datasets has driven progress in NLP. However, most computational studies of clinically-validated, patient-generated speech and language involve very few datapoints, as such data are difficult (and expensive) to collect. In this position paper, we argue that we must find ways to promote data sharing across research groups, in order to build datasets of a more appropriate size for NLP and machine learning analysis. We review the benefits and challenges of sharing clinical language data, and suggest several concrete actions by both clinical and NLP researchers to encourage multi-site and multidisciplinary data sharing. We also propose the creation of a collaborative data sharing platform, to allow NLP researchers to take a more active responsibility for data transcription, annotation, and curation.

Deutsches Forschungszentrum für Künstliche Intelligenz
German Research Center for Artificial Intelligence