Assessing authenticity and anonymity of synthetic user-generated content in the medical domain.

Tomohiro Nishiyama; Lisa Raithel; Roland Roller; Pierre Zweigenbaum; Eiji Aramaki
In: Proceedings of CALD-Pseudo at EACL 2024. Conference of the European Chapter of the Association for Computational Linguistics (EACL-2024), March 17-22, St. Julian's, Malta, ACL, 2024.


Since medical text cannot be shared easily due to privacy concerns, synthetic data holds considerable potential for natural language processing applications. In the context of social media and user-generated messages about drug intake and adverse drug effects, this work presents different methods for examining the authenticity of synthetic text. We conclude that the generated tweets are untraceable and, from a medical point of view, authentic enough to be used as a replacement for a real Twitter corpus. However, original data might still be the preferred choice, as they are considerably more diverse.