Automatic Induction of Named Entity Classes from Legal Text Corpora

Peter Bourgonje, Anna Breit, Maria Khvalchik, Victor Mireles, Julian Moreno Schneider, Artem Revenko, Georg Rehm

In: Manolis Koubarakis, Harith Alani, Grigoris Antoniou, Kalina Bontcheva, John Breslin, Diego Collarana, Elena Demidova, Stefan Dietze, Simon Gottschalk, Guido Governatori, Aidan Hogan, Freddy Lecue, Elena Montiel Ponsoda, Axel-Cyrille Ngonga Ngomo, Sofia Pinto, Muhammad Saleem, Raphael Troncy, Eleni Tsalapati, Ricardo Usbeck, Ruben Verborgh (eds.). ASLD 2020 -- Advances in Semantics and Linked Data: Joint Workshop Proceedings from ISWC 2020. International Workshop on Artificial Intelligence for Legal Documents (AI4LEGAL-2020), November 2-3, Athens/Virtual, Greece. Pages 1-11, CEUR Workshop Proceedings, 11/2020.


Named Entity Recognition tools and datasets are widely used. The standard pre-trained models, however, often do not cover specific application needs because they are too generic. We introduce a methodology to automatically induce fine-grained classes of named entities for the legal domain. Specifically, given a corpus annotated with instances of coarse entity classes, we show how to induce fine-grained, domain-specific (sub-)classes. The method relies on the predictions that a pre-trained language model generates for masked tokens. These predictions are collected and clustered, and the resulting clusters are taken as the new candidate classes. We implement the method and experiment with a large German-language legal corpus that is manually annotated with almost 54,000 named entities.
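
The collect-and-cluster step can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each annotated entity mention has been masked and that a language model has returned a short list of substitute tokens for it. The `PREDICTIONS` data, the `induce_classes` function, the greedy cosine-similarity clustering, and the 0.5 threshold are all hypothetical choices made for illustration; the abstract does not specify the clustering algorithm.

```python
from collections import Counter
from math import sqrt

# Hypothetical top-k substitutes a masked language model might propose for
# each masked entity mention (toy data, not taken from the paper's corpus).
PREDICTIONS = {
    "mention_1": ["court", "tribunal", "senate", "chamber"],
    "mention_2": ["tribunal", "court", "chamber", "panel"],
    "mention_3": ["statute", "regulation", "act", "code"],
    "mention_4": ["act", "statute", "directive", "code"],
}


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-tokens vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def induce_classes(predictions, threshold=0.5):
    """Greedy single-pass clustering (an assumed stand-in for the real method).

    Each mention joins the most similar existing cluster if the similarity
    reaches the threshold; otherwise it starts a new cluster.
    """
    clusters = []  # each cluster: {"centroid": Counter, "members": [mention, ...]}
    for mention, tokens in predictions.items():
        vec = Counter(tokens)
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["centroid"].update(vec)  # accumulate token counts
            best["members"].append(mention)
        else:
            clusters.append({"centroid": Counter(vec), "members": [mention]})
    return clusters


if __name__ == "__main__":
    for i, cluster in enumerate(induce_classes(PREDICTIONS)):
        # The most frequent predicted tokens hint at a name for the candidate class.
        top_tokens = [t for t, _ in cluster["centroid"].most_common(3)]
        print(i, cluster["members"], top_tokens)
```

On this toy input the two court-like mentions and the two statute-like mentions end up in separate clusters, each of which would then be inspected as a candidate fine-grained class.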



German Research Center for Artificial Intelligence