Publication

Data Categories as the link between Language archives and NLP tools

Thierry Declerck; Mirjam Keßler

In: Proceedings of the Workshop "Language Archives: Standards, Creation and Access" at DGfS 2006. Annual Conference of the German Linguistic Society (DGfS), 2006.

Abstract

Our contribution presents LIRICS (Linguistic Infrastructure for Interoperable Resources and Systems), a European project in the eContent framework. The project addresses the needs of today's information and communication society, in which globalisation and localisation make multilingual communication necessary. This creates a growing need for new standards, as well as for the urgent recognition of existing de facto standards and their transformation into de jure International Standards.

To this end, LIRICS provides ISO-ratified standards for language technology that enable the exchange and reuse of multilingual language resources. LIRICS also supports end-users in implementing these standards, offering guidelines together with an open-source implementation platform, related web services and test suites that build on legacy formats, tools and data. In terms of resources, LIRICS covers the ISO standardisation of computational lexicons and of morpho-syntactic, syntactic and semantic annotation for the language industry.

The model that LIRICS adopts for specifying and representing the annotation schemes concerned (morpho-syntax, syntax and semantics) is based on the general principles of the Linguistic Annotation Framework, an ongoing project within the ISO committee TC 37/SC 4. Those principles consider a class of semi-structured documents that can be specified by combining two elements: on the one hand, a meta-model that captures the general practices for organizing information in a given application domain, and, on the other hand, a selection of data categories (DCs) that characterize the elementary information units that can be attached to the various sub-structures (or components) of the meta-model. The components of the meta-model should thus be regarded as elementary linguistic abstractions that reflect the granularity of the description intended by the meta-model.
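The combination of a meta-model with a selection of data categories can be illustrated with a minimal Python sketch. It is not taken from LIRICS or from the Linguistic Annotation Framework: the class names (`DataCategory`, `Component`), the component name `WordForm`, and the category names and values are all hypothetical, chosen only to show how elementary information units might be attached to a component and checked against the selected categories.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataCategory:
    """An elementary information unit with a closed set of values (illustrative)."""
    name: str
    values: frozenset


@dataclass
class Component:
    """A sub-structure of a hypothetical meta-model."""
    name: str
    # Data categories selected for this component, keyed by category name.
    allowed: dict = field(default_factory=dict)
    # Values actually attached to this component instance.
    annotations: dict = field(default_factory=dict)
    # Nested components, reflecting the structure of the meta-model.
    children: list = field(default_factory=list)

    def annotate(self, category: str, value: str) -> None:
        """Attach a value, rejecting categories or values outside the selection."""
        dc = self.allowed.get(category)
        if dc is None:
            raise KeyError(f"{category!r} is not selected for component {self.name!r}")
        if value not in dc.values:
            raise ValueError(f"{value!r} is not a value of {category!r}")
        self.annotations[category] = value


# Assumed example categories; the names and value sets are made up for this sketch.
pos = DataCategory("partOfSpeech", frozenset({"noun", "verb", "adjective"}))
number = DataCategory("grammaticalNumber", frozenset({"singular", "plural"}))

word_form = Component("WordForm", allowed={pos.name: pos, number.name: number})
word_form.annotate("partOfSpeech", "noun")
word_form.annotate("grammaticalNumber", "plural")
```

In this sketch the meta-model contributes only the structure (components and their nesting), while the linguistic content comes entirely from the selected data categories, so the same component could be reused with a different selection.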