Publication
Data Categories as the link between Language archives and NLP tools
Thierry Declerck; Mirjam Keßler
Abstract
LIRICS provides ISO-ratified standards for language technology to enable the exchange and reuse of multilingual language resources. It also provides guidelines that support end-users in implementing these standards, by way of an open-source implementation platform, related web services, and test suites that build on legacy formats, tools and data. On the resource side, LIRICS covers the ISO standardisation of computational lexicons and of morpho-syntactic, syntactic and semantic annotation for the language industry.
The model LIRICS adopts for specifying and representing the annotation schemes concerned (morpho-syntax, syntax and semantics) is based on the general principles of the Linguistic Annotation Framework, an ongoing project within the ISO committee TC 37/SC 4. Those principles consider a class of semi-structured documents that can be specified by combining two things: on the one hand, a meta-model that captures the general practices for organizing information in a given application domain; on the other hand, a selection of data categories (DCS) that characterize the elementary information units that can be attached to the various sub-structures (or components) of the meta-model. The components of the meta-model should be regarded as elementary linguistic abstractions that reflect the granularity of description the meta-model is intended to support.
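The combination of a meta-model with a selection of data categories can be illustrated with a minimal Python sketch. This is not the LIRICS or Linguistic Annotation Framework specification: the class names (`DataCategory`, `Component`, `AnnotationScheme`), the component kinds and the category values are all hypothetical, chosen only to show how elementary information units might be attached to meta-model components.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataCategory:
    """An elementary information unit (hypothetical example: part of speech)."""
    name: str
    values: frozenset


@dataclass
class Component:
    """A sub-structure of the meta-model, e.g. a token or a phrase."""
    kind: str
    features: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


@dataclass
class AnnotationScheme:
    """A meta-model (allowed component kinds) plus a data category selection."""
    component_kinds: set
    dcs: dict  # maps a category name to its DataCategory

    def annotate(self, component, name, value):
        # Only components known to the meta-model may carry annotations.
        if component.kind not in self.component_kinds:
            raise ValueError(f"unknown component kind: {component.kind}")
        # Only data categories in the selection may be attached.
        if name not in self.dcs:
            raise KeyError(f"data category not in selection: {name}")
        if value not in self.dcs[name].values:
            raise ValueError(f"value {value!r} not allowed for {name}")
        component.features[name] = value
        return component


# Illustrative scheme: all names and values below are assumptions.
scheme = AnnotationScheme(
    component_kinds={"token", "phrase"},
    dcs={
        "partOfSpeech": DataCategory("partOfSpeech", frozenset({"noun", "verb"})),
        "number": DataCategory("number", frozenset({"singular", "plural"})),
    },
)

token = Component(kind="token")
scheme.annotate(token, "partOfSpeech", "noun")
scheme.annotate(token, "number", "plural")
```

In this sketch the meta-model fixes which components exist, while the data category selection fixes what may be said about them; two schemes could share the same meta-model and differ only in their selection.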