Publication
Enhancing Chinese Word Segmentation Using Unlabeled Data
Weiwei Sun; Jia Xu
In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing (EMNLP-2011), July 27-31, Edinburgh, Scotland, United Kingdom, Pages 970-979, ACL, 7/2011.
Abstract
This paper investigates improving supervised word segmentation accuracy with unlabeled data. Both large-scale in-domain data and small-scale document text are considered. We present a unified solution for incorporating features derived from unlabeled data into a discriminative learning model. For the large-scale data, we derive string statistics from Gigaword to assist a character-based segmenter. In addition, we introduce the idea of transductive, document-level segmentation, which is designed to improve the system recall for out-of-vocabulary (OOV) words that appear more than once inside a document. The novel features result in relative error reductions of 13.8% and 15.4% in terms of F-score and the recall of OOV words, respectively.