Publication
Connected Component level Multiscript Identification from Ancient Document Images
Sheikh Faisal Rashid; Faisal Shafait; Thomas Breuel
In: 9th IAPR Workshop on Document Analysis Systems. IAPR International Workshop on Document Analysis Systems (DAS-2010), June 9-11, Boston, MA, USA, online only (DAS Workshop Web page), 6/2010.
Abstract
In a multilingual optical character recognition (MOCR) environment, an MOCR system usually requires script identification of each word or character before passing it to the OCR engine. Many existing script identification techniques depend on features extracted from document images at the character, word, text-line, or block level. This feature extraction process is not robust, and features extracted for one script identification problem are not fully applicable to other script identification problems. In this paper, we present a novel and efficient technique for multi-script identification at the connected component level using a convolutional neural network. The convolutional neural network acts as a discriminative learning model, where suitable script identification features are automatically extracted and learned as convolution kernels from the raw input. We test our approach on a dataset of ancient Greek-Latin mixed-script document images. We achieve 96.37% accuracy on a test dataset at the connected component level, and this accuracy is further improved to 98.40% by using the class majority in the left-right neighboring area. The main advantage of our approach is that it can be easily adapted for the identification of other scripts, and it can give 100% accuracy at the block level.
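
The abstract does not spell out how the class majority in the left-right neighboring area is computed. The following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes that the connected components of a text line are already ordered left to right and that each one carries a script label predicted by the classifier. The function names `smooth_labels` and `block_label` and the `radius` parameter are hypothetical and chosen only for illustration.

```python
from collections import Counter


def smooth_labels(labels, radius=1):
    """Relabel each component by majority vote over its left-right neighbors.

    labels: per-component script labels, ordered left to right (assumed).
    radius: number of neighbors considered on each side (hypothetical).
    """
    smoothed = []
    for i, own in enumerate(labels):
        # Window of the component itself plus up to `radius` neighbors per side.
        window = labels[max(0, i - radius): i + radius + 1]
        counts = Counter(window)
        top = max(counts.values())
        winners = sorted(lab for lab, c in counts.items() if c == top)
        # On a tie that includes the component's own label, keep that label.
        smoothed.append(own if own in winners else winners[0])
    return smoothed


def block_label(labels):
    """Assign one script label to a whole block by simple majority."""
    return Counter(labels).most_common(1)[0][0]


if __name__ == "__main__":
    predicted = ["greek", "greek", "latin", "greek",
                 "greek", "latin", "latin", "latin"]
    print(smooth_labels(predicted, radius=1))
    print(block_label(["latin", "latin", "greek", "latin"]))
```

In this sketch an isolated misclassification surrounded by components of the other script is overwritten by its neighbors, which is the kind of correction that could account for the improvement from component-level to neighborhood-level accuracy. Voting over all components of a block in the same way shows why block-level decisions can tolerate a number of individual errors.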