Layout Analysis for Arabic Historical Document Images Using Machine Learning

Syed Saqib Bukhari, Abedelkadir Asi, Jihad El-Sana, Thomas Breuel

In: 13th International Conference on Frontiers in Handwriting Recognition (ICFHR-2012), September 18-20, Bari, Italy. IEEE, 2012.


Page layout analysis is a fundamental step of any document image understanding system. We introduce an approach that segments text appearing in page margins (a.k.a. side-note text) from manuscripts with complex layout formats. Simple and discriminative features are extracted at the connected-component level, and robust feature vectors are then generated. A multilayer perceptron classifier is used to assign each connected component to the relevant text class. A voting scheme is then applied to refine the resulting segmentation and produce the final classification. In contrast to state-of-the-art segmentation approaches, this method is independent of block segmentation as well as pixel-level analysis. The proposed method has been trained and tested on a dataset that contains a variety of complex side-note layout formats, achieving a segmentation accuracy of about 95%.
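
The refinement step can be illustrated with a minimal sketch. The abstract does not specify the voting scheme, so the code below assumes a simple nearest-neighbour majority vote over component centroids; the function name `refine_labels_by_voting`, the parameter `k`, and the labels `"main"` and `"side"` are hypothetical and not taken from the paper.

```python
from collections import Counter
from math import hypot


def refine_labels_by_voting(centroids, labels, k=3):
    """Relabel each component by majority vote over itself and its k nearest neighbours.

    Assumption: this is an illustrative voting rule, not the paper's exact scheme.
    centroids: list of (x, y) centres of connected components.
    labels: initial per-component class labels, e.g. from a classifier.
    """
    refined = []
    for i, (xi, yi) in enumerate(centroids):
        # Rank all other components by Euclidean distance between centroids.
        neighbours = sorted(
            (j for j in range(len(centroids)) if j != i),
            key=lambda j: hypot(centroids[j][0] - xi, centroids[j][1] - yi),
        )[:k]
        votes = Counter(labels[j] for j in neighbours)
        votes[labels[i]] += 1  # the component's own label counts as one vote
        top_label, top_count = votes.most_common(1)[0]
        # Keep the original label unless another label strictly outvotes it.
        refined.append(labels[i] if votes[labels[i]] == top_count else top_label)
    return refined


# Toy page: four main-body components (one mislabelled) and four side-note components.
centroids = [(0, 0), (1, 0), (0, 1), (1, 1), (100, 0), (101, 0), (100, 1), (101, 1)]
labels = ["main", "main", "side", "main", "side", "side", "side", "side"]
print(refine_labels_by_voting(centroids, labels, k=3))
```

In this toy example the isolated "side" label inside the main-body cluster is outvoted by its neighbours, while the side-note cluster is left unchanged.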

German Research Center for Artificial Intelligence
Deutsches Forschungszentrum für Künstliche Intelligenz