Publication
Document Image Segmentation using Discriminative Learning over Connected Components
Syed Saqib Bukhari; Mayce Al-Azawi; Faisal Shafait; Thomas Breuel
In: 9th IAPR Workshop on Document Analysis Systems. IAPR International Workshop on Document Analysis Systems (DAS-2010), June 9-11, Boston, MA, USA, ACM, 6/2010.
Abstract
Segmentation of a document image into text and non-text regions is an important preprocessing step for a variety of document image analysis tasks, such as improving OCR and document compression. Most state-of-the-art document image segmentation approaches perform segmentation using pixel-based or zone (block)-based classification. Pixel-based classification approaches are time-consuming, whereas block-based methods depend heavily on the accuracy of the block segmentation step. In contrast to these approaches, our segmentation approach introduces connected-component-based classification and therefore does not require a prior block segmentation. We train a self-tunable multi-layer perceptron (MLP) classifier to distinguish between text and non-text connected components, using shape and context information as the feature vector. Experimental results demonstrate the effectiveness of the proposed algorithm. We have evaluated our method on a subset of UW-III, on the ICDAR 2009 page segmentation competition test images, and on a circuit diagram dataset, and compared its results with the state-of-the-art Leptonica page segmentation algorithm.
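
As a rough illustration of what classification over connected components involves, the sketch below labels the 8-connected foreground components of a tiny binary image and computes a few simple shape descriptors for each one. It is a minimal sketch under stated assumptions: the function names, the toy `page` image, and the chosen descriptors (bounding-box size, aspect ratio, fill density) are hypothetical and are not the feature vector used in the paper. The context features and the MLP classifier are omitted; in a full pipeline, descriptors like these would be passed to a trained classifier.

```python
from collections import deque


def connected_components(image):
    """Return the 8-connected foreground components of a binary image.

    `image` is a list of rows holding 0 (background) or 1 (foreground).
    Each component is returned as a list of (row, col) pixel coordinates.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def shape_features(pixels):
    """Compute illustrative (assumed) shape descriptors for one component."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    height = max(ys) - min(ys) + 1
    width = max(xs) - min(xs) + 1
    return {
        "width": width,
        "height": height,
        "aspect_ratio": width / height,
        # Fraction of the bounding box covered by foreground pixels.
        "fill_density": len(pixels) / (width * height),
    }


# Toy image (hypothetical): a 2x2 blob and a 1x3 vertical stroke.
page = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
]

components = connected_components(page)
features = [shape_features(pixels) for pixels in components]
```

Because each component is described independently, no block segmentation is needed before classification, which is the property the abstract highlights.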