Publication
Geometric Layout Analysis of Scanned Documents
Faisal Shafait
PhD thesis, Technische Universität Kaiserslautern, 5/2008.
Abstract
Layout analysis, the division of page images into text blocks and lines and the determination
of their reading order, is a major performance-limiting step in large-scale document digitization
projects. This thesis addresses this problem in several ways: it presents new
performance measures to identify important classes of layout errors, evaluates the performance
of state-of-the-art layout analysis algorithms, presents a number of methods to
reduce the error rate and catastrophic failures occurring during layout analysis, and develops
a statistically motivated, trainable layout analysis system that addresses the needs
of large-scale document analysis applications. An overview of the key contributions of
this thesis is as follows.
First, this thesis presents an efficient local adaptive thresholding algorithm that yields
the same quality of binarization as that of state-of-the-art local binarization methods,
but runs in time close to that of global thresholding methods, independent of the local
window size. Tests on the UW-1 dataset demonstrate a 20-fold speedup compared to
traditional local thresholding techniques.
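The key idea behind window-size-independent local thresholding can be sketched as follows: window sums of pixel values and of their squares are read off two integral images, so the local mean and standard deviation, and hence a Sauvola-style threshold, cost the same regardless of window size. The function below is an illustrative sketch, not the thesis implementation; the parameter names and defaults (`w`, `k`, `R`) are assumptions.

```python
import numpy as np

def sauvola_binarize(img, w=15, k=0.3, R=128.0):
    """Local adaptive thresholding in O(1) time per pixel.

    Local window mean and variance come from integral images, so the
    cost is independent of the window size w (Sauvola-style threshold;
    parameter defaults are illustrative, not the thesis's values).
    """
    img = img.astype(np.float64)
    h, wd = img.shape
    # Integral images of pixel values and squared values, padded with a
    # zero row/column so window sums need no special-case indexing.
    s1 = np.pad(img.cumsum(0).cumsum(1), ((1, 0), (1, 0)))
    s2 = np.pad((img ** 2).cumsum(0).cumsum(1), ((1, 0), (1, 0)))
    r = w // 2
    ys, xs = np.mgrid[0:h, 0:wd]
    y0, y1 = np.clip(ys - r, 0, h), np.clip(ys + r + 1, 0, h)
    x0, x1 = np.clip(xs - r, 0, wd), np.clip(xs + r + 1, 0, wd)
    area = (y1 - y0) * (x1 - x0)
    # Each window sum is four lookups in the integral image.
    wsum = s1[y1, x1] - s1[y0, x1] - s1[y1, x0] + s1[y0, x0]
    wsq = s2[y1, x1] - s2[y0, x1] - s2[y1, x0] + s2[y0, x0]
    mean = wsum / area
    std = np.sqrt(np.maximum(wsq / area - mean ** 2, 0))
    thresh = mean * (1 + k * (std / R - 1))
    return (img > thresh).astype(np.uint8)  # 1 = background, 0 = ink
```

Because the four-lookup window sum replaces an explicit scan of the window, enlarging `w` changes the result but not the running time.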
Then, this thesis presents a new perspective for document image cleanup. Instead of
trying to explicitly detect and remove marginal noise, the approach focuses on locating
the page frame, i.e., the area containing the actual page contents. A geometric matching algorithm
is presented to extract the page frame of a structured document. It is demonstrated
that incorporating a page frame detection step into the document processing chain results in a
reduction in OCR error rates from 4.3% to 1.7% (n = 4,831,618 characters) on the UW-III
dataset, and in layout-based retrieval error rates from 7.5% to 5.3% (n = 815 documents)
on the MARG dataset.
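One way to picture page frame detection from text-line bounding boxes is edge voting: text lines in the page body tend to align on a common left (and often right) edge, while marginal noise rarely does. The sketch below is a simplified illustration under that assumption; the thesis optimizes a richer geometric quality function over the frame parameters.

```python
def page_frame(line_boxes, tol=5):
    """Estimate the page frame from text-line boxes (x0, y0, x1, y1).

    Candidate left/right edges are scored by how many text lines start
    or end within `tol` pixels of them; marginal noise rarely aligns
    with the text body, so it is excluded. Simplified edge-voting
    sketch, not the thesis's geometric matching algorithm.
    """
    def best_edge(coords):
        # choose the coordinate supported by the most near-aligned lines
        return max(coords, key=lambda c: sum(abs(c - d) <= tol for d in coords))

    left = best_edge([b[0] for b in line_boxes])
    right = best_edge([b[2] for b in line_boxes])
    # keep only lines consistent with the voted edges
    aligned = [b for b in line_boxes
               if abs(b[0] - left) <= tol or abs(b[2] - right) <= tol]
    top = min(b[1] for b in aligned)
    bottom = max(b[3] for b in aligned)
    return left, top, right, bottom
```

Downstream stages (OCR, retrieval) then only see components inside the returned rectangle, which is how noise removal falls out of frame detection.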
The performance of six widely used page segmentation algorithms (x-y cut, smearing,
whitespace analysis, constrained text-line finding, docstrum, and Voronoi) on the UW-III
database is evaluated in this work using a state-of-the-art evaluation methodology.
It is shown that current evaluation scores are insufficient for diagnosing specific errors
in page segmentation and fail to identify some classes of serious segmentation errors
altogether. Thus, a vectorial score is introduced that is sensitive to, and identifies, the
most important classes of segmentation errors (over-, under-, and mis-segmentation) and
which page components (lines, blocks, etc.) are affected. Unlike previous schemes, this
evaluation method has a canonical representation of ground truth data and guarantees
pixel-accurate evaluation results for arbitrary region shapes. Based on a detailed analysis
of the errors made by different page segmentation algorithms, this thesis presents a novel
combination of the line-based approach by Breuel [Bre02c] with the area-based approach
of Baird [Bai94], which solves the over-segmentation problem in area-based approaches.
This new approach achieves a mean text-line extraction error rate of 4.4% (n = 878
documents) on the UW-III dataset, which is the lowest among the analyzed algorithms.
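The error classes such a vectorial score distinguishes can be illustrated by reducing it to three counts derived from the pixel-overlap table between ground-truth lines and detected segments: a line with significant overlap against several segments is over-segmented, several lines sharing one segment are merged (under-segmented), and a line with no significant overlap is missed. The sketch below is a hedged reduction of this idea; the 10% significance threshold is an assumption, not the thesis's exact criterion.

```python
import numpy as np
from collections import defaultdict

def segmentation_errors(gt_labels, seg_labels, thresh=0.1):
    """Count (over-segmented, merged, missed) ground-truth lines.

    gt_labels / seg_labels: integer arrays of the same shape labelling
    each foreground pixel with its ground-truth line / detected segment
    (0 = background). Illustrative reduction of a vectorial evaluation
    score to three error counts; the significance threshold is assumed.
    """
    overlap = defaultdict(int)
    for g, s in zip(gt_labels.ravel(), seg_labels.ravel()):
        if g and s:
            overlap[(g, s)] += 1
    gt_size, seg_size = defaultdict(int), defaultdict(int)
    for g in gt_labels.ravel():
        if g:
            gt_size[g] += 1
    for s in seg_labels.ravel():
        if s:
            seg_size[s] += 1
    # keep only significant overlaps (ignore tiny boundary effects)
    sig = {(g, s) for (g, s), n in overlap.items()
           if n >= thresh * gt_size[g] and n >= thresh * seg_size[s]}
    segs_per_line, lines_per_seg = defaultdict(set), defaultdict(set)
    for g, s in sig:
        segs_per_line[g].add(s)
        lines_per_seg[s].add(g)
    over = sum(1 for g in gt_size if len(segs_per_line[g]) > 1)
    merged = set().union(
        *([ls for ls in lines_per_seg.values() if len(ls) > 1] or [set()]))
    missed = sum(1 for g in gt_size if not segs_per_line[g])
    return over, len(merged), missed
```

Reporting the three counts separately, rather than a single accuracy number, is what makes the different failure modes of segmentation algorithms visible.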
This thesis also describes a simple, fast, and accurate system for document image
zone classification that results from a detailed comparative analysis of the performance of
widely used features in document analysis and content-based image retrieval. Using a
novel combination of known algorithms, an error rate of 1.46% (n = 13,811 zones) is
achieved on the UW-III dataset, compared to a state-of-the-art system that reports
an error rate of 1.55% (n = 24,177 zones) using more complicated techniques.
In addition to layout analysis of Roman script documents, this work also presents
the first high-performance layout analysis method for Urdu script. For that purpose a
geometric text-line model for Urdu script is presented. It is shown that the method can
accurately extract Urdu text-lines from documents with different layouts such as prose books,
poetry books, magazines, and newspapers.
Finally, this thesis presents a novel algorithm for probabilistic layout analysis that
specifically addresses the needs of large-scale digitization projects. The presented approach
models known page layouts as a structural mixture model. A probabilistic matching
algorithm is presented that gives multiple interpretations of the input layout with associated
probabilities. An algorithm based on A* search is presented for finding the most
likely layout of a page, given its structural layout model. For training layout models,
an EM-like algorithm is presented that is capable of learning the geometric variability
of layout structures from data, without the need for a page segmentation ground-truth.
Evaluation of the algorithm on documents from the MARG dataset shows an accuracy
above 95% for geometric layout analysis.
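Casting most-likely-layout search as A* is natural once edge costs are taken to be negative log probabilities: the cheapest path to a goal state is then the most probable interpretation under the layout model. The skeleton below is a generic illustration of this reduction, not the thesis's search implementation; the state space, successor function, and admissible heuristic are left abstract.

```python
import heapq
import itertools

def astar(start, is_goal, successors, heuristic):
    """Generic A* search returning (path, cost).

    `successors(state)` yields (next_state, cost) pairs, where each
    cost is a negative log probability, and `heuristic` must never
    overestimate the remaining cost. Illustrative skeleton for casting
    most-likely-layout search as shortest-path search.
    """
    tie = itertools.count()  # tiebreaker so states are never compared
    frontier = [(heuristic(start), next(tie), 0.0, start, [start])]
    best = {}
    while frontier:
        _, _, g, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path, g  # cheapest = most probable interpretation
        if state in best and best[state] <= g:
            continue  # already expanded with a cheaper cost
        best[state] = g
        for nxt, cost in successors(state):
            heapq.heappush(frontier,
                           (g + cost + heuristic(nxt), next(tie),
                            g + cost, nxt, path + [nxt]))
    return None, float("inf")
```

With an admissible heuristic, the first goal state popped from the priority queue is guaranteed optimal, so the search can stop without enumerating all interpretations.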