Publication
OCR-Free Table of Contents Detection in Urdu Books
Adnan Ul Hasan; Syed Saqib Bukhari; Faisal Shafait; Thomas Breuel
In: IAPR International Workshop on Document Analysis Systems (DAS-12), 10th, March 27-29, Gold Coast, Queensland, Australia, IEEE, 3/2012.
Abstract
A Table of Contents (ToC) is an integral part of
multi-page documents such as books and magazines. Most of
the existing techniques use textual similarity for automatically
detecting ToC pages. However, such techniques cannot be
applied to detect ToC pages when OCR
technology is not available, as is the case for historical
documents and many modern Nabataean (Arabic) and Indic
scripts. It is, therefore, necessary to develop tools to navigate
through such documents without the use of OCR. This paper
reports a preliminary effort to address this challenge. The
proposed algorithm has been applied to find ToC
pages in Urdu books, achieving an overall initial accuracy of
88%.