Semi-Automated OCR Database Generation for Complex Scripts

A. Ul-Hasan, Syed Saqib Bukhari, Sheikh Faisal Rashid, Faisal Shafait, Thomas Breuel

In: 21st International Conference on Pattern Recognition (ICPR 2012), November 11-15, 2012, Tsukuba Science City, Japan. IEEE, 2012.


A large amount of real-world data is required to train and benchmark any character recognition algorithm. Developing a page-level ground-truth database for this purpose is overwhelmingly laborious, as it takes considerable manual effort to produce a reasonable database that covers all possible words of a language. Moreover, generating such a database for historical (degraded) documents or for a cursive script like Urdu is even more complex and grueling. The presented work attempts to solve this problem by proposing a semi-automated technique for generating a ground-truth database. The proposed automation is expected to greatly reduce the manual effort needed to develop an OCR database. The basic idea is to apply ligature clustering prior to manual labeling. Two prototype datasets for Urdu script have been developed using the proposed technique, and the results are also presented.
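The idea of clustering before labeling can be illustrated with a minimal sketch. It is not the authors' implementation: the flattened binary image representation, the Hamming distance, the greedy clustering rule, and the names `cluster_ligatures` and `propagate_labels` are all assumptions chosen for illustration. The point it shows is that a human labels one representative per cluster, and the label is then copied to every member.

```python
from typing import Dict, List, Sequence

# Hypothetical representation: a ligature image flattened to 0/1 pixels.
Image = Sequence[int]


def hamming(a: Image, b: Image) -> int:
    """Count pixel positions where two equally sized images differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_ligatures(images: List[Image], max_dist: int) -> List[List[int]]:
    """Greedy clustering (an assumed, simplistic rule).

    Each image joins the first cluster whose representative (its first
    member) lies within max_dist; otherwise it starts a new cluster.
    Returns clusters as lists of image indices.
    """
    clusters: List[List[int]] = []
    for idx, img in enumerate(images):
        for members in clusters:
            if hamming(images[members[0]], img) <= max_dist:
                members.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters


def propagate_labels(
    clusters: List[List[int]], cluster_labels: List[str]
) -> Dict[int, str]:
    """Copy one manually assigned label per cluster to all its members."""
    return {
        idx: label
        for members, label in zip(clusters, cluster_labels)
        for idx in members
    }


# Toy data: four 2x2 "ligature images", flattened.
images = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]
clusters = cluster_ligatures(images, max_dist=1)
# A human would inspect one image per cluster; the labels are placeholders.
labels = propagate_labels(clusters, ["lig_a", "lig_b"])
```

In this toy case four images need only two manual labels. A real pipeline would need shape features and a clustering method that tolerate the noise and size variation found in scanned text.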

German Research Center for Artificial Intelligence
Deutsches Forschungszentrum für Künstliche Intelligenz