Publication

Semi-Automated OCR Database Generation for Complex Scripts

A. Ul-Hasan; Syed Saqib Bukhari; Sheikh Faisal Rashid; Faisal Shafait; Thomas Breuel

In: 21st International Conference on Pattern Recognition (ICPR 2012), November 11-15, 2012, Tsukuba Science City, Japan. IEEE, 2012.

Abstract

A large amount of real-world data is required to train and benchmark any character recognition algorithm. Developing a page-level ground-truth database for this purpose is overwhelmingly laborious, as it involves a great deal of manual effort to produce a reasonable database that covers all possible words of a language. Moreover, generating such a database for historical (degraded) documents or for a cursive script like Urdu is even more complex and grueling. The presented work attempts to solve this problem by proposing a semi-automated technique for generating a ground-truth database. The proposed automation is expected to greatly reduce the manual effort needed to develop any OCR database. The basic idea is to apply ligature clustering prior to manual labeling. Two prototype datasets for Urdu script have been developed using the proposed technique, and the results are also presented.