Publication
Automated Ground Truth Data Generation for Newspaper Document Images
Thomas Strecker; Joost van Beusekom; Sahin Albayrak; Thomas Breuel
In: Proceedings of the 10th International Conference on Document Analysis and Recognition. International Conference on Document Analysis and Recognition (ICDAR-09), July 26-29, Barcelona, Spain, IEEE, 2009.
Abstract
In document image understanding, public datasets with ground truth are an important part of scientific work. They are not only helpful for developing new methods, but also provide a way of comparing performance. Generating these datasets, however, is time-consuming and cost-intensive work that requires a lot of manual effort. In this paper we both propose a way to semi-automatically generate ground-truthed datasets for newspapers and provide a comprehensive dataset. The focus of this paper is layout analysis ground truth. The proposed two-step approach consists of a module that automatically creates layouts and an image matching module that maps the ground truth information from the synthetic layout to the scanned version. In the first step, layouts are generated automatically from a news corpus. The output consists of a digital newspaper (PDF file) and an XML file containing geometric and logical layout information. In the second step, the PDF files are printed and scanned, and the scans are aligned with the synthetic images obtained by rendering the PDFs. Finally, the geometric and logical layout ground truth is mapped onto the scanned image.
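The final mapping step can be illustrated with a minimal sketch. It assumes the alignment between the rendered and scanned images has already been estimated and can be expressed as a 2x3 affine transform. The abstract does not specify the transform model, so this is an assumption, and the names `map_point` and `map_box` are hypothetical. Each ground-truth bounding box is mapped by transforming its four corners and taking the axis-aligned box that encloses them.

```python
from typing import Tuple

# (x0, y0, x1, y1) in pixel coordinates of the rendered (synthetic) image.
Box = Tuple[float, float, float, float]
# Assumed alignment model: rows of a 2x3 affine matrix [[a, b, tx], [c, d, ty]].
Affine = Tuple[Tuple[float, float, float], Tuple[float, float, float]]


def map_point(t: Affine, x: float, y: float) -> Tuple[float, float]:
    """Apply the affine transform to a single point."""
    (a, b, tx), (c, d, ty) = t
    return a * x + b * y + tx, c * x + d * y + ty


def map_box(t: Affine, box: Box) -> Box:
    """Map a ground-truth box into scan coordinates.

    All four corners are transformed, then enclosed in an axis-aligned box,
    so a slight rotation or skew of the scan is still covered.
    """
    x0, y0, x1, y1 = box
    corners = [map_point(t, x, y) for x in (x0, x1) for y in (y0, y1)]
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return min(xs), min(ys), max(xs), max(ys)


if __name__ == "__main__":
    # Hypothetical alignment: the scan is shifted by (+5, -3) pixels.
    shift = ((1.0, 0.0, 5.0), (0.0, 1.0, -3.0))
    print(map_box(shift, (10.0, 20.0, 110.0, 70.0)))
```

With the pure shift used in the example, the box (10, 20, 110, 70) maps to (15, 17, 115, 67). A real print-and-scan pipeline may need a richer model (for example a projective or locally varying transform) if the affine assumption does not hold.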