Publication
Combining OCR Outputs for Logical Document Structure Markup. Technical Background to the ACL 2012 Contributed Task
Ulrich Schäfer; Benjamin Weitz
In: Proceedings of the ACL-2012 Main Conference Workshop on Rediscovering 50 Years of Discoveries. Annual Meeting of the Association for Computational Linguistics (ACL-2012), July 10, Jeju Island, Republic of Korea, pages 104-109, ISBN 978-1-937284-29-9, Association for Computational Linguistics, 7/2012.
Abstract
We describe how paperXML, a logical document structure markup for scholarly articles, is generated on the basis of OCR tool outputs. PaperXML was initially developed for the ACL Anthology Searchbench. The main purpose was to robustly provide uniform access to sentences in ACL Anthology papers from the past 46 years, ranging from scanned, typewriter-written conference and workshop proceedings papers up to recent high-quality typeset, born-digital journal articles, with varying layouts. PaperXML markup includes information on page and paragraph breaks, section headings, footnotes, tables, captions, boldface and italics character styles, as well as bibliographic and publication metadata. The role of paperXML in the ACL Contributed Task Rediscovering 50 Years of Discoveries is to serve as a fall-back source (1) for older, scanned papers (mostly published before the year 2000), for which born-digital PDF sources are not available, (2) for born-digital PDF papers on which the PDFExtract method failed, and (3) for document parts where PDFExtract does not output useful markup, such as currently for tables. We sketch the transformation of paperXML into the ACL Contributed Task's TEI P5 XML.
Projects
Deependance - Deep Dependency-Oriented Analysis with Non-Discrete Constraints,
TAKE - Technologies for Advanced Knowledge Extraction