TransIns: Document Translation with Markup Reinsertion

Jörg Steffen, Josef van Genabith

In: Heike Adel, Shuming Shi (editors). Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Conference on Empirical Methods in Natural Language Processing (EMNLP-2021), November 7-11, Punta Cana, Dominican Republic. Pages 28-34. Association for Computational Linguistics, 11/2021.


Many use cases require machine translation (MT) to handle not just raw text but complex formatted documents (e.g. websites, slides, spreadsheets), with the translation preserving the original formatting. This is challenging because markup can be nested, can apply to spans that are contiguous in the source but non-contiguous in the target, and so on. Here we present TransIns, a system for translating non-plain-text documents that builds on the Okapi framework and MT models trained with Marian NMT. We develop, implement and evaluate different strategies for reinserting markup into translated sentences using token alignments between source and target sentences. We propose a simple and effective strategy that compiles all markup down to single source tokens and transfers it to the aligned target tokens. A first evaluation shows that this strategy yields highly accurate markup in the translated documents, outperforming the markup quality found in documents translated with popular translation services. We release TransIns under the MIT License as open-source software; an online demonstrator is also available.
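
The alignment-based strategy described in the abstract can be illustrated with a minimal Python sketch. This is not code from TransIns (which builds on Okapi); all function names, the span representation, and the example sentence and alignment are assumptions made for illustration. The sketch assigns every source token the list of tags that cover it, copies those tags to the aligned target tokens, and then opens and closes tags wherever the tag list changes between neighbouring target tokens.

```python
from typing import List, Tuple

# Hypothetical representation: (tag name, first token index, end index exclusive).
Span = Tuple[str, int, int]


def tags_per_source_token(n_src: int, spans: List[Span]) -> List[List[str]]:
    """Compile markup spans down to a tag list for each single source token."""
    per_token: List[List[str]] = [[] for _ in range(n_src)]
    for tag, start, end in spans:  # assumed to be listed outermost first
        for i in range(start, end):
            per_token[i].append(tag)
    return per_token


def transfer_tags(src_tags: List[List[str]],
                  alignment: List[Tuple[int, int]],
                  n_tgt: int) -> List[List[str]]:
    """Copy each source token's tags to the target tokens it is aligned with."""
    tgt_tags: List[List[str]] = [[] for _ in range(n_tgt)]
    for s, t in sorted(alignment):
        for tag in src_tags[s]:
            if tag not in tgt_tags[t]:
                tgt_tags[t].append(tag)
    return tgt_tags


def render(tgt_tokens: List[str], tgt_tags: List[List[str]]) -> str:
    """Emit opening/closing tags wherever the tag list changes between tokens."""
    pieces: List[str] = []
    open_tags: List[str] = []
    for token, tags in zip(tgt_tokens, tgt_tags):
        # Length of the common prefix of currently open tags and needed tags.
        keep = 0
        while (keep < len(open_tags) and keep < len(tags)
               and open_tags[keep] == tags[keep]):
            keep += 1
        closing = "".join(f"</{t}>" for t in reversed(open_tags[keep:]))
        if closing:
            pieces[-1] += closing  # attach closing tags to the previous token
        opening = "".join(f"<{t}>" for t in tags[keep:])
        pieces.append(opening + token)
        open_tags = list(tags)
    closing = "".join(f"</{t}>" for t in reversed(open_tags))
    if closing:
        pieces[-1] += closing
    return " ".join(pieces)


def reinsert_markup(n_src: int, spans: List[Span],
                    alignment: List[Tuple[int, int]],
                    tgt_tokens: List[str]) -> str:
    src_tags = tags_per_source_token(n_src, spans)
    tgt_tags = transfer_tags(src_tags, alignment, len(tgt_tokens))
    return render(tgt_tokens, tgt_tags)


# Made-up example: "Click the <b>red button</b> now" with a hand-written alignment.
target = ["Cliquez", "maintenant", "sur", "le", "bouton", "rouge"]
align = [(0, 0), (1, 3), (2, 5), (3, 4), (4, 1)]
print(reinsert_markup(5, [("b", 2, 4)], align, target))
# prints: Cliquez maintenant sur le <b>bouton rouge</b>
```

When the aligned target tokens are not adjacent, this sketch simply emits the tag pair more than once, which is one plausible way to handle spans that are contiguous in the source but non-contiguous in the target. Unaligned target tokens receive no markup here; a real system would need a policy for them.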


Further Links

TransIns_EMNLP_2021.pdf (pdf, 297 KB)

German Research Center for Artificial Intelligence
Deutsches Forschungszentrum für Künstliche Intelligenz