Publication
T-FREE: Subword Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings
Björn Deiseroth; Manuel Brack; Patrick Schramowski; Kristian Kersting; Samuel Weinbach
In: Yaser Al-Onaizan; Mohit Bansal; Yun-Nung Chen (eds.). Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024. Conference on Empirical Methods in Natural Language Processing (EMNLP), Pages 21829-21851, Association for Computational Linguistics, 2024.
Abstract
Tokenizers are crucial for encoding information in Large Language Models, but their development has recently stagnated, and they contain inherent weaknesses. Major limitations include computational overhead, ineffective vocabulary use, and unnecessarily large embedding and head layers. Additionally, their performance is biased towards a reference corpus, leading to reduced effectiveness for underrepresented languages. To remedy these issues, we propose T-FREE, which directly embeds words through sparse activation patterns over character triplets, and does not require a reference corpus. T-FREE inherently exploits morphological similarities and allows for strong compression of embedding layers. In our exhaustive experimental evaluation, we achieve competitive downstream performance with a parameter reduction of more than 85% on these layers. Further, T-FREE shows significant improvements in cross-lingual transfer learning.
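To make the core idea concrete, the following is a minimal illustrative sketch of how a word can be mapped to a sparse activation pattern over character triplets via hashing, so that morphologically similar words share most of their active embedding indices. This is not the paper's implementation; the function name, padding convention, hash choice, and table size are all hypothetical.

```python
import hashlib

def trigram_activations(word, vocab_size=8192):
    """Hypothetical sketch: hash each character triplet of a word
    to an index in a fixed-size embedding table, producing a
    sparse multi-hot activation pattern (no reference corpus needed)."""
    padded = f"_{word.lower()}_"  # assumed marker for word boundaries
    trigrams = [padded[i:i + 3] for i in range(len(padded) - 2)]
    active = set()
    for tri in trigrams:
        h = hashlib.md5(tri.encode()).hexdigest()
        active.add(int(h, 16) % vocab_size)
    return sorted(active)

# Morphologically similar words overlap in most of their activations,
# so their embeddings (sums of the active rows) are naturally close:
a = set(trigram_activations("tokenizer"))
b = set(trigram_activations("tokenizers"))
jaccard = len(a & b) / len(a | b)
```

In this sketch, a word's embedding would be the sum of the embedding-table rows at its active indices; because "tokenizer" and "tokenizers" share almost all of their trigrams, their activation sets overlap heavily, which is one intuition behind the morphological similarity and embedding-layer compression claimed in the abstract.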
