Publication
T-FREE: Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings
Björn Deiseroth; Manuel Brack; Patrick Schramowski; Kristian Kersting; Samuel Weinbach
In: Computing Research Repository eprint Journal (CoRR), Vol. abs/2406.19223, Pages 1-23, arXiv, 2024.
Abstract
Tokenizers are crucial for encoding information in Large Language Models, but their development has recently stagnated, and they contain inherent weaknesses. Major limitations include computational overhead, ineffective vocabulary use, and unnecessarily large embedding and head layers. Additionally, their performance is biased towards a reference corpus, leading to reduced effectiveness for underrepresented languages. To remedy these issues, we propose T-FREE, which directly embeds words through sparse activation patterns over character triplets and does not require a reference corpus. T-FREE inherently exploits morphological similarities and allows for strong compression of embedding layers. In our exhaustive experimental evaluation, we achieve competitive downstream performance with a parameter reduction of more than 85% on these layers. Further, T-FREE shows significant improvements in cross-lingual transfer learning.
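
The idea of embedding a word through a sparse activation pattern over character triplets can be illustrated with a minimal sketch. This is not the authors' implementation: the whitespace padding, the SHA-256 hashing, the table size of 8000, the two hashes per triplet, and the function names `trigrams`, `active_indices`, and `overlap` are all assumptions made for illustration. The sketch only shows why morphologically similar words end up sharing active entries without any reference corpus.

```python
import hashlib


def trigrams(word):
    """Split a word into overlapping character triplets.

    Assumption: the word is padded with one space on each side so that
    prefixes and suffixes get their own triplets.
    """
    padded = f" {word} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def active_indices(word, table_size=8000, hashes_per_trigram=2):
    """Map a word to a sparse set of active rows in an embedding table.

    Each triplet is hashed `hashes_per_trigram` times (hypothetical
    parameter) into the range [0, table_size). No vocabulary or
    reference corpus is consulted.
    """
    indices = set()
    for tri in trigrams(word.lower()):
        for seed in range(hashes_per_trigram):
            digest = hashlib.sha256(f"{seed}:{tri}".encode("utf-8")).digest()
            indices.add(int.from_bytes(digest[:8], "big") % table_size)
    return sorted(indices)


def overlap(a, b, **kwargs):
    """Jaccard overlap between the activation patterns of two words."""
    sa = set(active_indices(a, **kwargs))
    sb = set(active_indices(b, **kwargs))
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    print(trigrams("token"))
    print(active_indices("token"))
    print(overlap("token", "tokens"), overlap("token", "bridge"))
```

In a model, the embedding of a word would then be an aggregate (for example the sum) of the table rows at its active indices, so the table needs only `table_size` rows rather than one row per vocabulary entry. Words such as "token" and "tokens" share most of their triplets and therefore most of their active rows, which is one way to read the abstract's claim about exploiting morphological similarities.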
