
Publication

T-FREE: Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings

Björn Deiseroth; Manuel Brack; Patrick Schramowski; Kristian Kersting; Samuel Weinbach
In: Computing Research Repository (CoRR), Vol. abs/2406.19223, pp. 1-23, arXiv, 2024.

Abstract

Tokenizers are crucial for encoding information in Large Language Models, but their development has recently stagnated, and they contain inherent weaknesses. Major limitations include computational overhead, ineffective vocabulary use, and unnecessarily large embedding and head layers. Additionally, their performance is biased towards a reference corpus, leading to reduced effectiveness for underrepresented languages. To remedy these issues, we propose T-FREE, which directly embeds words through sparse activation patterns over character triplets, and does not require a reference corpus. T-FREE inherently exploits morphological similarities and allows for strong compression of embedding layers. In our exhaustive experimental evaluation, we achieve competitive downstream performance with a parameter reduction of more than 85% on these layers. Further, T-FREE shows significant improvements in cross-lingual transfer learning.
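To make the core idea concrete, here is a minimal sketch of embedding a word via sparse activations over its character triplets. This is not the authors' implementation: the boundary markers, the hashing scheme, and the aggregation by summation are illustrative assumptions; the paper's actual design may differ.

```python
import hashlib
import numpy as np

def char_trigrams(word: str) -> list[str]:
    # Pad with boundary markers so prefixes/suffixes get their own triplets.
    padded = f"<{word}>"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def trigram_indices(word: str, table_size: int) -> list[int]:
    # Hash each triplet into a fixed-size index space (hypothetical scheme);
    # each word activates only a small, sparse set of rows.
    return sorted({
        int(hashlib.md5(t.encode()).hexdigest(), 16) % table_size
        for t in char_trigrams(word)
    })

def embed_word(word: str, table: np.ndarray) -> np.ndarray:
    # The word embedding is the sum of its sparsely activated rows.
    idx = trigram_indices(word, table.shape[0])
    return table[idx].sum(axis=0)

rng = np.random.default_rng(0)
table = rng.normal(size=(4096, 8))  # small demo embedding table
print(char_trigrams("free"))            # ['<fr', 'fre', 'ree', 'ee>']
print(embed_word("free", table).shape)  # (8,)
```

Note how morphologically related words such as "free" and "freed" share most of their triplets, so their activation patterns, and hence their embeddings, overlap by construction; no reference corpus is needed to define the vocabulary.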
