
Publication

T-FREE: Subword Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings

Björn Deiseroth; Manuel Brack; Patrick Schramowski; Kristian Kersting; Samuel Weinbach
In: Yaser Al-Onaizan; Mohit Bansal; Yun-Nung Chen (Eds.). Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024. Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 21829-21851, Association for Computational Linguistics, 2024.

Abstract

Tokenizers are crucial for encoding information in Large Language Models, but their development has recently stagnated, and they contain inherent weaknesses. Major limitations include computational overhead, ineffective vocabulary use, and unnecessarily large embedding and head layers. Additionally, their performance is biased towards a reference corpus, leading to reduced effectiveness for underrepresented languages. To remedy these issues, we propose T-FREE, which directly embeds words through sparse activation patterns over character triplets and does not require a reference corpus. T-FREE inherently exploits morphological similarities and allows for strong compression of embedding layers. In our exhaustive experimental evaluation, we achieve competitive downstream performance with a parameter reduction of more than 85% on these layers. Further, T-FREE shows significant improvements in cross-lingual transfer learning.
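To make the mechanism in the abstract concrete, the following is a minimal sketch of the core idea: a word is padded with whitespace, split into character triplets, and each triplet is hashed into several rows of a shared embedding table; summing the activated rows yields the word embedding. The table size (8192), embedding dimension (16), number of hash functions (4), and MD5-based hashing here are illustrative assumptions, not the paper's actual design choices or hyperparameters.

import hashlib
import numpy as np

def trigrams(word: str) -> list[str]:
    # Character triplets over the whitespace-padded word, e.g. "cat" -> " ca", "cat", "at ".
    padded = f" {word} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def slot(tri: str, h: int, vocab_size: int) -> int:
    # Deterministically hash (trigram, hash index) into a row of the shared table.
    digest = hashlib.md5(f"{h}:{tri}".encode()).hexdigest()
    return int(digest, 16) % vocab_size

def sparse_word_embedding(word: str, table: np.ndarray, num_hashes: int = 4) -> np.ndarray:
    # Sum the rows activated by each trigram's num_hashes hash slots:
    # a sparse, multi-hot activation pattern over the embedding table.
    vocab_size, dim = table.shape
    emb = np.zeros(dim)
    for tri in trigrams(word):
        for h in range(num_hashes):
            emb += table[slot(tri, h, vocab_size)]
    return emb

# Toy usage: morphologically similar words share trigrams, hence table rows,
# so their embeddings overlap without any learned subword vocabulary.
rng = np.random.default_rng(0)
table = rng.normal(size=(8192, 16))  # far smaller than a typical subword embedding layer
a = sparse_word_embedding("run", table)
b = sparse_word_embedding("running", table)
print(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))

Because the table is indexed by hashed trigrams rather than a corpus-derived vocabulary, no reference corpus is needed, which is the property the abstract credits for the cross-lingual gains.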
