Publication

Divergent Token Metrics: Measuring degradation to prune away LLM components - and optimize quantization

Björn Deiseroth; Max Meuer; Nikolas Gritsch; Constantin Eichenberg; Patrick Schramowski; Matthias Aßenmacher; Kristian Kersting
In: Computing Research Repository eprint Journal (CoRR), Vol. abs/2311.01544, Pages 1-20, arXiv, 2023.

Abstract

Large Language Models (LLMs) have reshaped natural language processing with their impressive capabilities. However, their ever-increasing size has raised concerns about their effective deployment and the need for LLM compression. This study introduces the Divergent Token Metrics (DTMs), a novel approach to assessing compressed LLMs, addressing the limitations of traditional perplexity or accuracy measures that fail to accurately reflect text generation quality. DTMs measure token divergences that allow deeper insights into the subtleties of model compression, in particular, when evaluating components' impacts individually. Utilizing the First Divergent Token Metric (FDTM) in model sparsification reveals that 25% of all attention components can be pruned beyond 90% on the Llama-2 model family, still keeping SOTA performance. For quantization, FDTM suggests that more than 80% of the parameters can be naively transformed to int8 without special outlier management. These evaluations indicate the necessity of choosing appropriate compressions for parameters individually—and that FDTM can identify those—while standard metrics result in deteriorated outcomes.
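The core idea of the First Divergent Token Metric can be illustrated as follows: given the greedy continuations of an original model and its compressed counterpart on the same prompt, FDTM looks at the position of the first token where the two outputs diverge. A minimal sketch of that comparison (the function name and interface are illustrative, not taken from the paper, which defines the metric over model generations rather than pre-computed token lists):

```python
def first_divergent_token(base_tokens: list[int], compressed_tokens: list[int]) -> int:
    """Return the index of the first position where two greedy decodings diverge.

    Illustrative sketch: the paper's FDTM compares a compressed model's
    greedy generation against the original model's; here the two token
    sequences are assumed to be pre-computed.
    """
    for i, (a, b) in enumerate(zip(base_tokens, compressed_tokens)):
        if a != b:
            return i
    # No divergence within the overlapping prefix
    return min(len(base_tokens), len(compressed_tokens))

# Example: the sequences first differ at position 3
print(first_divergent_token([5, 8, 2, 9, 1], [5, 8, 2, 7, 1]))  # → 3
```

A later first-divergence position indicates that compressing a given component perturbs generation less, which is what allows the metric to rank components individually for pruning or quantization.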