Publication
Divergent Token Metrics: Measuring degradation to prune away LLM components - and optimize quantization
Björn Deiseroth; Max Meuer; Nikolas Gritsch; Constantin Eichenberg; Patrick Schramowski; Matthias Aßenmacher; Kristian Kersting
In: Computing Research Repository (CoRR), Vol. abs/2311.01544, Pages 1-20, arXiv, 2023.
Abstract
Large Language Models (LLMs) have reshaped natural language processing with their impressive capabilities. However, their ever-increasing size has raised concerns about their effective deployment and the need for LLM compression. This study introduces the Divergent Token Metrics (DTMs), a novel approach to assessing compressed LLMs, addressing the limitations of traditional perplexity or accuracy measures that fail to accurately reflect text generation quality. DTMs measure token divergences that allow deeper insights into the subtleties of model compression, in particular, when evaluating components' impacts individually. Utilizing the First Divergent Token Metric (FDTM) in model sparsification reveals that 25% of all attention components can be pruned beyond 90% on the Llama-2 model family, still keeping SOTA performance. For quantization, FDTM suggests that more than 80% of the parameters can be naively transformed to int8 without special outlier management. These evaluations indicate the necessity of choosing appropriate compressions for parameters individually, and that FDTM can identify those, while standard metrics result in deteriorated outcomes.
