
Publication

On Zero-Initialized Attention: Optimal Prompt and Gating Factor Estimation

Nghiem Tuong Diep; Huy Nguyen; Chau Nguyen; Minh Le; Ho Minh Duy Nguyen; Daniel Sonntag; Mathias Niepert; Nhat Ho
In: Aarti Singh; Maryam Fazel; Daniel Hsu; Simon Lacoste-Julien; Felix Berkenkamp; Tegan Maharaj; Kiri Wagstaff; Jerry Zhu (Eds.). Proceedings of the 42nd International Conference on Machine Learning (ICML-2025), pages 13713-13745, Proceedings of Machine Learning Research, Vol. 267, PMLR, July 2025.

Abstract

LLaMA-Adapter has recently emerged as an efficient fine-tuning technique for LLaMA models, leveraging zero-initialized attention to stabilize training and enhance performance. However, despite its empirical success, the theoretical foundations of zero-initialized attention remain largely unexplored. In this paper, we provide a rigorous theoretical analysis, establishing a connection between zero-initialized attention and mixture-of-experts models. We prove that both linear and non-linear prompts, along with the gating functions, can be optimally estimated, with non-linear prompts offering greater flexibility for future applications. Empirically, we validate our findings on open LLM benchmarks, demonstrating that non-linear prompts outperform linear ones. Notably, even with limited training data, both prompt types consistently surpass vanilla attention, highlighting the robustness and adaptability of zero-initialized attention.
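
For readers unfamiliar with the mechanism, the sketch below illustrates one common formulation of zero-initialized attention in the style of LLaMA-Adapter: learnable prompt tokens contribute to the attention output only through a gating factor initialized at zero, so the layer starts out reproducing the frozen model's vanilla attention. This is a minimal single-head PyTorch sketch, not the paper's implementation; the class name ZeroInitGatedAttention, the prompt_len parameter, and the separate-softmax combination are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ZeroInitGatedAttention(nn.Module):
    """Minimal single-head sketch of zero-initialized attention (illustrative).

    Learnable prompt tokens are attended to alongside the input tokens,
    but their contribution is scaled by a gating factor initialized at
    zero, so training starts from the frozen model's vanilla attention.
    """

    def __init__(self, dim: int, prompt_len: int = 10):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        # Learnable prompt and zero-initialized gate (hypothetical names).
        self.prompt = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)
        self.gate = nn.Parameter(torch.zeros(1))
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); queries come from input tokens only.
        b, n, d = x.shape
        q = self.q_proj(x)
        k_x, v_x = self.k_proj(x), self.v_proj(x)
        p = self.prompt.unsqueeze(0).expand(b, -1, -1)
        k_p, v_p = self.k_proj(p), self.v_proj(p)

        # Vanilla attention over input tokens, gated attention over prompts.
        attn_x = F.softmax(q @ k_x.transpose(-2, -1) * self.scale, dim=-1)
        attn_p = F.softmax(q @ k_p.transpose(-2, -1) * self.scale, dim=-1)

        out = attn_x @ v_x + torch.tanh(self.gate) * (attn_p @ v_p)
        return self.out_proj(out)

# Quick shape check (hypothetical sizes).
layer = ZeroInitGatedAttention(dim=64)
x = torch.randn(2, 16, 64)
print(layer(x).shape)  # torch.Size([2, 16, 64])

At initialization tanh(gate) = 0, so the output equals vanilla attention over the input tokens; the prompt's influence grows only as the gate is learned, which is the stabilizing property the abstract refers to.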
