Tucker Attention: A Unified Framework for Parameter-Efficient Self-Attention Mechanisms

Introduction

Transformer-based architectures have evolved substantially in pursuit of computational efficiency. Self-attention mechanisms, foundational to modern large language models (LLMs) and vision transformers (ViTs), present a central challenge: balancing parameter count against model performance. Recent approaches such as Grouped-Query Attention (GQA) and Multi-Head Latent Attention (MLA)