Self-attention mechanisms allow the elements of a sequence to interact directly with one another: for each element, the mechanism computes how important every other element is, enabling adaptive focus and the modeling of long-range dependencies across a wide range of applications.
Scaled dot-product attention, the core of self-attention, efficiently computes similarities between query and key vectors. Multi-head attention in transformers further enhances model capacity by employing parallel attention mechanisms, each focusing on different aspects of the input.
Self-Attention Mechanisms
Concept of self-attention
- Self-attention mechanism allows elements in a sequence to interact dynamically, capturing contextual relationships
- Computes the importance of every other element for each element in the sequence, enabling adaptive focus
- Enables modeling of long-range dependencies, overcoming limitations of recurrent neural networks (RNNs)
- Key components include Query, Key, and Value vectors derived from input representations (see the sketch after this list)
- Process involves computing similarity between query and key vectors, then weighting the value vectors accordingly
- Widely applied in natural language processing (machine translation), computer vision (image captioning), and speech recognition (audio transcription)
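To make the Query/Key/Value derivation concrete, here is a minimal NumPy sketch; the shapes, random weight matrices, and variable names are illustrative assumptions rather than values from any particular model:

```python
import numpy as np

# Illustrative sketch: deriving Query, Key, and Value vectors from input token
# representations via linear projections. W_q, W_k, W_v are random stand-ins
# for learned weight matrices.
rng = np.random.default_rng(0)

n, d_model, d_k = 5, 16, 8          # sequence length, model dim, key/query dim
X = rng.normal(size=(n, d_model))   # input token representations

W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v  # one query/key/value vector per token
print(Q.shape, K.shape, V.shape)     # (5, 8) (5, 8) (5, 8)
```

In a trained model the projection matrices are learned parameters rather than random arrays; only the matrix shapes carry over from this sketch.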
Scaled dot-product attention mechanism
- Formula: $Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$
- Components: Q (Query matrix), K (Key matrix), V (Value matrix), $d_k$ (dimension of key vectors)
- Implementation steps (a code sketch follows this list):
- Compute dot product of query and key matrices
- Scale result by $\frac{1}{\sqrt{d_k}}$ to stabilize gradients
- Apply softmax function to obtain attention weights
- Multiply result with value matrix to get final output
- Computational complexity: Time $O(n^2d)$ for sequence length n and embedding dimension d, Space $O(n^2)$ for attention weights
- Advantages include efficient matrix multiplication and stable gradients due to scaling factor
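The steps above map almost line-for-line onto code. The following is a small NumPy sketch of the formula for a single unbatched, unmasked sequence; the helper names and example shapes are assumptions for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query with every key
    weights = softmax(scores, axis=-1)   # attention weights, each row sums to 1
    return weights @ V, weights          # weighted sum of value vectors

# Example: 5 tokens, key/value dimension 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)       # (5, 8) (5, 5)
```

The returned `weights` array is the $n \times n$ attention matrix referred to in the complexity analysis above.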
Multi-head attention in transformers
- Parallel attention mechanisms use different learned linear projections for queries, keys, and values
- Typically employs 8 to 16 heads, each focusing on different aspects of the input (syntactic, semantic)
- Process (sketched in code after this list):
- Create linear projections of input for each head
- Apply scaled dot-product attention to each head independently
- Concatenate outputs from all heads
- Apply final linear transformation to produce output
- Allows the model to jointly attend to information from different representation subspaces, enhancing model capacity
- Dimension of each head: $d_{model} / h$, where $h$ is the number of heads
- Computational cost remains similar to single-head attention due to reduced dimensionality per head
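A minimal NumPy sketch of the process above, assuming the per-head projections are realized by slicing larger $d_{model} \times d_{model}$ weight matrices (equivalent to keeping separate per-head matrices); the random weights and shapes are placeholders for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention applied to a single head.
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V

def multi_head_attention(X, weights, h):
    """Project per head, attend independently, concatenate, project once more."""
    W_q, W_k, W_v, W_o = weights
    d_model = X.shape[-1]
    d_head = d_model // h                        # dimension per head: d_model / h
    heads = []
    for i in range(h):
        sl = slice(i * d_head, (i + 1) * d_head)
        # Each head uses its own slice of the projection matrices.
        Q, K, V = X @ W_q[:, sl], X @ W_k[:, sl], X @ W_v[:, sl]
        heads.append(attention(Q, K, V))         # scaled dot-product per head
    concat = np.concatenate(heads, axis=-1)      # (n, d_model)
    return concat @ W_o                          # final linear transformation

# Example: 5 tokens, d_model = 16, h = 4 heads
rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
X = rng.normal(size=(n, d_model))
weights = tuple(rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, weights, h).shape)  # (5, 16)
```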
Interpretation of attention weights
- Higher weights indicate stronger relationships between elements, reflecting how much each part of the context matters for a given token
- Visualization techniques include heatmaps of attention weights (sketched in code after this list) and attention flow graphs
- Analysis of attention patterns helps identify linguistic phenomena (coreference resolution) and understand model behavior
- Visualization tools: BertViz for BERT models, the Tensor2Tensor library for Transformer visualizations
- Applications include model debugging, improving interpretability, and identifying biases in the model
- Limitations: attention ≠ explanation; visualizations require careful interpretation to avoid misleading conclusions
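As a simple illustration of the heatmap visualization mentioned in the list above, the following matplotlib sketch plots a made-up attention matrix over a made-up token sequence; in practice the weights would be extracted from a specific layer and head of a trained model, for example via BertViz:

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative attention heatmap. Tokens and weights are invented here; a real
# analysis would use weights taken from a trained model.
tokens = ["The", "cat", "sat", "on", "the", "mat"]
rng = np.random.default_rng(0)
scores = rng.normal(size=(len(tokens), len(tokens)))
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # rows sum to 1

fig, ax = plt.subplots()
im = ax.imshow(weights, cmap="viridis")
ax.set_xticks(range(len(tokens)))
ax.set_xticklabels(tokens)
ax.set_yticks(range(len(tokens)))
ax.set_yticklabels(tokens)
ax.set_xlabel("Key (attended-to token)")
ax.set_ylabel("Query token")
fig.colorbar(im, label="attention weight")
plt.show()
```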