LLaMA: Open and Efficient Foundation Language Models
LLaMA (Large Language Model Meta AI) is a series of foundational language models developed by Meta AI. The LLaMA models are designed to be efficient and effective for a wide range of natural language processing tasks. The LLaMA family includes models with various parameter sizes, allowing researchers and developers to choose the model that best fits their needs in terms of performance and computational resources.
Let’s take a look at the architecture of LLaMA and how it builds upon the original GPT architecture.
GPT-style Decoder-Only Transformer: A Technical Overview
1. Architecture Overview
The GPT architecture is a stacked decoder-only Transformer trained for causal language modeling. Unlike the original Transformer, which includes both an encoder and a decoder, GPT uses only the decoder stack, modified with causal (autoregressive) masking to prevent information flow from future tokens.
- Input: A sequence of token indices $x = [x_1, \dots, x_n]$
- Output: A probability distribution over the vocabulary for the next token $x_{t+1}$
- Objective: Maximize the log-likelihood $\sum_t \log P(x_t \mid x_{<t})$ of each token given its preceding context
2. Model Components
2.1 Token Embedding
Each input token $x_t$ is embedded into a $d_{\text{model}}$-dimensional vector using a learned embedding matrix $E \in \mathbb{R}^{V \times d_{\text{model}}}$:
\[\mathbf{X} \in \mathbb{R}^{n \times d_{\text{model}}}\]
2.2 Positional Encoding
GPT uses learned absolute positional embeddings:
\[\mathbf{P} \in \mathbb{R}^{n \times d_{\text{model}}}\]
Final input to the first layer:
\[\mathbf{H}_0 = \mathbf{X} + \mathbf{P}\]
LLaMA replaces this with rotary positional encoding (RoPE).
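To make this concrete, here is a minimal PyTorch sketch of the GPT-style input layer (learned token and position embeddings, summed). The class name and sizes are illustrative, not the actual GPT values.

```python
import torch
import torch.nn as nn

class GPTInputEmbedding(nn.Module):
    """Token embedding + learned absolute positional embedding (GPT-style)."""
    def __init__(self, vocab_size: int, d_model: int, max_len: int):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)   # E in R^{V x d_model}
        self.pos = nn.Embedding(max_len, d_model)      # P in R^{max_len x d_model}

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n) token indices
        n = x.size(1)
        positions = torch.arange(n, device=x.device)   # [0, 1, ..., n-1]
        return self.tok(x) + self.pos(positions)       # H_0 = X + P

# Illustrative usage with toy sizes
emb = GPTInputEmbedding(vocab_size=100, d_model=32, max_len=128)
h0 = emb(torch.randint(0, 100, (2, 10)))               # shape (2, 10, 32)
```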
2.3 Transformer Block (per layer $\ell$)
Each layer contains:
- Multi-head causal self-attention
- Feed-forward network (FFN)
- Residual connections and LayerNorm (post-norm)
a) Causal Self-Attention
For each head:
\[Q = H_\ell W^Q, \quad K = H_\ell W^K, \quad V = H_\ell W^V\]
Apply scaled dot-product attention with causal mask $M$:
\[\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^\top}{\sqrt{d_k}} + M \right)V\]
Concatenate heads and apply output projection:
\[\text{MultiHead}(H_\ell) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O\]
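To make the shapes concrete, here is a simplified PyTorch sketch of multi-head causal self-attention following the equations above (no dropout or KV cache; dimensions and module names are illustrative, not GPT's actual implementation).

```python
import math
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    """Multi-head causal self-attention: softmax(QK^T / sqrt(d_k) + M) V."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)  # W^O

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        b, n, d = h.shape
        # (b, n, d) -> (b, heads, n, d_k)
        def split(t):
            return t.view(b, n, self.n_heads, self.d_k).transpose(1, 2)
        q, k, v = split(self.q_proj(h)), split(self.k_proj(h)), split(self.v_proj(h))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        # Causal mask M: -inf above the diagonal (future positions)
        future = torch.triu(torch.ones(n, n, dtype=torch.bool, device=h.device), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))
        out = torch.softmax(scores, dim=-1) @ v
        out = out.transpose(1, 2).contiguous().view(b, n, d)   # concatenate heads
        return self.o_proj(out)
```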
b) Feed-Forward Network (FFN)
Standard FFN uses GELU activation:
\[\text{FFN}(x) = W_2 \cdot \text{GELU}(W_1 x)\]
c) Residual & LayerNorm (Post-Norm)
GPT uses post-normalization:
\[x = \text{LayerNorm}(x + \text{Sublayer}(x))\]
2.4 Output Layer
Final hidden state $H_L \in \mathbb{R}^{n \times d_{\text{model}}}$ is projected to logits over the vocabulary:
\[\text{logits} = H_L \cdot E^\top\]
GPT uses weight tying: the same $E$ is used for the input embedding and the output projection.
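In code, weight tying just reuses the embedding matrix as the output projection; a minimal illustrative sketch (names and sizes are toy values):

```python
import torch.nn as nn

d_model, vocab_size = 32, 100
embedding = nn.Embedding(vocab_size, d_model)          # E
lm_head = nn.Linear(d_model, vocab_size, bias=False)   # logits = H_L @ E^T
lm_head.weight = embedding.weight                      # tie: same parameter object
```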
LLaMA: Open and Efficient Foundation Language Models
Now we move to LLaMA, which is based on the GPT architecture but with some modifications and optimizations.
LLaMA is a decoder-only Transformer model, similar to the original GPT architecture. It is designed for autoregressive language modeling, meaning it predicts the next token in a sequence given the previous tokens. The architecture consists of multiple layers of Transformer blocks, each containing self-attention and feed-forward networks.
1. Overview
LLaMA (Large Language Model Meta AI) is a family of decoder-only autoregressive Transformers designed to be compute-efficient, open, and competitive with much larger models like GPT-3 and PaLM.
- Architecture: GPT-style, decoder-only Transformer
- Objective: Causal language modeling
- Sizes: 7B, 13B, 33B, 65B
- Training tokens: 1T (7B/13B), 1.4T (33B/65B)
- Data: Only public datasets (CommonCrawl, Wikipedia, ArXiv, GitHub, etc.)
- Innovations: RoPE, SwiGLU, RMSNorm, Chinchilla-inspired scaling
2. Core Architecture
LLaMA uses a stack of decoder-only Transformer blocks with architectural improvements. Each Transformer block consists of:
- Pre-normalized input
- Rotary Positional Embedding (RoPE)
- Causal Self-Attention
- SwiGLU Feedforward Network
- Residual connections
- RMSNorm
2.1 Token and Positional Embeddings
Traditional Transformers add absolute positional embeddings to the input token embeddings:
\[\text{Input} = E_{\text{token}} + E_{\text{position}}\]
This is done in:
- Transformer (Vaswani et al., 2017): sinusoidal
- GPT-2/3: learned position embeddings
2.1.1 Problem with Absolute Positional Embeddings
- Fixed sinusoidal embeddings are not learnable
- Learned embeddings are tied to max training length
- Both encode absolute positions but not relative distances
2.1.2 RoPE: Rotary Positional Embedding
LLaMA uses Rotary Positional Embedding (RoPE), introduced by Su et al., 2021.
Instead of adding positions to inputs, RoPE rotates the query and key vectors in attention:
\[Q_{\text{rot}} = \text{RoPE}(Q), \quad K_{\text{rot}} = \text{RoPE}(K)\]
Each 2D subspace is rotated as:
\[\text{RoPE}\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \theta \right) = R(\theta) \cdot \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} x_1 \cos \theta - x_2 \sin \theta \\ x_1 \sin \theta + x_2 \cos \theta \end{bmatrix}\]This rotation encodes position directly into the dot product:
\[A_{ij} = \langle \text{RoPE}(Q_i), \text{RoPE}(K_j) \rangle\]
2.1.3 What is $R(i)$ in RoPE?
Each query/key vector $x \in \mathbb{R}^d$ is split into $d/2$ 2D components:
\[x = \left[ \begin{bmatrix} x_0 \\ x_1 \end{bmatrix}, \begin{bmatrix} x_2 \\ x_3 \end{bmatrix}, \dots \right]\]
RoPE rotates each pair using a standard 2D rotation matrix:
\[R_k(i) = \begin{bmatrix} \cos(\theta_i^{(k)}) & -\sin(\theta_i^{(k)}) \\ \sin(\theta_i^{(k)}) & \cos(\theta_i^{(k)}) \end{bmatrix}\]
Where the angle depends on the position $i$ and dimension $k$:
\[\theta_i^{(k)} = \frac{i}{10000^{2k/d}}\]
The full $R(i)$ is a block-diagonal matrix:
\[R(i) = \text{diag}(R_0(i), R_1(i), \dots, R_{d/2-1}(i))\]
RoPE applies:
\[\text{RoPE}(x)_k = \begin{bmatrix} x_{2k} \cos \theta_i^{(k)} - x_{2k+1} \sin \theta_i^{(k)} \\ x_{2k} \sin \theta_i^{(k)} + x_{2k+1} \cos \theta_i^{(k)} \end{bmatrix}\]
This rotation preserves vector norms, and because the angle grows linearly with position, it encodes relative position directly in the attention dot product.
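Below is a minimal PyTorch sketch of RoPE following these equations, operating on a (batch, heads, seq, head_dim) tensor. It is written for clarity rather than matching LLaMA's exact implementation.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding.

    x: (..., n, d) query or key vectors for one head; d must be even.
    Each pair (x_{2k}, x_{2k+1}) at position i is rotated by theta = i / base^(2k/d).
    """
    n, d = x.shape[-2], x.shape[-1]
    two_k = torch.arange(0, d, 2, dtype=torch.float32, device=x.device)   # 0, 2, ..., d-2
    inv_freq = base ** (-two_k / d)                                       # 1 / base^(2k/d)
    pos = torch.arange(n, dtype=torch.float32, device=x.device)           # positions i
    theta = pos[:, None] * inv_freq[None, :]                              # (n, d/2)
    cos, sin = theta.cos(), theta.sin()
    x_even, x_odd = x[..., 0::2], x[..., 1::2]                            # x_{2k}, x_{2k+1}
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

# Usage: rotate queries and keys before the attention dot product
q = torch.randn(2, 8, 16, 64)   # (batch, heads, seq, head_dim)
q_rot = rope(q)
```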
2.1.4 Why is RoPE relative?
Let $q_i$ and $k_j$ be the query and key vectors at positions $i$ and $j$.
RoPE rotates them with position-dependent matrices $R(i)$ and $R(j)$:
\[\text{RoPE}(q_i) = R(i) \cdot q_i, \quad \text{RoPE}(k_j) = R(j) \cdot k_j\]
Then the attention score becomes:
\[A_{ij} = \langle R(i) q_i, R(j) k_j \rangle = \langle q_i, R(i)^\top R(j) k_j \rangle\]
where $A_{ij}$ is the (pre-softmax) attention score between positions $i$ and $j$; without RoPE it would simply be
\[A_{ij} = \text{Attention}(q_i, k_j) = \frac{\langle q_i, k_j \rangle}{\sqrt{d_k}}\]
(the $1/\sqrt{d_k}$ scaling is dropped in the rotated expressions for clarity). Let:
\[R(i, j) := R(i)^\top R(j)\]
Then:
\[A_{ij} = \langle q_i, R(i, j) k_j \rangle\]
Because each block of $R(i)$ is a rotation by an angle linear in the position $i$, the product $R(i)^\top R(j)$ depends only on the relative offset:
\[R(i)^\top R(j) = R(j - i)\]
So the attention score is:
\[A_{ij} = \langle q_i, R(j - i) k_j \rangle\]
This shows that RoPE encodes relative position: the positional dependence of the attention score comes only through the offset $j - i$, not through $i$ or $j$ individually.
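This identity is easy to verify numerically for a single 2D pair; the following self-contained sketch uses an arbitrary frequency and random vectors.

```python
import math
import torch

def rot(theta: float) -> torch.Tensor:
    """2D rotation matrix R(theta)."""
    return torch.tensor([[math.cos(theta), -math.sin(theta)],
                         [math.sin(theta),  math.cos(theta)]])

q, k = torch.randn(2), torch.randn(2)
i, j, freq = 5, 9, 0.3                            # positions and an arbitrary frequency
lhs = (rot(i * freq) @ q) @ (rot(j * freq) @ k)   # <R(i) q, R(j) k>
rhs = q @ (rot((j - i) * freq) @ k)               # <q, R(j - i) k>
assert torch.allclose(lhs, rhs, atol=1e-5)        # positional dependence only via j - i
```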
2.1.5 Advantages of RoPE
Feature | Absolute (sin/cos or learned) | RoPE |
---|---|---|
Relative position modeling | ❌ No | ✅ Yes |
Extrapolation to long seqs | ❌ Poor | ✅ Good |
Parameter count | ✅ 0 (fixed) or ❌ learned | ✅ 0 (rotation) |
Efficiency | ✅ Yes | ✅ Yes |
Used in | Transformer, GPT-3 | ✅ LLaMA, ChatGLM, GPT-NeoX |
2.2 Transformer Block (Layer $\ell$)
Each of the $L$ layers is defined as:
\[x \leftarrow x + \text{Attention}(\text{RMSNorm}(x))\]
\[x \leftarrow x + \text{FFN}(\text{RMSNorm}(x))\]
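In code, each layer is two pre-norm residual updates. A schematic sketch is shown below; in LLaMA the norm modules would be RMSNorm and the feed-forward a SwiGLU FFN (both detailed in the subsections that follow), so the sublayers are passed in rather than hard-coded here.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm residual block: x <- x + sublayer(norm(x)), applied twice per layer."""
    def __init__(self, norm1: nn.Module, attn: nn.Module, norm2: nn.Module, ffn: nn.Module):
        super().__init__()
        self.norm1, self.attn, self.norm2, self.ffn = norm1, attn, norm2, ffn

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.norm1(x))   # x <- x + Attention(RMSNorm(x))
        x = x + self.ffn(self.norm2(x))    # x <- x + FFN(RMSNorm(x))
        return x
```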
2.2.1 GPT-Style Autoregressive Decoder
(1) Original Transformer Decoder (Vaswani et al., 2017)
In encoder-decoder Transformers (e.g., for translation), the decoder has:
- Self-attention (causal masked)
- Cross-attention (attends to encoder outputs)
- Feedforward network (FFN)
Each block:
\[x = x + \text{SelfAttention}(\text{Norm}(x))\]
\[x = x + \text{CrossAttention}(\text{Norm}(x))\]
\[x = x + \text{FFN}(\text{Norm}(x))\]
(2) GPT Simplifies to Decoder-Only
GPT removes the encoder and cross-attention:
- No encoder context
- Only causal self-attention and FFN
Each GPT block:
\[x = x + \text{SelfAttention}(\text{Norm}(x))\]
\[x = x + \text{FFN}(\text{Norm}(x))\]
(3) Autoregressive Causal Mask
GPT is trained to predict the next token:
\[\mathcal{L} = \sum_{t=1}^{n} \log P(x_t \mid x_{<t})\]
Attention mask:
\[\text{Mask}_{i,j} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}\]
This prevents attending to future tokens.
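A sketch of how the additive causal mask and the shifted next-token loss are commonly built in PyTorch (illustrative shapes and random tensors; the mask would be added to the attention scores before the softmax):

```python
import torch
import torch.nn.functional as F

n = 5
# Additive mask: 0 where j <= i, -inf where j > i
mask = torch.full((n, n), float("-inf")).triu(diagonal=1)

# Next-token objective: predict x_t from x_{<t}.
vocab, batch = 11, 2
tokens = torch.randint(0, vocab, (batch, n))   # input sequence
logits = torch.randn(batch, n, vocab)          # stand-in for model output
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab),         # predictions for positions 1..n-1
    tokens[:, 1:].reshape(-1),                 # targets shifted by one
)
```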
(4) Summary Table
Component | Encoder-Decoder Transformer | GPT-Style Decoder |
---|---|---|
Cross-Attention | ✅ Yes | ❌ No |
Encoder Needed | ✅ Yes | ❌ No |
Causal Masking | ✅ Yes | ✅ Yes |
Direction | Bidirectional (via encoder) | ✅ Left-to-right only |
Output | Full sequence | ✅ One token at a time |
Used in | T5, original Transformer | ✅ GPT-2/3, LLaMA |
GPT-style decoder is simpler, autoregressive, and highly scalable — the foundation of LLaMA.
2.2.2 RMSNorm
(1) Original: Layer Normalization
Before RMSNorm, Transformers normalized activations with LayerNorm, typically applied after each sublayer (post-norm).
LayerNorm (Ba et al., 2016) is used in both encoder and decoder of the original Transformer and in GPT-2/GPT-3. It normalizes activations per token across hidden dimensions.
For a vector $x \in \mathbb{R}^d$, LayerNorm is:
\[\text{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta\]
Where:
- $\mu = \frac{1}{d} \sum_{i=1}^{d} x_i$, mean of the vector
- $\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2$, variance of the vector
- $\gamma$, $\beta$ are learnable affine parameters
- $\epsilon$ is a small constant for numerical stability
- $d$ is the hidden dimension size
(2) Problem with LayerNorm
- Requires both mean and variance computation
- Sensitive to numerical precision (especially in bfloat16)
- Slower and more complex
(3) RMSNorm
LLaMA instead applies RMSNorm in a pre-norm configuration.
RMSNorm (Zhang & Sennrich, 2019) normalizes by the root mean square alone, without subtracting the mean:
\[\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d} \sum_{i=1}^{d} x_i^2 + \epsilon}} \cdot \gamma\]
- No centering
- Only one learned parameter $\gamma$
- Used in pre-norm form in LLaMA
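A minimal RMSNorm module matching the formula above; this is written for illustration rather than copied from any particular codebase.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """x / sqrt(mean(x^2) + eps) * gamma  -- no mean subtraction, no bias."""
    def __init__(self, d: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(d))   # single learnable scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x / rms * self.gamma
```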
(4) Advantages of RMSNorm
Feature | LayerNorm | RMSNorm |
---|---|---|
Mean subtraction | ✅ Yes | ❌ No |
Variance norm | ✅ Yes | ❌ No (uses RMS only) |
Trainable scale | $\gamma$, $\beta$ | $\gamma$ only |
Efficiency | ❌ Slower | ✅ Faster |
Stability (bfloat) | ❌ May cause instability | ✅ More numerically stable |
Simplicity | ❌ Complex stats | ✅ Simpler |
(5) Post-Norm and Pre-Norm
Post-Norm is used in the original Transformer and GPT-2/3.
Normalization is applied after each sublayer:
\[x = \text{LayerNorm}(x + \text{Sublayer}(x))\]
Pre-Norm is used in LLaMA, T5, and other models.
Normalization is applied before each sublayer:
\[x = x + \text{Sublayer}(\text{Norm}(x))\]
where $\text{Norm}(x)$ in LLaMA is RMSNorm:
\[x = x + \text{Attention}(\text{RMSNorm}(x))\]
\[x = x + \text{FFN}(\text{RMSNorm}(x))\]
Why was Post-Norm used originally?
- The original Transformer (Vaswani et al., 2017) used post-norm with LayerNorm, which worked fine for shallow models (6 layers).
- GPT-2/3 continued this for consistency.
But as models became deeper, post-norm showed problems.
Why Is Pre-Norm Better for Deep Transformers?
- Improved gradient flow
  - Pre-Norm allows gradients to propagate more directly through residual paths.
  - Enables training of deeper networks (e.g., 100+ layers) without vanishing gradients.
- Better stability for mixed-precision training
  - Post-Norm can produce large intermediate values before normalization, which is problematic in bfloat16/fp16.
  - Pre-Norm constrains activations earlier, improving numerical robustness.
- More robust optimization
  - Pre-Norm works better with AdamW and weight decay.
  - It smooths training dynamics, especially for large-scale models like LLaMA.
Summary:
LLaMA uses RMSNorm + pre-norm to ensure stable training at scale, better gradient flow, and numerical robustness under mixed precision. This is more effective than the original Transformer's post-norm LayerNorm, especially for deep and large autoregressive models.
Feature | Post-Norm | Pre-Norm |
---|---|---|
Equation | $x = \text{Norm}(x + \text{Sublayer}(x))$ | $x = x + \text{Sublayer}(\text{Norm}(x))$ |
Used in | Transformer, GPT-2/3 | LLaMA, GPT-J, T5 |
Gradient Flow | ❌ Risk of vanishing | ✅ Stable |
Deep Model Support | ❌ Limited | ✅ Supports 100+ layers |
Mixed Precision | ❌ Unstable in bfloat16/fp16 | ✅ Robust |
Pairing with RMSNorm | ❌ Rare | ✅ Ideal |
2.2.3 Self-Attention with RoPE
Causal self-attention is computed as:
\[\text{Attention}(Q, K, V) = \text{softmax} \left( \frac{\text{RoPE}(Q) \cdot \text{RoPE}(K)^\top}{\sqrt{d_k}} + M \right) V\]
- $Q$, $K$, $V$ are projections of the normalized input.
- RoPE encodes relative positional information via rotation in complex space.
- $M$ is the causal mask to prevent attention to future tokens.
2.2.4 Feedforward Network and Activation Function (SwiGLU)
The original Transformer and GPT models use a two-layer feedforward network (FFN) in each block:
\[\text{FFN}(x) = W_2 \cdot \phi(W_1 x)\]
Where $\phi$ is a nonlinear activation function:
- ReLU in Transformer
- GELU in GPT-2/3
(1) Original: ReLU and GELU
- Transformer uses ReLU:
\[\text{ReLU}(x) = \max(0, x)\]
- GPT-3 uses GELU:
\[\text{GELU}(x) = x \cdot \Phi(x)\]
Where $\Phi(x)$ is the cumulative distribution function (CDF) of a standard Gaussian.
(2) Problem with ReLU / GELU
- ReLU is simple but can be too sparse and under-expressive for large models.
- GELU is smoother than ReLU but:
- Still ungated: lacks multiplicative interaction
- Still single-branch: doesn’t separate flow of control and content
- Lacks capacity for fine-grained regulation across dimensions
(3) SwiGLU
LLaMA uses SwiGLU (Shazeer, 2020), a gated activation function combining Swish and GLU:
\[\text{SwiGLU}(x_1, x_2) = \text{Swish}(x_2) \cdot x_1\]
With:
\[\text{Swish}(x) = x \cdot \sigma(x)\]
where $\sigma(x)$ is the sigmoid function:
\[\sigma(x) = \frac{1}{1 + e^{-x}}\]
Figure: Visualization of the SwiGLU activation function compared to other activation functions.
So the FFN becomes:
\[\text{FFN}(x) = W_3 \cdot \left( \text{SwiGLU}(W_1 x, W_2 x) \right)\]
Where:
- $W_1, W_2$ project $x$ to two intermediate channels
- $W_3$ projects back to $d_{\text{model}}$
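A sketch of this SwiGLU FFN, following the $W_1/W_2/W_3$ naming and gating convention used above (an illustrative implementation, not LLaMA's exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """FFN(x) = W3 * (Swish(W2 x) * (W1 x)), with Swish(x) = x * sigmoid(x) (i.e. SiLU)."""
    def __init__(self, d_model: int, d_ffn: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ffn, bias=False)   # content branch (x_1)
        self.w2 = nn.Linear(d_model, d_ffn, bias=False)   # gate branch (x_2)
        self.w3 = nn.Linear(d_ffn, d_model, bias=False)   # project back to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w3(F.silu(self.w2(x)) * self.w1(x))
```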
(4) Advantages of SwiGLU
Feature | ReLU / GELU | SwiGLU |
---|---|---|
Gating | ❌ None | ✅ Yes (learned) |
Interaction | ❌ Element-wise only | ✅ Multiplicative, gated |
Gradient Flow | ❌ Sparse (ReLU) | ✅ Smooth (Swish) |
Expressiveness | ❌ Limited | ✅ High |
Used in | GPT, BERT, Transformer | ✅ LLaMA, PaLM, T5.1.1 |
2.3 Chinchilla Scaling Laws
2.3.1 Original Assumption: Bigger = Better
Earlier trends (e.g., GPT-3) focused on scaling model size:
- GPT-3: 175B parameters, trained on only ~300B tokens
- This assumed more parameters → better performance
2.3.2 Problem with GPT-3 Scaling
- GPT-3 was undertrained
- Too few tokens for such a large model
- Result: suboptimal performance, inefficient compute
2.3.3 Chinchilla Scaling Law
Hoffmann et al. (2022) proposed:
Optimal performance = smaller model + more data
Scaling law:
\[N \propto D\]
That is, for a fixed compute budget, model parameters and training tokens should be scaled roughly in proportion (on the order of 20 tokens per parameter), where:
- $N$ = model parameters
- $D$ = training tokens
2.3.4 LLaMA Follows Chinchilla
Model | Parameters | Training Tokens |
---|---|---|
LLaMA-7B | 7B | 1.0T |
LLaMA-13B | 13B | 1.0T |
LLaMA-33B | 33B | 1.4T |
LLaMA-65B | 65B | 1.4T |
- LLaMA is trained longer, not just made bigger
- LLaMA-13B outperforms GPT-3 (175B) on many benchmarks
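As a rough worked comparison using the token and parameter counts quoted in this section (GPT-3's ~300B tokens and 175B parameters versus LLaMA's), the tokens-per-parameter ratios look like this:

```python
# Tokens per parameter, using the approximate counts from the table and text above.
models = {"LLaMA-7B": (7e9, 1.0e12), "LLaMA-13B": (13e9, 1.0e12),
          "LLaMA-33B": (33e9, 1.4e12), "LLaMA-65B": (65e9, 1.4e12),
          "GPT-3 175B": (175e9, 300e9)}
for name, (params, tokens) in models.items():
    print(f"{name}: {tokens / params:.0f} tokens per parameter")
# LLaMA-7B ~143, 13B ~77, 33B ~42, 65B ~22; GPT-3 ~2
```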
2.3.5 Benefits of Chinchilla-style Scaling
Feature | GPT-style (Kaplan) | Chinchilla-style (LLaMA) |
---|---|---|
Assumes | Bigger = better | Data ↔ size balance |
Token count | ~300B | 1.0T – 1.4T |
Compute efficiency | ❌ Suboptimal | ✅ Compute-optimal |
Sample efficiency | ❌ Lower | ✅ Higher |
Model size vs quality | ❌ Larger for same quality | ✅ Smaller = better quality |
2.4 Output Projection
After the final layer, the output is projected to vocabulary logits:
\[\text{logits} = H_L \cdot E^\top\]
Where:
- $H_L \in \mathbb{R}^{n \times d_{\text{model}}}$ is the final hidden state.
- $E$ is the shared token embedding matrix (weight tying).
3. Model Sizes
Model | Layers ($L$) | Hidden dim ($d_{\text{model}}$) | Heads | FFN dim ($d_{\text{ffn}}$) | Parameters |
---|---|---|---|---|---|
LLaMA-7B | 32 | 4096 | 32 | 11008 | 7B |
LLaMA-13B | 40 | 5120 | 40 | 13824 | 13B |
LLaMA-33B | 60 | 6656 | 52 | 17920 | 33B |
LLaMA-65B | 80 | 8192 | 64 | 22016 | 65B |
The FFN dimension uses approximately:
\[d_{\text{ffn}} \approx \frac{2}{3} \cdot 4 d_{\text{model}}\]
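As a sanity check against the table, for LLaMA-7B this gives $\frac{2}{3} \cdot 4 \cdot 4096 \approx 10923$, which matches the listed 11008 after rounding up to a multiple of 256 (the rounding granularity is an inference from the listed dimensions, not stated in this section).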
4. Training Setup
- Optimizer: AdamW
  - $\beta_1 = 0.9$, $\beta_2 = 0.95$
  - Weight decay = 0.1
- Learning rate schedule: cosine decay with warmup
- Sequence length: 2048 tokens
- Precision: bfloat16
- Memory and compute optimization:
  - Activation checkpointing
  - FlashAttention or xformers-style attention kernels
  - Tensor parallelism and sequence parallelism
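As an illustration, the optimizer hyperparameters above map onto PyTorch roughly as follows. This is a hedged sketch: `model`, the peak learning rate, and the step counts are placeholders, and the warmup-plus-cosine schedule is a generic approximation rather than the paper's exact schedule.

```python
import math
import torch

model = torch.nn.Linear(8, 8)                            # placeholder for the actual Transformer
max_lr, warmup_steps, total_steps = 3e-4, 2000, 100_000  # illustrative values

optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr,
                              betas=(0.9, 0.95), weight_decay=0.1)

def lr_lambda(step: int) -> float:
    """Linear warmup followed by cosine decay (returns a multiplier on max_lr)."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```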