11 minute read

LLaMA (Large Language Model Meta AI) is a series of foundational language models developed by Meta AI. The LLaMA models are designed to be efficient and effective for a wide range of natural language processing tasks. The LLaMA family includes models with various parameter sizes, allowing researchers and developers to choose the model that best fits their needs in terms of performance and computational resources.

Let’s take a look at the architecture of LLaMA and how it builds upon the original GPT architecture.

GPT-style Decoder-Only Transformer: A Technical Overview

1. Architecture Overview

The GPT architecture is a stacked decoder-only Transformer trained for causal language modeling. Unlike the original Transformer which includes both encoder and decoder, GPT uses only the decoder stack, modified with causal (autoregressive) masking to prevent information flow from future tokens.

  • Input: A sequence of token indices $x = [x_1, \dots, x_n]$
  • Output: A probability distribution over the vocabulary for the next token $x_{t+1}$
  • Objective: Maximize the log-likelihood
\[\mathcal{L} = \sum_{t=1}^{n} \log P(x_t \mid x_{<t})\]
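As a concrete sketch of this objective, the loss can be computed with a standard cross-entropy over shifted targets. The `model` below is a placeholder for any network that maps token ids to per-position logits (names are illustrative, not from the original papers):

```python
import torch.nn.functional as F

def causal_lm_loss(model, tokens):
    """tokens: (batch, seq_len) integer ids; model returns (batch, seq_len, vocab) logits."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict x_{t+1} from x_{<=t}
    logits = model(inputs)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),          # flatten positions
        targets.reshape(-1),                          # next-token ids
    )                                                 # negative mean of log P(x_t | x_{<t})
```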

2. Model Components

2.1 Token Embedding

Each input token $x_t$ is embedded into a $d_{\text{model}}$-dimensional vector using a learned embedding matrix $E \in \mathbb{R}^{V \times d_{\text{model}}}$:

\[\mathbf{X} \in \mathbb{R}^{n \times d_{\text{model}}}\]

2.2 Positional Encoding

GPT uses learned absolute positional embeddings:

\[\mathbf{P} \in \mathbb{R}^{n \times d_{\text{model}}}\]

Final input to the first layer:

\[\mathbf{H}_0 = \mathbf{X} + \mathbf{P}\]
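A minimal sketch of this input pipeline, with illustrative names (`V` for vocabulary size, `max_len` for the maximum trained sequence length):

```python
import torch
import torch.nn as nn

class GPTInput(nn.Module):
    def __init__(self, V, max_len, d_model):
        super().__init__()
        self.tok = nn.Embedding(V, d_model)         # E in R^{V x d_model}
        self.pos = nn.Embedding(max_len, d_model)   # learned absolute positions P

    def forward(self, x):                           # x: (batch, n) token ids
        positions = torch.arange(x.size(1), device=x.device)
        return self.tok(x) + self.pos(positions)    # H_0 = X + P
```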

LLaMA replaces this with rotary positional encoding (RoPE).

2.3 Transformer Block (per layer $\ell$)

Each layer contains:

  • Multi-head causal self-attention
  • Feed-forward network (FFN)
  • Residual connections and LayerNorm (post-norm)

a) Causal Self-Attention

For each head:

\[Q = H_\ell W^Q, \quad K = H_\ell W^K, \quad V = H_\ell W^V\]

Apply scaled dot-product attention with causal mask $M$:

\[\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^\top}{\sqrt{d_k}} + M \right)V\]

Concatenate heads and apply output projection:

\[\text{MultiHead}(H_\ell) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O\]
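The head splitting, causal mask, and output projection fit in a few lines. A minimal sketch, with illustrative module and variable names rather than a reference implementation:

```python
import math
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        self.wq = nn.Linear(d_model, d_model, bias=False)  # W^Q
        self.wk = nn.Linear(d_model, d_model, bias=False)  # W^K
        self.wv = nn.Linear(d_model, d_model, bias=False)  # W^V
        self.wo = nn.Linear(d_model, d_model, bias=False)  # W^O

    def forward(self, h):                                    # h: (batch, n, d_model)
        b, n, _ = h.shape
        split = lambda t: t.view(b, n, self.n_heads, self.d_k).transpose(1, 2)
        q, k, v = split(self.wq(h)), split(self.wk(h)), split(self.wv(h))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        mask = torch.triu(torch.full((n, n), float("-inf"), device=h.device), diagonal=1)
        attn = torch.softmax(scores + mask, dim=-1)          # causal mask M blocks j > i
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)   # concatenate heads
        return self.wo(out)                                  # output projection W^O
```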

b) Feed-Forward Network (FFN)

Standard FFN uses GELU activation:

\[\text{FFN}(x) = W_2 \cdot \text{GELU}(W_1 x)\]

c) Residual & LayerNorm (Post-Norm)

The original GPT applies LayerNorm after each sublayer (post-norm):

\[x = \text{LayerNorm}(x + \text{Sublayer}(x))\]

GPT-2 and GPT-3 moved LayerNorm to the input of each sublayer (pre-norm), a convention LLaMA also follows.

2.4 Output Layer

Final hidden state $H_L \in \mathbb{R}^{n \times d_{\text{model}}}$ is projected to logits over the vocabulary:

\[\text{logits} = H_L \cdot E^\top\]

GPT uses weight tying: same $E$ is used for input embedding and output projection.
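A small sketch of weight tying (the sizes below are illustrative):

```python
import torch
import torch.nn as nn

V, d_model = 32000, 768                     # illustrative sizes
emb = nn.Embedding(V, d_model)              # input embedding E
lm_head = nn.Linear(d_model, V, bias=False)
lm_head.weight = emb.weight                 # tie: output projection reuses E

h_L = torch.randn(2, 8, d_model)            # stand-in for final hidden states
logits = lm_head(h_L)                       # equivalent to h_L @ E^T, shape (2, 8, V)
```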

LLaMA: Open and Efficient Foundation Language Models

Now we move to LLaMA, which is based on the GPT architecture but with some modifications and optimizations.

LLaMA is a decoder-only Transformer model, similar to the original GPT architecture. It is designed for autoregressive language modeling, meaning it predicts the next token in a sequence given the previous tokens. The architecture consists of multiple layers of Transformer blocks, each containing self-attention and feed-forward networks.

1. Overview

LLaMA (Large Language Model Meta AI) is a family of decoder-only autoregressive Transformers designed to be compute-efficient, open, and competitive with much larger models like GPT-3 and PaLM.

  • Architecture: GPT-style, decoder-only Transformer
  • Objective: Causal language modeling
  • Sizes: 7B, 13B, 33B, 65B
  • Training tokens: 1T (7B/13B), 1.4T (33B/65B)
  • Data: Only public datasets (CommonCrawl, Wikipedia, ArXiv, GitHub, etc.)
  • Innovations: RoPE, SwiGLU, RMSNorm, Chinchilla-inspired scaling

2. Core Architecture

LLaMA uses a stack of decoder-only Transformer blocks with architectural improvements. Each Transformer block consists of:

  • Pre-normalized input
  • Rotary Positional Embedding (RoPE)
  • Causal Self-Attention
  • SwiGLU Feedforward Network
  • Residual connections
  • RMSNorm

2.1 Token and Positional Embeddings

Traditional Transformers add absolute positional embeddings to the input token embeddings:

\[\text{Input} = E_{\text{token}} + E_{\text{position}}\]

This is done in:

  • Transformer (Vaswani et al., 2017): sinusoidal
  • GPT-2/3: learned position embeddings

2.1.1 Problem with Absolute Positional Embeddings

  • Fixed sinusoidal embeddings are not learnable
  • Learned embeddings are tied to max training length
  • Both encode absolute positions but not relative distances

2.1.2 RoPE: Rotary Positional Embedding

LLaMA uses Rotary Positional Embedding (RoPE), introduced by Su et al., 2021.

Instead of adding positions to inputs, RoPE rotates the query and key vectors in attention:

\[Q_{\text{rot}} = \text{RoPE}(Q), \quad K_{\text{rot}} = \text{RoPE}(K)\]

Each 2D subspace is rotated as:

\[\text{RoPE}\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \theta \right) = R(\theta) \cdot \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} x_1 \cos \theta - x_2 \sin \theta \\ x_1 \sin \theta + x_2 \cos \theta \end{bmatrix}\]

This rotation encodes position directly into the dot product:

\[A_{ij} = \langle \text{RoPE}(Q_i), \text{RoPE}(K_j) \rangle\]

2.1.3 What is $R(i)$ in RoPE?

Each query/key vector $x \in \mathbb{R}^d$ is split into $d/2$ 2D components:

\[x = \left[ \begin{bmatrix} x_0 \\ x_1 \end{bmatrix}, \begin{bmatrix} x_2 \\ x_3 \end{bmatrix}, \dots \right]\]

RoPE rotates each pair using a standard 2D rotation matrix:

\[R_k(i) = \begin{bmatrix} \cos(\theta_i^{(k)}) & -\sin(\theta_i^{(k)}) \\ \sin(\theta_i^{(k)}) & \cos(\theta_i^{(k)}) \end{bmatrix}\]

Where the angle depends on the position $i$ and dimension $k$:

\[\theta_i^{(k)} = \frac{i}{10000^{2k/d}}\]

The full $R(i)$ is a block-diagonal matrix:

\[R(i) = \text{diag}(R_0(i), R_1(i), \dots, R_{d/2-1}(i))\]

RoPE applies, to the $k$-th pair at position $i$:

\[\text{RoPE}(x)_k = \begin{bmatrix} x_{2k} \cos \theta_i^{(k)} - x_{2k+1} \sin \theta_i^{(k)} \\ x_{2k} \sin \theta_i^{(k)} + x_{2k+1} \cos \theta_i^{(k)} \end{bmatrix}\]

This rotation preserves vector norms and, as shown next, makes the attention score depend only on relative position.
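A minimal RoPE sketch that rotates even/odd dimension pairs of a query or key tensor. The pairing convention and the function name are illustrative; production implementations differ in how pairs are laid out and how the angles are cached:

```python
import torch

def rope(x, base=10000.0):
    """x: (batch, n, d) queries or keys, d even. Rotates pair k at position i by theta_i^{(k)}."""
    b, n, d = x.shape
    k = torch.arange(d // 2, device=x.device, dtype=torch.float32)
    theta = base ** (-2 * k / d)                    # frequencies 1 / 10000^{2k/d}
    pos = torch.arange(n, device=x.device, dtype=torch.float32)
    angles = torch.outer(pos, theta)                # (n, d/2): theta_i^{(k)} = i / 10000^{2k/d}
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]             # (x_{2k}, x_{2k+1}) pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin            # rotate each 2D pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```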

2.1.4 Why is RoPE relative?

Let $q_i$ and $k_j$ be the query and key vectors at positions $i$ and $j$.

RoPE rotates them with position-dependent matrices $R(i)$ and $R(j)$:

\[\text{RoPE}(q_i) = R(i) \cdot q_i, \quad \text{RoPE}(k_j) = R(j) \cdot k_j\]

Then the attention score (omitting the $1/\sqrt{d_k}$ scaling for clarity) becomes:

\[A_{ij} = \langle R(i) q_i, R(j) k_j \rangle = \langle q_i, R(i)^\top R(j) k_j \rangle\]

Let:

\[R(i, j) := R(i)^\top R(j)\]

Then:

\[A_{ij} = \langle q_i, R(i, j) k_j \rangle\]

From sinusoidal construction, $R(i)^\top R(j)$ depends only on the relative offset:

\[R(i)^\top R(j) = R(j - i)\]

So the attention score is:

\[A_{ij} = \langle q_i, R(j - i) k_j \rangle\]

This proves that RoPE encodes relative position — the attention depends only on the difference $j - i$, not on $i$ or $j$ individually.
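This identity is easy to verify numerically on a single 2D pair; a quick NumPy check with arbitrary illustrative values:

```python
import numpy as np

def R(theta):                                     # standard 2D rotation matrix
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

q = np.array([0.3, -1.2])                         # one 2D slice of a query
k = np.array([0.7, 0.5])                          # one 2D slice of a key
theta, i, j = 0.01, 5, 9                          # one RoPE frequency, two positions

lhs = np.dot(R(i * theta) @ q, R(j * theta) @ k)  # <R(i) q, R(j) k>
rhs = np.dot(q, R((j - i) * theta) @ k)           # <q, R(j - i) k>
assert np.allclose(lhs, rhs)                      # the score depends only on j - i
```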

2.1.5 Advantages of RoPE

| Feature | Absolute (sin/cos or learned) | RoPE |
| --- | --- | --- |
| Relative position modeling | ❌ No | ✅ Yes |
| Extrapolation to long sequences | ❌ Poor | ✅ Good |
| Parameter count | 0 (fixed) or learned | 0 (rotation only) |
| Efficiency | ✅ Yes | ✅ Yes |
| Used in | Transformer, GPT-3 | LLaMA, ChatGLM, GPT-NeoX |

2.2 Transformer Block (Layer $\ell$)

Each of the $L$ layers is defined as:

\[x \leftarrow x + \text{Attention}(\text{RMSNorm}(x))\]

\[x \leftarrow x + \text{FFN}(\text{RMSNorm}(x))\]

2.2.1 GPT-Style Autoregressive Decoder

(1) Original Transformer Decoder (Vaswani et al., 2017)

In encoder-decoder Transformers (e.g., for translation), the decoder has:

  1. Self-attention (causal masked)
  2. Cross-attention (attends to encoder outputs)
  3. Feedforward network (FFN)

Each block:

\[x = x + \text{SelfAttention}(\text{Norm}(x))\]

\[x = x + \text{CrossAttention}(\text{Norm}(x))\]

\[x = x + \text{FFN}(\text{Norm}(x))\]

(2) GPT Simplifies to Decoder-Only

GPT removes the encoder and cross-attention:

  • No encoder context
  • Only causal self-attention and FFN

Each GPT block:

\[x = x + \text{SelfAttention}(\text{Norm}(x))\]

\[x = x + \text{FFN}(\text{Norm}(x))\]

(3) Autoregressive Causal Mask

GPT is trained to predict the next token:

\[\mathcal{L} = \sum_{t=1}^{n} \log P(x_t \mid x_{<t})\]

Attention mask:

\[\text{Mask}_{i,j} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}\]

Prevents attending to future tokens.
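The mask follows directly from this definition; for example, with a short sequence:

```python
import torch

n = 5
# Mask[i, j] = 0 if j <= i, -inf if j > i: block the strict upper triangle.
mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
print(mask)   # row i can attend only to columns 0..i
```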

(4) Summary Table

| Component | Encoder-Decoder Transformer | GPT-Style Decoder |
| --- | --- | --- |
| Cross-attention | ✅ Yes | ❌ No |
| Encoder needed | ✅ Yes | ❌ No |
| Causal masking | ✅ Yes | ✅ Yes |
| Direction | Bidirectional (via encoder) | Left-to-right only |
| Output | Full sequence | One token at a time |
| Used in | Original Transformer, T5 (BERT is encoder-only) | GPT-2/3, LLaMA |

GPT-style decoder is simpler, autoregressive, and highly scalable — the foundation of LLaMA.

2.2.2 RMSNorm

(1) Original: Layer Normalization

Before RMSNorm, Transformer activations were normalized with LayerNorm (applied in a post-norm position in the original Transformer).

LayerNorm (Ba et al., 2016) is used in both encoder and decoder of the original Transformer and in GPT-2/GPT-3. It normalizes activations per token across hidden dimensions.

For a vector $x \in \mathbb{R}^d$, LayerNorm is:

\[\text{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta\]

Where:

  • $\mu = \frac{1}{d} \sum_{i=1}^{d} x_i$, mean of the vector
  • $\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2$, variance of the vector
  • $\gamma$, $\beta$ are learnable affine parameters
  • $\epsilon$ is a small constant for numerical stability
  • $d$ is the hidden dimension size
(2) Problem with LayerNorm
  • Requires both mean and variance computation
  • Sensitive to numerical precision (especially in bfloat16)
  • Slower and more complex
(3) RMSNorm

LLaMA instead applies RMSNorm in a pre-norm position.

RMSNorm (Zhang & Sennrich, 2019) drops the mean subtraction and normalizes by the root mean square alone:

\[\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d} \sum_{i=1}^{d} x_i^2 + \epsilon}} \cdot \gamma\]
  • No centering
  • Only one learned parameter $\gamma$
  • Used in pre-norm form in LLaMA
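A minimal RMSNorm module matching the formula above (class name and defaults are illustrative):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(d))   # single learned scale, no bias

    def forward(self, x):                          # x: (..., d)
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x / rms * self.gamma                # no mean subtraction
```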
(4) Advantages of RMSNorm

| Feature | LayerNorm | RMSNorm |
| --- | --- | --- |
| Mean subtraction | ✅ Yes | ❌ No |
| Variance normalization | ✅ Yes | ❌ No (uses RMS only) |
| Trainable parameters | $\gamma$, $\beta$ | $\gamma$ only |
| Efficiency | ❌ Slower | ✅ Faster |
| Stability (bfloat16) | ❌ May cause instability | ✅ More numerically stable |
| Simplicity | ❌ More complex statistics | ✅ Simpler |
(5) Post-Norm and Pre-Norm

Post-Norm, used in the original Transformer and GPT-1.

Normalization is applied after each sublayer:

\[x = \text{LayerNorm}(x + \text{Sublayer}(x))\]

Pre-Norm, used in GPT-2/3, LLaMA, T5, and other models.

Normalization is applied before each sublayer:

\[x = x + \text{Sublayer}(\text{Norm}(x))\]

where $\text{Norm}(x)$ in LLaMA is RMSNorm,

\[x = x + \text{Attention}(\text{RMSNorm}(x))\]

\[x = x + \text{FFN}(\text{RMSNorm}(x))\]
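Putting the two pre-norm steps together, a LLaMA-style block looks roughly like the sketch below; `attention` and `ffn` stand for modules such as the ones sketched elsewhere in this post, and `RMSNorm` is the module from the previous sketch (recent PyTorch versions also ship `torch.nn.RMSNorm`):

```python
import torch.nn as nn

class LLaMABlock(nn.Module):
    def __init__(self, d_model, attention, ffn):
        super().__init__()
        self.attn_norm = RMSNorm(d_model)           # pre-norm before attention
        self.ffn_norm = RMSNorm(d_model)            # pre-norm before the FFN
        self.attention, self.ffn = attention, ffn

    def forward(self, x):
        x = x + self.attention(self.attn_norm(x))   # x <- x + Attention(RMSNorm(x))
        x = x + self.ffn(self.ffn_norm(x))          # x <- x + FFN(RMSNorm(x))
        return x
```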

Why was Post-Norm used originally?

  • The original Transformer (Vaswani et al., 2017) used post-norm with LayerNorm, which worked fine for shallow models (6 layers).
  • GPT-1 followed the same convention; GPT-2 and GPT-3 moved LayerNorm to the input of each sub-block (pre-norm).

But as models became deeper, post-norm showed problems.

Why Is Pre-Norm Better for Deep Transformers?

  1. Improved Gradient Flow

    • Pre-Norm allows gradients to propagate more directly through residual paths.
    • Enables training of deeper networks (e.g., 100+ layers) without vanishing gradients.
  2. Better Stability for Mixed-Precision Training

    • Post-Norm can have large intermediate values before normalization — problematic in bfloat16/fp16.
    • Pre-Norm constrains activations earlier, improving numerical robustness.
  3. More Robust Optimization

    • Pre-Norm works better with AdamW and weight decay.
    • Smooths the training dynamics, especially for large-scale models like LLaMA.

Summary:

LLaMA uses RMSNorm + pre-norm to ensure stable training at scale, better gradient flow, and numerical robustness under mixed precision. This is more effective than the original Transformer’s post-norm LayerNorm, especially for deep and large autoregressive models.

| Feature | Post-Norm | Pre-Norm |
| --- | --- | --- |
| Equation | $x = \text{Norm}(x + \text{Sublayer}(x))$ | $x = x + \text{Sublayer}(\text{Norm}(x))$ |
| Used in | Original Transformer, GPT-1, BERT | GPT-2/3, LLaMA, GPT-J, T5 |
| Gradient flow | ❌ Risk of vanishing | ✅ Stable |
| Deep model support | ❌ Limited | ✅ Supports 100+ layers |
| Mixed precision | ❌ Unstable in bfloat16/fp16 | ✅ Robust |
| Pairing with RMSNorm | ❌ Rare | ✅ Ideal |

2.2.3 Self-Attention with RoPE

Causal self-attention is computed as:

\[\text{Attention}(Q, K, V) = \text{softmax} \left( \frac{\text{RoPE}(Q) \cdot \text{RoPE}(K)^\top}{\sqrt{d_k}} + M \right) V\]
  • $Q$, $K$, $V$ are projections of the normalized input.
  • RoPE encodes relative positional information via rotation in complex space.
  • $M$ is the causal mask to prevent attention to future tokens.

2.2.4 Feedforward Network and Activation Function (SwiGLU)

The original Transformer and GPT models use a two-layer feedforward network (FFN) in each block:

\[\text{FFN}(x) = W_2 \cdot \phi(W_1 x)\]

Where $\phi$ is a nonlinear activation function:

  • ReLU in Transformer
  • GELU in GPT-2/3
(1) Original: ReLU and GELU
  • Transformer uses ReLU:
\[\text{ReLU}(x) = \max(0, x)\]
  • GPT-3 uses GELU:
\[\text{GELU}(x) = x \cdot \Phi(x)\]

Where $\Phi(x)$ is the cumulative distribution function (CDF) of a standard Gaussian.

(2) Problem with ReLU / GELU
  • ReLU is simple but can be too sparse and under-expressive for large models.
  • GELU is smoother than ReLU but:
    • Still ungated: lacks multiplicative interaction
    • Still single-branch: doesn’t separate flow of control and content
  • Lacks capacity for fine-grained regulation across dimensions
(3) SwiGLU

LLaMA uses SwiGLU (Shazeer, 2020), a gated activation function combining Swish and GLU:

\[\text{SwiGLU}(x_1, x_2) = \text{Swish}(x_2) \cdot x_1\]

With:

\[\text{Swish}(x) = x \cdot \sigma(x)\]

where $\sigma(x)$ is the sigmoid function:

\[\sigma(x) = \frac{1}{1 + e^{-x}}\]

Figure: Visualization of the SwiGLU activation function compared with other activation functions.

So the FFN becomes:

\[\text{FFN}(x) = W_3 \cdot \left( \text{SwiGLU}(W_1 x, W_2 x) \right)\]

Where:

  • $W_1, W_2$ project $x$ to two intermediate channels
  • $W_3$ projects back to $d_{\text{model}}$
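A minimal SwiGLU FFN in the same notation; `F.silu` is the Swish activation with $\beta = 1$, and the class and parameter names are illustrative:

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model, d_ffn):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ffn, bias=False)    # content branch (W_1)
        self.w2 = nn.Linear(d_model, d_ffn, bias=False)    # gate branch (W_2)
        self.w3 = nn.Linear(d_ffn, d_model, bias=False)    # projection back to d_model (W_3)

    def forward(self, x):
        # FFN(x) = W_3 * SwiGLU(W_1 x, W_2 x), with SwiGLU(a, b) = Swish(b) * a
        return self.w3(self.w1(x) * F.silu(self.w2(x)))
```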
(4) Advantages of SwiGLU

| Feature | ReLU / GELU | SwiGLU |
| --- | --- | --- |
| Gating | ❌ None | ✅ Yes (learned) |
| Interaction | ❌ Element-wise only | ✅ Multiplicative, gated |
| Gradient flow | ❌ Sparse (ReLU) | ✅ Smooth (Swish) |
| Expressiveness | ❌ Limited | ✅ High |
| Used in | GPT, BERT, original Transformer | LLaMA, PaLM (T5 v1.1 uses the related GEGLU) |

2.3 Chinchilla Scaling Laws

2.3.1 Original Assumption: Bigger = Better

Earlier trends (e.g., GPT-3) focused on scaling model size:

  • GPT-3: 175B parameters, trained on only ~300B tokens
  • This assumed more parameters → better performance

2.3.2 Problem with GPT-3 Scaling

  • GPT-3 was undertrained
  • Too few tokens for such a large model
  • Result: suboptimal performance, inefficient compute

2.3.3 Chinchilla Scaling Law

Hoffmann et al. (2022) proposed:

Optimal performance = smaller model + more data

Their compute-optimal analysis finds that model size and training data should be scaled together, each growing roughly as the square root of the compute budget:

\[N_{\text{opt}} \propto C^{0.5}, \quad D_{\text{opt}} \propto C^{0.5}\]

In practice this works out to roughly 20 training tokens per parameter ($D \approx 20N$).

Where:

  • $N$ = model parameters
  • $D$ = training tokens
  • $C$ = training compute (FLOPs)

2.3.4 LLaMA Follows Chinchilla

| Model | Parameters | Training Tokens |
| --- | --- | --- |
| LLaMA-7B | 7B | 1.0T |
| LLaMA-13B | 13B | 1.0T |
| LLaMA-33B | 33B | 1.4T |
| LLaMA-65B | 65B | 1.4T |
  • LLaMA is trained longer, not just made bigger; the smaller models are deliberately trained past the compute-optimal token count to improve quality at a fixed inference budget
  • LLaMA-13B outperforms GPT-3 (175B) on many benchmarks

2.3.5 Benefits of Chinchilla-style Scaling

| Feature | GPT-style (Kaplan) | Chinchilla-style (LLaMA) |
| --- | --- | --- |
| Assumes | Bigger = better | Data ↔ size balance |
| Token count | ~300B | 1.0T – 1.4T |
| Compute efficiency | ❌ Suboptimal | ✅ Compute-optimal |
| Sample efficiency | ❌ Lower | ✅ Higher |
| Model size vs. quality | ❌ Larger model for the same quality | ✅ Smaller model, equal or better quality |

2.4 Output Projection

After the final layer, the output is projected to vocabulary logits:

\[\text{logits} = H_L \cdot W_{\text{out}}^\top\]

Where:

  • $H_L \in \mathbb{R}^{n \times d_{\text{model}}}$ is the final hidden state.
  • $W_{\text{out}} \in \mathbb{R}^{V \times d_{\text{model}}}$ is the output projection matrix. GPT ties this matrix to the input embedding $E$; LLaMA learns a separate output matrix.

3. Model Sizes

| Model | Layers ($L$) | Hidden dim ($d_{\text{model}}$) | Heads | FFN dim ($d_{\text{ffn}}$) | Parameters |
| --- | --- | --- | --- | --- | --- |
| LLaMA-7B | 32 | 4096 | 32 | 11008 | 7B |
| LLaMA-13B | 40 | 5120 | 40 | 13824 | 13B |
| LLaMA-33B | 60 | 6656 | 52 | 17920 | 33B |
| LLaMA-65B | 80 | 8192 | 64 | 22016 | 65B |

The FFN dimension uses approximately:

\[d_{\text{ffn}} \approx \frac{2}{3} \cdot 4d_{\text{model}}\]
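For $d_{\text{model}} = 4096$ this gives $\frac{2}{3} \cdot 4 \cdot 4096 \approx 10923$; rounding up to a multiple of 256, as the released LLaMA code does, reproduces the 11008 in the table. A quick check:

```python
def llama_ffn_dim(d_model, multiple_of=256):
    d_ffn = int(2 * (4 * d_model) / 3)                               # 2/3 * 4 * d_model
    return multiple_of * ((d_ffn + multiple_of - 1) // multiple_of)  # round up to a multiple of 256

print(llama_ffn_dim(4096), llama_ffn_dim(5120), llama_ffn_dim(8192))  # 11008 13824 22016
```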

4. Training Setup

  • Optimizer: AdamW
    • $\beta_1 = 0.9$, $\beta_2 = 0.95$
    • Weight decay = 0.1
  • Learning rate schedule: cosine decay with warmup
  • Sequence length: 2048 tokens
  • Precision: bfloat16
  • Memory and compute optimization:
    • Activation checkpointing
    • FlashAttention or xformers-style attention kernels
    • Tensor parallelism and sequence parallelism

5. Summary Formula

\[\text{LLaMA} = \text{Decoder-only Transformer} + \text{RoPE} + \text{SwiGLU} + \text{RMSNorm} + \text{Chinchilla Scaling}\]

6. References

  • Vaswani et al., 2017. “Attention Is All You Need.”
  • Ba et al., 2016. “Layer Normalization.”
  • Zhang & Sennrich, 2019. “Root Mean Square Layer Normalization.”
  • Shazeer, 2020. “GLU Variants Improve Transformer.”
  • Su et al., 2021. “RoFormer: Enhanced Transformer with Rotary Position Embedding.”
  • Hoffmann et al., 2022. “Training Compute-Optimal Large Language Models.”
  • Touvron et al., 2023. “LLaMA: Open and Efficient Foundation Language Models.”