LLaMA: Open and Efficient Foundation Language Models
LLaMA (Large Language Model Meta AI) is a series of foundational language models developed by Meta AI. The LLaMA models are designed to be efficient and effective for a wide range of natural language processing tasks. The LLaMA family includes models with various parameter sizes, allowing researchers and developers to choose the model that best fits their needs in terms of performance and computational resources.
Let’s take a look at the architecture of LLaMA and how it builds upon the original GPT architecture.
GPT-style Decoder-Only Transformer: A Technical Overview
1. Architecture Overview
The GPT architecture is a stacked decoder-only Transformer trained for causal language modeling. Unlike the original Transformer, which includes both an encoder and a decoder, GPT uses only the decoder stack, modified with causal (autoregressive) masking to prevent information flow from future tokens.
- Input: A sequence of token indices $x = [x_1, \dots, x_n]$
- Output: A probability distribution over the vocabulary for the next token $x_{t+1}$
- Objective: Maximize the log-likelihood $\sum_t \log P(x_t \mid x_{<t})$ of each token given its preceding context
2. Model Components
2.1 Token Embedding
Each input token $x_t$ is embedded into a $d_{\text{model}}$-dimensional vector using a learned embedding matrix $E \in \mathbb{R}^{V \times d_{\text{model}}}$:
\[\mathbf{X} \in \mathbb{R}^{n \times d_{\text{model}}}\]
2.2 Positional Encoding
GPT uses learned absolute positional embeddings:
\[\mathbf{P} \in \mathbb{R}^{n \times d_{\text{model}}}\]
Final input to the first layer:
\[\mathbf{H}_0 = \mathbf{X} + \mathbf{P}\]
LLaMA replaces this with rotary positional encoding (RoPE).
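To make this concrete, here is a minimal PyTorch sketch of the GPT-style input layer (learned token and position embeddings, summed). The class name and sizes are illustrative, not the actual GPT values.

```python
import torch
import torch.nn as nn

class GPTInputEmbedding(nn.Module):
    """Token embedding + learned absolute positional embedding (GPT-style)."""
    def __init__(self, vocab_size: int, d_model: int, max_len: int):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)   # E in R^{V x d_model}
        self.pos = nn.Embedding(max_len, d_model)      # P in R^{max_len x d_model}

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n) token indices
        n = x.size(1)
        positions = torch.arange(n, device=x.device)   # [0, 1, ..., n-1]
        return self.tok(x) + self.pos(positions)       # H_0 = X + P

# Illustrative usage with toy sizes
emb = GPTInputEmbedding(vocab_size=100, d_model=32, max_len=128)
h0 = emb(torch.randint(0, 100, (2, 10)))               # shape (2, 10, 32)
```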
2.3 Transformer Block (per layer $\ell$)
Each layer contains:
- Multi-head causal self-attention
- Feed-forward network (FFN)
- Residual connections and LayerNorm (post-norm)
a) Causal Self-Attention
For each head:
\[Q = H_\ell W^Q, \quad K = H_\ell W^K, \quad V = H_\ell W^V\]
Apply scaled dot-product attention with causal mask $M$:
\[\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^\top}{\sqrt{d_k}} + M \right)V\]
Concatenate heads and apply output projection:
\[\text{MultiHead}(H_\ell) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O\]
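To make the shapes concrete, here is a simplified PyTorch sketch of multi-head causal self-attention following the equations above (no dropout or KV cache; dimensions and module names are illustrative, not GPT's actual implementation).

```python
import math
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    """Multi-head causal self-attention: softmax(QK^T / sqrt(d_k) + M) V."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)  # W^O

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        b, n, d = h.shape
        # (b, n, d) -> (b, heads, n, d_k)
        def split(t):
            return t.view(b, n, self.n_heads, self.d_k).transpose(1, 2)
        q, k, v = split(self.q_proj(h)), split(self.k_proj(h)), split(self.v_proj(h))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        # Causal mask M: -inf above the diagonal (future positions)
        future = torch.triu(torch.ones(n, n, dtype=torch.bool, device=h.device), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))
        out = torch.softmax(scores, dim=-1) @ v
        out = out.transpose(1, 2).contiguous().view(b, n, d)   # concatenate heads
        return self.o_proj(out)
```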
b) Feed-Forward Network (FFN)
Standard FFN uses GELU activation:
\[\text{FFN}(x) = W_2 \cdot \text{GELU}(W_1 x)\]
c) Residual & LayerNorm (Post-Norm)
GPT uses post-normalization:
\[x = \text{LayerNorm}(x + \text{Sublayer}(x))\]
2.4 Output Layer
Final hidden state $H_L \in \mathbb{R}^{n \times d_{\text{model}}}$ is projected to logits over the vocabulary:
\[\text{logits} = H_L \cdot E^\top\]
GPT uses weight tying: the same $E$ is used for the input embedding and the output projection.
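In code, weight tying just reuses the embedding matrix as the output projection; a minimal illustrative sketch (names and sizes are toy values):

```python
import torch.nn as nn

d_model, vocab_size = 32, 100
embedding = nn.Embedding(vocab_size, d_model)          # E
lm_head = nn.Linear(d_model, vocab_size, bias=False)   # logits = H_L @ E^T
lm_head.weight = embedding.weight                      # tie: same parameter object
```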
LLaMA: Open and Efficient Foundation Language Models
Now we move to LLaMA, which is based on the GPT architecture but with some modifications and optimizations.
LLaMA is a decoder-only Transformer model, similar to the original GPT architecture. It is designed for autoregressive language modeling, meaning it predicts the next token in a sequence given the previous tokens. The architecture consists of multiple layers of Transformer blocks, each containing self-attention and feed-forward networks.
1. Overview
LLaMA (Large Language Model Meta AI) is a family of decoder-only autoregressive Transformers designed to be compute-efficient, open, and competitive with much larger models like GPT-3 and PaLM.
- Architecture: GPT-style, decoder-only Transformer
- Objective: Causal language modeling
- Sizes: 7B, 13B, 33B, 65B
- Training tokens: 1T (7B/13B), 1.4T (33B/65B)
- Data: Only public datasets (CommonCrawl, Wikipedia, ArXiv, GitHub, etc.)
- Innovations: RoPE, SwiGLU, RMSNorm, Chinchilla-inspired scaling
2. Core Architecture
LLaMA uses a stack of decoder-only Transformer blocks with architectural improvements. Each Transformer block consists of:
- Pre-normalized input
- Rotary Positional Embedding (RoPE)
- Causal Self-Attention
- SwiGLU Feedforward Network
- Residual connections
- RMSNorm
2.1 Token and Positional Embeddings
Traditional Transformers add absolute positional embeddings to the input token embeddings:
\[\text{Input} = E_{\text{token}} + E_{\text{position}}\]
This is done in:
- Transformer (Vaswani et al., 2017): sinusoidal
- GPT-2/3: learned position embeddings
2.1.1 Problem with Absolute Positional Embeddings
- Fixed sinusoidal embeddings are not learnable
- Learned embeddings are tied to max training length
- Both encode absolute positions but not relative distances
2.1.2 RoPE: Rotary Positional Embedding
LLaMA uses Rotary Positional Embedding (RoPE), introduced by Su et al., 2021.
Instead of adding positions to inputs, RoPE rotates the query and key vectors in attention:
\[Q_{\text{rot}} = \text{RoPE}(Q), \quad K_{\text{rot}} = \text{RoPE}(K)\]
Each 2D subspace is rotated as:
\[\text{RoPE}\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \theta \right) = R(\theta) \cdot \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} x_1 \cos \theta - x_2 \sin \theta \\ x_1 \sin \theta + x_2 \cos \theta \end{bmatrix}\]This rotation encodes position directly into the dot product:
\[A_{ij} = \langle \text{RoPE}(Q_i), \text{RoPE}(K_j) \rangle\]
2.1.3 What is $R(i)$ in RoPE?
Each query/key vector $x \in \mathbb{R}^d$ is split into $d/2$ 2D components:
\[x = \left[ \begin{bmatrix} x_0 \\ x_1 \end{bmatrix}, \begin{bmatrix} x_2 \\ x_3 \end{bmatrix}, \dots \right]\]
RoPE rotates each pair using a standard 2D rotation matrix:
\[R_k(i) = \begin{bmatrix} \cos(\theta_i^{(k)}) & -\sin(\theta_i^{(k)}) \\ \sin(\theta_i^{(k)}) & \cos(\theta_i^{(k)}) \end{bmatrix}\]
Where the angle depends on the position $i$ and dimension $k$:
\[\theta_i^{(k)} = \frac{i}{10000^{2k/d}}\]
The full $R(i)$ is a block-diagonal matrix:
\[R(i) = \text{diag}(R_0(i), R_1(i), \dots, R_{d/2-1}(i))\]
RoPE applies:
\[\text{RoPE}(x)_k = \begin{bmatrix} x_{2k} \cos \theta_i^{(k)} - x_{2k+1} \sin \theta_i^{(k)} \\ x_{2k} \sin \theta_i^{(k)} + x_{2k+1} \cos \theta_i^{(k)} \end{bmatrix}\]
This rotation preserves vector norms, and because the angle grows linearly with position, it encodes relative position directly in the attention dot product.
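Below is a minimal PyTorch sketch of RoPE following these equations, operating on a (batch, heads, seq, head_dim) tensor. It is written for clarity rather than matching LLaMA's exact implementation.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding.

    x: (..., n, d) query or key vectors for one head; d must be even.
    Each pair (x_{2k}, x_{2k+1}) at position i is rotated by theta = i / base^(2k/d).
    """
    n, d = x.shape[-2], x.shape[-1]
    two_k = torch.arange(0, d, 2, dtype=torch.float32, device=x.device)   # 0, 2, ..., d-2
    inv_freq = base ** (-two_k / d)                                       # 1 / base^(2k/d)
    pos = torch.arange(n, dtype=torch.float32, device=x.device)           # positions i
    theta = pos[:, None] * inv_freq[None, :]                              # (n, d/2)
    cos, sin = theta.cos(), theta.sin()
    x_even, x_odd = x[..., 0::2], x[..., 1::2]                            # x_{2k}, x_{2k+1}
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

# Usage: rotate queries and keys before the attention dot product
q = torch.randn(2, 8, 16, 64)   # (batch, heads, seq, head_dim)
q_rot = rope(q)
```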
2.1.4 Why is RoPE relative?
Let $q_i$ and $k_j$ be the query and key vectors at positions $i$ and $j$.
RoPE rotates them with position-dependent matrices $R(i)$ and $R(j)$:
\[\text{RoPE}(q_i) = R(i) \cdot q_i, \quad \text{RoPE}(k_j) = R(j) \cdot k_j\]
Then the attention score becomes:
\[A_{ij} = \langle R(i) q_i, R(j) k_j \rangle = \langle q_i, R(i)^\top R(j) k_j \rangle\]
where $A_{ij}$ is the (pre-softmax) attention score between positions $i$ and $j$; without RoPE it would simply be
\[A_{ij} = \text{Attention}(q_i, k_j) = \frac{\langle q_i, k_j \rangle}{\sqrt{d_k}}\]
(the $1/\sqrt{d_k}$ scaling is dropped in the rotated expressions for clarity). Let:
\[R(i, j) := R(i)^\top R(j)\]
Then:
\[A_{ij} = \langle q_i, R(i, j) k_j \rangle\]
Because each block of $R(i)$ is a rotation by an angle linear in the position $i$, the product $R(i)^\top R(j)$ depends only on the relative offset:
\[R(i)^\top R(j) = R(j - i)\]
So the attention score is:
\[A_{ij} = \langle q_i, R(j - i) k_j \rangle\]
This shows that RoPE encodes relative position: the positional dependence of the attention score comes only through the offset $j - i$, not through $i$ or $j$ individually.
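This identity is easy to verify numerically for a single 2D pair; the following self-contained sketch uses an arbitrary frequency and random vectors.

```python
import math
import torch

def rot(theta: float) -> torch.Tensor:
    """2D rotation matrix R(theta)."""
    return torch.tensor([[math.cos(theta), -math.sin(theta)],
                         [math.sin(theta),  math.cos(theta)]])

q, k = torch.randn(2), torch.randn(2)
i, j, freq = 5, 9, 0.3                            # positions and an arbitrary frequency
lhs = (rot(i * freq) @ q) @ (rot(j * freq) @ k)   # <R(i) q, R(j) k>
rhs = q @ (rot((j - i) * freq) @ k)               # <q, R(j - i) k>
assert torch.allclose(lhs, rhs, atol=1e-5)        # positional dependence only via j - i
```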
2.1.5 Advantages of RoPE
Feature | Absolute (sin/cos or learned) | RoPE |
---|---|---|
Relative position modeling | ❌ No | ✅ Yes |
Extrapolation to long seqs | ❌ Poor | ✅ Good |
Parameter count | ✅ 0 (fixed) or ❌ learned | ✅ 0 (rotation) |
Efficiency | ✅ Yes | ✅ Yes |
Used in | Transformer, GPT-3 | ✅ LLaMA, ChatGLM, GPT-NeoX |
2.2 Transformer Block (Layer $\ell$)
Each of the $L$ layers is defined as:
\[x \leftarrow x + \text{Attention}(\text{RMSNorm}(x))\]
\[x \leftarrow x + \text{FFN}(\text{RMSNorm}(x))\]
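In code, each layer is two pre-norm residual updates. A schematic sketch is shown below; in LLaMA the norm modules would be RMSNorm and the feed-forward a SwiGLU FFN (both detailed in the subsections that follow), so the sublayers are passed in rather than hard-coded here.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm residual block: x <- x + sublayer(norm(x)), applied twice per layer."""
    def __init__(self, norm1: nn.Module, attn: nn.Module, norm2: nn.Module, ffn: nn.Module):
        super().__init__()
        self.norm1, self.attn, self.norm2, self.ffn = norm1, attn, norm2, ffn

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.norm1(x))   # x <- x + Attention(RMSNorm(x))
        x = x + self.ffn(self.norm2(x))    # x <- x + FFN(RMSNorm(x))
        return x
```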
2.2.1 GPT-Style Autoregressive Decoder
(1) Original Transformer Decoder (Vaswani et al., 2017)
In encoder-decoder Transformers (e.g., for translation), the decoder has:
- Self-attention (causal masked)
- Cross-attention (attends to encoder outputs)
- Feedforward network (FFN)
Each block:
\[x = x + \text{SelfAttention}(\text{Norm}(x))\]
\[x = x + \text{CrossAttention}(\text{Norm}(x))\]
\[x = x + \text{FFN}(\text{Norm}(x))\]
(2) GPT Simplifies to Decoder-Only
GPT removes the encoder and cross-attention:
- No encoder context
- Only causal self-attention and FFN
Each GPT block:
\[x = x + \text{SelfAttention}(\text{Norm}(x))\]
\[x = x + \text{FFN}(\text{Norm}(x))\]
(3) Autoregressive Causal Mask
GPT is trained to predict the next token:
\[\mathcal{L} = \sum_{t=1}^{n} \log P(x_t \mid x_{<t})\]
Attention mask:
\[\text{Mask}_{i,j} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}\]
This prevents attending to future tokens.
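A sketch of how the additive causal mask and the shifted next-token loss are commonly built in PyTorch (illustrative shapes and random tensors; the mask would be added to the attention scores before the softmax):

```python
import torch
import torch.nn.functional as F

n = 5
# Additive mask: 0 where j <= i, -inf where j > i
mask = torch.full((n, n), float("-inf")).triu(diagonal=1)

# Next-token objective: predict x_t from x_{<t}.
vocab, batch = 11, 2
tokens = torch.randint(0, vocab, (batch, n))   # input sequence
logits = torch.randn(batch, n, vocab)          # stand-in for model output
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab),         # predictions for positions 1..n-1
    tokens[:, 1:].reshape(-1),                 # targets shifted by one
)
```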
(4) Summary Table
Component | Encoder-Decoder Transformer | GPT-Style Decoder |
---|---|---|
Cross-Attention | ✅ Yes | ❌ No |
Encoder Needed | ✅ Yes | ❌ No |
Causal Masking | ✅ Yes | ✅ Yes |
Direction | Bidirectional (via encoder) | ✅ Left-to-right only |
Output | Full sequence | ✅ One token at a time |
Used in | T5, original Transformer | ✅ GPT-2/3, LLaMA |
GPT-style decoder is simpler, autoregressive, and highly scalable — the foundation of LLaMA.
2.2.2 RMSNorm
(1) Original: Layer Normalization
Before RMSNorm, Transformers normalized activations with LayerNorm, typically applied after each sublayer (post-norm).
LayerNorm (Ba et al., 2016) is used in both encoder and decoder of the original Transformer and in GPT-2/GPT-3. It normalizes activations per token across hidden dimensions.
For a vector $x \in \mathbb{R}^d$, LayerNorm is:
\[\text{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta\]
Where:
- $\mu = \frac{1}{d} \sum_{i=1}^{d} x_i$, mean of the vector
- $\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2$, variance of the vector
- $\gamma$, $\beta$ are learnable affine parameters
- $\epsilon$ is a small constant for numerical stability
- $d$ is the hidden dimension size
(2) Problem with LayerNorm
- Requires both mean and variance computation
- Sensitive to numerical precision (especially in bfloat16)
- Slower and more complex
(3) RMSNorm
LLaMA instead applies RMSNorm in a pre-norm configuration.
RMSNorm (Zhang & Sennrich, 2019) normalizes by the root mean square alone, without subtracting the mean:
\[\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d} \sum_{i=1}^{d} x_i^2 + \epsilon}} \cdot \gamma\]
- No centering
- Only one learned parameter $\gamma$
- Used in pre-norm form in LLaMA
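A minimal RMSNorm module matching the formula above; this is written for illustration rather than copied from any particular codebase.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """x / sqrt(mean(x^2) + eps) * gamma  -- no mean subtraction, no bias."""
    def __init__(self, d: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(d))   # single learnable scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x / rms * self.gamma
```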
(4) Advantages of RMSNorm
Feature | LayerNorm | RMSNorm |
---|---|---|
Mean subtraction | ✅ Yes | ❌ No |
Variance norm | ✅ Yes | ❌ No (uses RMS only) |
Trainable scale | $\gamma$, $\beta$ | $\gamma$ only |
Efficiency | ❌ Slower | ✅ Faster |
Stability (bfloat) | ❌ May cause instability | ✅ More numerically stable |
Simplicity | ❌ Complex stats | ✅ Simpler |
(5) Post-Norm and Pre-Norm
Post-Norm is used in the original Transformer and GPT-2/3.
Normalization is applied after each sublayer:
\[x = \text{LayerNorm}(x + \text{Sublayer}(x))\]
Pre-Norm is used in LLaMA, T5, and other models.
Normalization is applied before each sublayer:
\[x = x + \text{Sublayer}(\text{Norm}(x))\]
where $\text{Norm}(x)$ in LLaMA is RMSNorm:
\[x = x + \text{Attention}(\text{RMSNorm}(x))\]
\[x = x + \text{FFN}(\text{RMSNorm}(x))\]
Why was Post-Norm used originally?
- The original Transformer (Vaswani et al., 2017) used post-norm with LayerNorm, which worked fine for shallow models (6 layers).
- GPT-2/3 continued this for consistency.
But as models became deeper, post-norm showed problems.
Why Is Pre-Norm Better for Deep Transformers?
- Improved gradient flow
  - Pre-Norm allows gradients to propagate more directly through residual paths.
  - Enables training of deeper networks (e.g., 100+ layers) without vanishing gradients.
- Better stability for mixed-precision training
  - Post-Norm can produce large intermediate values before normalization, which is problematic in bfloat16/fp16.
  - Pre-Norm constrains activations earlier, improving numerical robustness.
- More robust optimization
  - Pre-Norm works better with AdamW and weight decay.
  - It smooths training dynamics, especially for large-scale models like LLaMA.
Summary:
LLaMA uses RMSNorm + pre-norm to ensure stable training at scale, better gradient flow, and numerical robustness under mixed precision. This is more effective than the original Transformer's post-norm LayerNorm, especially for deep and large autoregressive models.
Feature | Post-Norm | Pre-Norm |
---|---|---|
Equation | $x = \text{Norm}(x + \text{Sublayer}(x))$ | $x = x + \text{Sublayer}(\text{Norm}(x))$ |
Used in | Transformer, GPT-2/3 | LLaMA, GPT-J, T5 |
Gradient Flow | ❌ Risk of vanishing | ✅ Stable |
Deep Model Support | ❌ Limited | ✅ Supports 100+ layers |
Mixed Precision | ❌ Unstable in bfloat16/fp16 | ✅ Robust |
Pairing with RMSNorm | ❌ Rare | ✅ Ideal |
2.2.3 Self-Attention with RoPE
Causal self-attention is computed as:
\[\text{Attention}(Q, K, V) = \text{softmax} \left( \frac{\text{RoPE}(Q) \cdot \text{RoPE}(K)^\top}{\sqrt{d_k}} + M \right) V\]
- $Q$, $K$, $V$ are projections of the normalized input.
- RoPE encodes relative positional information via rotation in complex space.
- $M$ is the causal mask to prevent attention to future tokens.
2.2.4 Feedforward Network and Activation Function (SwiGLU)
The original Transformer and GPT models use a two-layer feedforward network (FFN) in each block:
\[\text{FFN}(x) = W_2 \cdot \phi(W_1 x)\]
Where $\phi$ is a nonlinear activation function:
- ReLU in Transformer
- GELU in GPT-2/3
(1) Original: ReLU and GELU
- Transformer uses ReLU:
\[\text{ReLU}(x) = \max(0, x)\]
- GPT-3 uses GELU:
\[\text{GELU}(x) = x \cdot \Phi(x)\]
Where $\Phi(x)$ is the cumulative distribution function (CDF) of a standard Gaussian.
(2) Problem with ReLU / GELU
- ReLU is simple but can be too sparse and under-expressive for large models.
- GELU is smoother than ReLU but:
- Still ungated: lacks multiplicative interaction
- Still single-branch: doesn’t separate flow of control and content
- Lacks capacity for fine-grained regulation across dimensions
(3) SwiGLU
LLaMA uses SwiGLU (Shazeer, 2020), a gated activation function combining Swish and GLU:
\[\text{SwiGLU}(x_1, x_2) = \text{Swish}(x_2) \cdot x_1\]
With:
\[\text{Swish}(x) = x \cdot \sigma(x)\]
where $\sigma(x)$ is the sigmoid function:
\[\sigma(x) = \frac{1}{1 + e^{-x}}\]
Figure: Visualization of the SwiGLU activation function compared to other activation functions.
So the FFN becomes:
\[\text{FFN}(x) = W_3 \cdot \left( \text{SwiGLU}(W_1 x, W_2 x) \right)\]
Where:
- $W_1, W_2$ project $x$ to two intermediate channels
- $W_3$ projects back to $d_{\text{model}}$
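A sketch of this SwiGLU FFN, following the $W_1/W_2/W_3$ naming and gating convention used above (an illustrative implementation, not LLaMA's exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """FFN(x) = W3 * (Swish(W2 x) * (W1 x)), with Swish(x) = x * sigmoid(x) (i.e. SiLU)."""
    def __init__(self, d_model: int, d_ffn: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ffn, bias=False)   # content branch (x_1)
        self.w2 = nn.Linear(d_model, d_ffn, bias=False)   # gate branch (x_2)
        self.w3 = nn.Linear(d_ffn, d_model, bias=False)   # project back to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w3(F.silu(self.w2(x)) * self.w1(x))
```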
(4) Advantages of SwiGLU
Feature | ReLU / GELU | SwiGLU |
---|---|---|
Gating | ❌ None | ✅ Yes (learned) |
Interaction | ❌ Element-wise only | ✅ Multiplicative, gated |
Gradient Flow | ❌ Sparse (ReLU) | ✅ Smooth (Swish) |
Expressiveness | ❌ Limited | ✅ High |
Used in | GPT, BERT, Transformer | ✅ LLaMA, PaLM, T5.1.1 |
2.3 Chinchilla Scaling Laws
2.3.1 Original Assumption: Bigger = Better
Earlier trends (e.g., GPT-3) focused on scaling model size:
- GPT-3: 175B parameters, trained on only ~300B tokens
- This assumed more parameters → better performance
2.3.2 Problem with GPT-3 Scaling
- GPT-3 was undertrained
- Too few tokens for such a large model
- Result: suboptimal performance, inefficient compute
2.3.3 Chinchilla Scaling Law
Hoffmann et al. (2022) proposed:
Optimal performance = smaller model + more data
Scaling law:
\[N \propto D\]
That is, for a fixed compute budget, model parameters and training tokens should be scaled roughly in proportion (on the order of 20 tokens per parameter), where:
- $N$ = model parameters
- $D$ = training tokens
2.3.4 LLaMA Follows Chinchilla
Model | Parameters | Training Tokens |
---|---|---|
LLaMA-7B | 7B | 1.0T |
LLaMA-13B | 13B | 1.0T |
LLaMA-33B | 33B | 1.4T |
LLaMA-65B | 65B | 1.4T |
- LLaMA is trained longer, not just made bigger
- LLaMA-13B outperforms GPT-3 (175B) on many benchmarks
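As a rough worked comparison using the token and parameter counts quoted in this section (GPT-3's ~300B tokens and 175B parameters versus LLaMA's), the tokens-per-parameter ratios look like this:

```python
# Tokens per parameter, using the approximate counts from the table and text above.
models = {"LLaMA-7B": (7e9, 1.0e12), "LLaMA-13B": (13e9, 1.0e12),
          "LLaMA-33B": (33e9, 1.4e12), "LLaMA-65B": (65e9, 1.4e12),
          "GPT-3 175B": (175e9, 300e9)}
for name, (params, tokens) in models.items():
    print(f"{name}: {tokens / params:.0f} tokens per parameter")
# LLaMA-7B ~143, 13B ~77, 33B ~42, 65B ~22; GPT-3 ~2
```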
2.3.5 Benefits of Chinchilla-style Scaling
Feature | GPT-style (Kaplan) | Chinchilla-style (LLaMA) |
---|---|---|
Assumes | Bigger = better | Data ↔ size balance |
Token count | ~300B | 1.0T – 1.4T |
Compute efficiency | ❌ Suboptimal | ✅ Compute-optimal |
Sample efficiency | ❌ Lower | ✅ Higher |
Model size vs quality | ❌ Larger for same quality | ✅ Smaller = better quality |
2.4 Output Projection
After the final layer, the output is projected to vocabulary logits:
\[\text{logits} = H_L \cdot E^\top\]
Where:
- $H_L \in \mathbb{R}^{n \times d_{\text{model}}}$ is the final hidden state.
- $E$ is the shared token embedding matrix (weight tying).
3. Model Sizes
Model | Layers ($L$) | Hidden dim ($d_{\text{model}}$) | Heads | FFN dim ($d_{\text{ffn}}$) | Parameters |
---|---|---|---|---|---|
LLaMA-7B | 32 | 4096 | 32 | 11008 | 7B |
LLaMA-13B | 40 | 5120 | 40 | 13824 | 13B |
LLaMA-33B | 60 | 6656 | 52 | 17920 | 33B |
LLaMA-65B | 80 | 8192 | 64 | 22016 | 65B |
The FFN dimension uses approximately:
\[d_{\text{ffn}} \approx \frac{2}{3} \cdot 4 d_{\text{model}}\]
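As a sanity check against the table, for LLaMA-7B this gives $\frac{2}{3} \cdot 4 \cdot 4096 \approx 10923$, which matches the listed 11008 after rounding up to a multiple of 256 (the rounding granularity is an inference from the listed dimensions, not stated in this section).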
4. Training Setup
- Optimizer: AdamW
  - $\beta_1 = 0.9$, $\beta_2 = 0.95$
  - Weight decay = 0.1
- Learning rate schedule: cosine decay with warmup
- Sequence length: 2048 tokens
- Precision: bfloat16
- Memory and compute optimization:
  - Activation checkpointing
  - FlashAttention or xformers-style attention kernels
  - Tensor parallelism and sequence parallelism
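As an illustration, the optimizer hyperparameters above map onto PyTorch roughly as follows. This is a hedged sketch: `model`, the peak learning rate, and the step counts are placeholders, and the warmup-plus-cosine schedule is a generic approximation rather than the paper's exact schedule.

```python
import math
import torch

model = torch.nn.Linear(8, 8)                            # placeholder for the actual Transformer
max_lr, warmup_steps, total_steps = 3e-4, 2000, 100_000  # illustrative values

optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr,
                              betas=(0.9, 0.95), weight_decay=0.1)

def lr_lambda(step: int) -> float:
    """Linear warmup followed by cosine decay (returns a multiplier on max_lr)."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```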