Building the Brain: Implementing Mini-GPT’s Core Components
Welcome to Part 3 of our Mini-GPT LLM training journey! Now that we’ve set up our environment and data pipeline in Part 2, let’s build the actual model components that make language understanding possible.
- LLM Training Simplified: Building Your First Language Model – 1
- LLM Training Simplified: Building Your First Language Model – 2
- LLM Training Simplified: Building Your First Language Model – 4
The Building Blocks of LLM Training: A Visual Overview
Before diving into code, let’s understand what we’re building:
Our Mini-GPT follows a simplified transformer architecture with these essential components:
- Token Embeddings: Turning words into numbers
- Positional Encodings: Adding location-awareness
- Self-Attention: Understanding relationships between words
- Feed-Forward Networks: Processing the information
- The Full Model: Connecting everything together
Let’s build each one step-by-step!
1. Token Embeddings: Words as Vectors
Imagine each word in your vocabulary having its unique position in a high-dimensional space. That’s essentially what token embeddings do:
import math
import torch
import torch.nn as nn

class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        """Convert token IDs to meaningful vectors."""
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.embed_dim = embed_dim

    def forward(self, x):
        # Scale embeddings for more stable gradients
        return self.embedding(x) * math.sqrt(self.embed_dim)
Think of embeddings as giving each word its unique “personality profile” – a numerical fingerprint capturing its meaning. When you see vocab_size, that’s the total number of words our model knows, while embed_dim determines how detailed each word’s “profile” will be.
The scaling factor (multiplying by the square root of dimension) might seem mysterious, but it serves a critical purpose: it helps keep our gradients stable during training. Training would be more likely to diverge or get stuck without this scaling.
Beginner’s Note: It’s like assigning coordinates to each word in a “meaning space.” Words with similar meanings will have nearby coordinates. For a fascinating visualization of how these word vectors capture relationships, check out Embedding Projector where you can actually explore word embeddings in 3D space!
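To make the shapes concrete, here is a quick check you can run once the class is defined; the batch size and sequence length below are arbitrary illustration values:
# Quick shape check (illustrative values only)
token_embed = TokenEmbedding(vocab_size=50257, embed_dim=256)
token_ids = torch.randint(0, 50257, (2, 16))   # 2 sequences of 16 token IDs each
vectors = token_embed(token_ids)
print(vectors.shape)  # torch.Size([2, 16, 256]) – one 256-dim "profile" per token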
2. Positional Encodings: Order Matters!
Unlike humans, neural networks don’t automatically understand word order. Positional encodings solve this problem:
class PositionalEncoding(nn.Module):
    def __init__(self, embed_dim, max_seq_length=1000):
        """Add position information to embeddings."""
        super().__init__()
        # Create position encodings matrix
        pe = torch.zeros(max_seq_length, embed_dim)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, embed_dim, 2).float() * (-math.log(10000.0) / embed_dim)
        )
        # Apply sine to even indices
        pe[:, 0::2] = torch.sin(position * div_term)
        # Apply cosine to odd indices
        pe[:, 1::2] = torch.cos(position * div_term)
        # Register buffer (not a parameter, but part of the module)
        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        # Add the fixed positional pattern to the incoming embeddings
        return x + self.pe[:, :x.size(1)]
This seemingly complex code implements an elegant mathematical pattern. Instead of having the model learn position information from scratch (which would be difficult), we provide it with carefully designed positional “fingerprints.”
Every position in a sequence gets its unique encoding pattern that the model can easily recognize. The magic happens through sine and cosine waves of different frequencies – position 1 gets one pattern, position 2 gets a slightly different pattern, and so on. These patterns are designed so that:
- Each position gets a unique encoding
- The model can easily compute relative positions
- The pattern generalizes to positions it hasn’t seen during training
Beginner’s Note: This is like giving each word a “location stamp” so the model knows if a word appears at the beginning, middle, or end of a sentence.
💡Deeper Dive: Curious about why sine and cosine functions work so well for positional encoding? Check out The Positional Encoding paper which beautifully explains the mathematical intuition behind this elegant approach!
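With the small forward method included above (it simply adds the stored pattern to the incoming embeddings), you can peek at these positional fingerprints directly; the values below are purely illustrative:
# Illustrative peek at the positional fingerprints
pos_encoding = PositionalEncoding(embed_dim=256, max_seq_length=128)
dummy = torch.zeros(1, 10, 256)     # pretend embeddings for 10 tokens
encoded = pos_encoding(dummy)
print(encoded.shape)                # torch.Size([1, 10, 256])
print(encoded[0, 0, :4])            # position 0 starts at sin(0)=0, cos(0)=1, ...
print(encoded[0, 5, :4])            # position 5 gets a different, unique pattern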
3. Self-Attention: Understanding Relationships
This is the “magic” that helps our model understand how words relate to each other:
class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads, dropout=0.1):
        """Multi-head attention mechanism."""
        super().__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        # Create projections for queries, keys, values
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)
        self.output_projection = nn.Linear(embed_dim, embed_dim)
        self.dropout = nn.Dropout(dropout)
The attention mechanism is the revolutionary breakthrough that powers modern language models. Think of it as the model’s ability to “focus” on relevant parts of the input when making predictions.
The concept is beautifully intuitive: for each word (or token) in your sequence, the model asks:
- “What am I looking for?” (query)
- “What do other words have to offer?” (key)
- “What information do I collect?” (value)
For example, when processing “The cat sat on the mat because it was comfortable,” the word “it” creates a query that strongly matches “mat” (not “cat”). This allows the model to understand that “it” refers to the mat in this context.
The “multi-head” part means the model runs multiple attention mechanisms in parallel, each potentially focusing on different aspects of language (like grammar, subject-object relationships, or topic relevance).
🔍 Beginner’s Note: Imagine each word asking “How relevant are other words to me?” Self-attention calculates these relevance scores and uses them to create a context-aware representation.
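The snippet above defines only the projection layers. As a reference, here is a minimal sketch of what the forward pass could look like, assuming an optional mask argument; the exact implementation in the accompanying GitHub repository may differ in details:
    def forward(self, x, mask=None):
        batch_size, seq_len, _ = x.shape
        # Project inputs and split into heads: (batch, num_heads, seq_len, head_dim)
        q = self.query(x).view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.key(x).view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.value(x).view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        # Relevance scores between every pair of positions, scaled for stability
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        weights = self.dropout(torch.softmax(scores, dim=-1))
        # Weighted sum of values, then merge heads and project back to embed_dim
        context = torch.matmul(weights, v).transpose(1, 2).contiguous().view(batch_size, seq_len, self.embed_dim)
        return self.output_projection(context)
Each head works on its own head_dim-sized slice of the embedding, which is what lets different heads specialize in different relationships.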
4. Feed-Forward Networks: Processing Information
After gathering context through attention, each position gets processed independently:
class FeedForward(nn.Module):
    def __init__(self, embed_dim, ff_dim, dropout=0.1):
        """Position-wise feed-forward network."""
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.GELU(),  # Modern activation function
            nn.Dropout(dropout),
            nn.Linear(ff_dim, embed_dim),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        # Apply the expansion-compression network to every position independently
        return self.net(x)
While this component might seem simple compared to attention, it plays a crucial role. The feed-forward network is like each word’s “thinking time” – after gathering context from other words through attention, each word gets processed independently through this neural network.
Why do we need this? The feed-forward network:
- Adds computational capacity to the model
- Processes the gathered contextual information
- Allows non-linear transformations of the data
- Serves as the model’s “reasoning” component after attention
The expanded inner dimension (ff_dim, typically 4x larger than embed_dim) gives the network more expressive power – like expanding the size of a whiteboard for complex calculations before summarizing the results.
🔍 Beginner’s Note: This is like each word “thinking about” the information it gathered through attention. The expanded middle layer gives it more “thinking capacity.”
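As a quick illustration of that expand-then-compress shape (with arbitrary example sizes), you can list the layer shapes directly:
# The inner layer is 4x wider, so it holds most of this block's parameters
ff = FeedForward(embed_dim=256, ff_dim=1024)
for name, p in ff.named_parameters():
    print(name, tuple(p.shape))
# net.0.weight (1024, 256)  – expand onto the "whiteboard"
# net.3.weight (256, 1024)  – compress back to the embedding size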
5. Transformer Block: Combining the Pieces
Now, let’s put attention and feed-forward together with residual connections:
class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
        """A single transformer block."""
        super().__init__()
        # Components
        self.attention = MultiHeadAttention(embed_dim, num_heads, dropout)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.feed_forward = FeedForward(embed_dim, ff_dim, dropout)
        self.dropout = nn.Dropout(dropout)
The transformer block combines all our previous components into a cohesive processing unit. It follows an elegant and effective pattern:
- Normalize the input (helps training stability)
- Apply attention (gather context)
- Add the result to the original input (residual connection)
- Normalize again
- Apply feed-forward network (process information)
- Add the result to the previous step (another residual connection)
These residual connections (the adding steps) are crucial – they create highways for information to flow through the network. Without them, deep transformers would be nearly impossible to train.
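Translating that six-step flow into code, a pre-norm forward method for the block could look like the sketch below (the post shows only the constructor; the ordering here follows the list above and may differ slightly from the full repository code):
    def forward(self, x, mask=None):
        # Steps 1-3: normalize, gather context with attention, add back the input (residual)
        x = x + self.dropout(self.attention(self.norm1(x), mask=mask))
        # Steps 4-6: normalize again, feed-forward "thinking time", another residual
        x = x + self.dropout(self.feed_forward(self.norm2(x)))
        return x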
The layer normalization (nn.LayerNorm) is like a “reset button” that prevents the values from growing too large or too small as they pass through many layers.
🔍 Beginner’s Note: The residual connections (adding the original input) help information flow through the network, making it easier to train deeper models.
🌐 Interactive Learning: For an incredible hands-on experience with transformers, explore The Transformer Playground where you can watch each component process text in real time and see exactly how information flows through the model.
6. The Complete Mini-GPT Model
Finally, let’s assemble everything into our complete model:
class MiniGPT(nn.Module):
    def __init__(
        self,
        vocab_size=50257,      # Default GPT-2 vocabulary size
        max_seq_length=128,
        embed_dim=256,
        num_heads=8,
        num_layers=4,
        ff_dim=1024,
        dropout=0.1
    ):
        """Mini-GPT language model."""
        super().__init__()
        # Token embeddings
        self.token_embed = TokenEmbedding(vocab_size, embed_dim)
        # Positional encodings
        self.pos_encoding = PositionalEncoding(embed_dim, max_seq_length)
        # Dropout
        self.dropout = nn.Dropout(dropout)
        # Transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(embed_dim, num_heads, ff_dim, dropout)
            for _ in range(num_layers)
        ])
        # Final normalization and output projection
        self.norm = nn.LayerNorm(embed_dim)
        self.output = nn.Linear(embed_dim, vocab_size)
Our complete Mini-GPT brings together all the components we’ve built. The architecture follows a logical flow:
- Embedding Layer: Convert token IDs to vectors and add positional information
- Transformer Blocks: Process the tokens through multiple layers of attention and feed-forward networks
- Output Layer: Project the final representations back to vocabulary-sized logits
The model is autoregressive – it predicts one token at a time by only looking at previous tokens. This is achieved through masking in the attention mechanism (the mask parameter), which ensures each position can only attend to positions that came before it.
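For reference, a forward pass consistent with that description could look like the following sketch; building the causal mask with torch.tril is one common way to implement the masking mentioned above (the repository version may differ):
    def forward(self, x):
        batch_size, seq_len = x.shape
        # Causal mask: position i may only attend to positions 0..i
        mask = torch.tril(torch.ones(seq_len, seq_len, device=x.device)).unsqueeze(0).unsqueeze(0)
        # Embed tokens, add positional information, apply dropout
        h = self.dropout(self.pos_encoding(self.token_embed(x)))
        # Run through each transformer block in sequence
        for block in self.blocks:
            h = block(h, mask=mask)
        # Final normalization and projection to vocabulary-sized logits
        return self.output(self.norm(h))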
Testing Our Model
Let’s make sure everything works as expected:
def test_model():
    """Test that our model processes inputs correctly."""
    # Same device we set up in Part 2 (falls back to CPU if no GPU is available)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # Create a small model
    model = MiniGPT(
        vocab_size=50257,
        max_seq_length=64,
        embed_dim=128,
        num_heads=4,
        num_layers=2,
        ff_dim=512
    ).to(device)
    # Generate random input
    batch_size = 2
    seq_len = 16
    x = torch.randint(0, 1000, (batch_size, seq_len)).to(device)
    # Forward pass
    with torch.no_grad():
        output = model(x)
    print(f"Input shape: {x.shape}")
    print(f"Output shape: {output.shape}")
    # Our output should have the vocabulary size in the last dimension
    assert output.shape == (batch_size, seq_len, 50257)
    print("Model test passed!")
This test function creates a mini version of our model and verifies that it can process inputs and produce outputs of the expected shape. While simple, it helps catch any fundamental issues with our architecture before we commit to training.
When we run this function, we’d expect to see something like:
Input shape: torch.Size([2, 16])
Output shape: torch.Size([2, 16, 50257])
Model test passed!
Think of this test as a sanity check before embarking on hours of training – similar to checking your car’s engine before a cross-country journey. It creates a miniature version of our language model with just 2 transformer layers (instead of 4) and confirms it can:
- Process batched sequences of tokens
- Transform them through the entire neural architecture
- Produce properly shaped probability distributions over our vocabulary
The beauty of this test lies in its simplicity. By creating random token inputs and verifying the output dimensions, we can catch architectural bugs that might otherwise waste hours of precious training time. It’s like having a safety net before the high-wire act of model training begins!
Key Insights: How Our Model Works
Let’s understand some important characteristics of our Mini-GPT:
- Model Size: ~29 million parameters with the default configuration above, most of them in the embedding and output layers (compared to GPT-2 Small’s 124M)
- Structure: 4 transformer layers with 8 attention heads each
- Context Window: Can handle sequences up to 128 tokens long
- Prediction: Autoregressive (predicts one token at a time)
The causal masking ensures each position can only attend to previous positions – this is critical for autoregressive generation where we predict one token at a time.
Coming Next in LLM Training: Training and Generation
In the next part, we’ll:
- Implement the training loop
- Create text generation functions
- Train our model on real data
- Generate our first AI-created text!
Exercise for the Reader
Before continuing to Part 4 of LLM Training, try these exercises:
- Calculate how many parameters are in each component of our model (a starter helper is sketched after this list)
- Experiment with different model sizes (embed_dim, num_layers, etc.)
- Add a function to save and load model checkpoints
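If you want a starting point for the first exercise, here is one possible helper (not part of the original code) that breaks the parameter count down by top-level component:
def count_parameters(model):
    """Print trainable parameters per top-level component (hypothetical helper)."""
    total = 0
    for name, module in model.named_children():
        n = sum(p.numel() for p in module.parameters() if p.requires_grad)
        total += n
        print(f"{name:15s} {n:>12,}")
    print(f"{'total':15s} {total:>12,}")

count_parameters(MiniGPT())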
Want a deeper understanding? For an accessible yet comprehensive explanation of the transformer architecture, check out The Illustrated Transformer – it’s widely considered the gold standard for intuitively visualizing these complex models!
Happy coding! For the complete implementation of the model architecture, check out our GitHub repository.