Building the Brain: Implementing Mini-GPT’s Core Components
Welcome to Part 3 of our Mini-GPT LLM training journey! Now that we’ve set up our environment and data pipeline in Part 2, let’s build the actual model components that make language understanding possible.
- LLM Training Simplified: Building Your First Language Model – 1
- LLM Training Simplified: Building Your First Language Model – 2
- LLM Training Simplified: Building Your First Language Model – 4
The Building Blocks of LLM Training: A Visual Overview
Before diving into code, let’s understand what we’re building:
Our Mini-GPT follows a simplified transformer architecture with these essential components:
- Token Embeddings: Turning words into numbers
- Positional Encodings: Adding location-awareness
- Self-Attention: Understanding relationships between words
- Feed-Forward Networks: Processing the information
- The Full Model: Connecting everything together
Let’s build each one step-by-step!
1. Token Embeddings: Words as Vectors
Imagine each word in your vocabulary having its unique position in a high-dimensional space. That’s essentially what token embeddings do:
import math
import torch
import torch.nn as nn

class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        """Convert token IDs to meaningful vectors."""
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.embed_dim = embed_dim

    def forward(self, x):
        # Scale embeddings for more stable gradients
        return self.embedding(x) * math.sqrt(self.embed_dim)
Think of embeddings as giving each word its unique “personality profile” – a numerical fingerprint capturing its meaning. When you see vocab_size, that’s the total number of words our model knows, while embed_dim determines how detailed each word’s “profile” will be.
The scaling factor (multiplying by the square root of dimension) might seem mysterious, but it serves a critical purpose: it helps keep our gradients stable during training. Training would be more likely to diverge or get stuck without this scaling.
Beginner’s Note: It’s like assigning coordinates to each word in a “meaning space.” Words with similar meanings will have nearby coordinates. For a fascinating visualization of how these word vectors capture relationships, check out Embedding Projector where you can actually explore word embeddings in 3D space!
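To make the shapes concrete, here is a quick check you can run once the class is defined; the batch size and sequence length below are arbitrary illustration values:
# Quick shape check (illustrative values only)
token_embed = TokenEmbedding(vocab_size=50257, embed_dim=256)
token_ids = torch.randint(0, 50257, (2, 16))   # 2 sequences of 16 token IDs each
vectors = token_embed(token_ids)
print(vectors.shape)  # torch.Size([2, 16, 256]) – one 256-dim "profile" per token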
2. Positional Encodings: Order Matters!
Unlike humans, neural networks don’t automatically understand word order. Positional encodings solve this problem:
class PositionalEncoding(nn.Module):
    def __init__(self, embed_dim, max_seq_length=1000):
        """Add position information to embeddings."""
        super().__init__()
        # Create position encodings matrix
        pe = torch.zeros(max_seq_length, embed_dim)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, embed_dim, 2).float() * (-math.log(10000.0) / embed_dim)
        )
        # Apply sine to even indices
        pe[:, 0::2] = torch.sin(position * div_term)
        # Apply cosine to odd indices
        pe[:, 1::2] = torch.cos(position * div_term)
        # Register buffer (not a parameter, but part of the module)
        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        # Add the fixed positional pattern to the incoming embeddings
        return x + self.pe[:, :x.size(1)]
This seemingly complex code implements an elegant mathematical pattern. Instead of having the model learn position information from scratch (which would be difficult), we provide it with carefully designed positional “fingerprints.”
Every position in a sequence gets its unique encoding pattern that the model can easily recognize. The magic happens through sine and cosine waves of different frequencies – position 1 gets one pattern, position 2 gets a slightly different pattern, and so on. These patterns are designed so that:
- Each position gets a unique encoding
- The model can easily compute relative positions
- The pattern generalizes to positions it hasn’t seen during training
Beginner’s Note: This is like giving each word a “location stamp” so the model knows if a word appears at the beginning, middle, or end of a sentence.
💡Deeper Dive: Curious about why sine and cosine functions work so well for positional encoding? Check out The Positional Encoding paper which beautifully explains the mathematical intuition behind this elegant approach!
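With the small forward method included above (it simply adds the stored pattern to the incoming embeddings), you can peek at these positional fingerprints directly; the values below are purely illustrative:
# Illustrative peek at the positional fingerprints
pos_encoding = PositionalEncoding(embed_dim=256, max_seq_length=128)
dummy = torch.zeros(1, 10, 256)     # pretend embeddings for 10 tokens
encoded = pos_encoding(dummy)
print(encoded.shape)                # torch.Size([1, 10, 256])
print(encoded[0, 0, :4])            # position 0 starts at sin(0)=0, cos(0)=1, ...
print(encoded[0, 5, :4])            # position 5 gets a different, unique pattern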
3. Self-Attention: Understanding Relationships
This is the “magic” that helps our model understand how words relate to each other:
class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads, dropout=0.1):
        """Multi-head attention mechanism."""
        super().__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        # Create projections for queries, keys, values
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)
        self.output_projection = nn.Linear(embed_dim, embed_dim)
        self.dropout = nn.Dropout(dropout)
The attention mechanism is the revolutionary breakthrough that powers modern language models. Think of it as the model’s ability to “focus” on relevant parts of the input when making predictions.
The concept is beautifully intuitive: for each word (or token) in your sequence, the model asks:
- “What am I looking for?” (query)
- “What do other words have to offer?” (key)
- “What information do I collect?” (value)
For example, when processing “The cat sat on the mat because it was comfortable,” the word “it” creates a query that strongly matches “mat” (not “cat”). This allows the model to understand that “it” refers to the mat in this context.
The “multi-head” part means the model runs multiple attention mechanisms in parallel, each potentially focusing on different aspects of language (like grammar, subject-object relationships, or topic relevance).
🔍 Beginner’s Note: Imagine each word asking “How relevant are other words to me?” Self-attention calculates these relevance scores and uses them to create a context-aware representation.
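The snippet above defines only the projection layers. As a reference, here is a minimal sketch of what the forward pass could look like, assuming an optional mask argument; the exact implementation in the accompanying GitHub repository may differ in details:
    def forward(self, x, mask=None):
        batch_size, seq_len, _ = x.shape
        # Project inputs and split into heads: (batch, num_heads, seq_len, head_dim)
        q = self.query(x).view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.key(x).view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.value(x).view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        # Relevance scores between every pair of positions, scaled for stability
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        weights = self.dropout(torch.softmax(scores, dim=-1))
        # Weighted sum of values, then merge heads and project back to embed_dim
        context = torch.matmul(weights, v).transpose(1, 2).contiguous().view(batch_size, seq_len, self.embed_dim)
        return self.output_projection(context)
Each head works on its own head_dim-sized slice of the embedding, which is what lets different heads specialize in different relationships.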
4. Feed-Forward Networks: Processing Information
After gathering context through attention, each position gets processed independently:
class FeedForward(nn.Module):
    def __init__(self, embed_dim, ff_dim, dropout=0.1):
        """Position-wise feed-forward network."""
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.GELU(),  # Modern activation function
            nn.Dropout(dropout),
            nn.Linear(ff_dim, embed_dim),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        # Apply the expansion-compression network to every position independently
        return self.net(x)
While this component might seem simple compared to attention, it plays a crucial role. The feed-forward network is like each word’s “thinking time” – after gathering context from other words through attention, each word gets processed independently through this neural network.
Why do we need this? The feed-forward network:
- Adds computational capacity to the model
- Processes the gathered contextual information
- Allows non-linear transformations of the data
- Serves as the model’s “reasoning” component after attention
The expanded inner dimension (ff_dim, typically 4x larger than embed_dim) gives the network more expressive power – like expanding the size of a whiteboard for complex calculations before summarizing the results.
🔍 Beginner’s Note: This is like each word “thinking about” the information it gathered through attention. The expanded middle layer gives it more “thinking capacity.”
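As a quick illustration of that expand-then-compress shape (with arbitrary example sizes), you can list the layer shapes directly:
# The inner layer is 4x wider, so it holds most of this block's parameters
ff = FeedForward(embed_dim=256, ff_dim=1024)
for name, p in ff.named_parameters():
    print(name, tuple(p.shape))
# net.0.weight (1024, 256)  – expand onto the "whiteboard"
# net.3.weight (256, 1024)  – compress back to the embedding size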
5. Transformer Block: Combining the Pieces
Now, let’s put attention and feed-forward together with residual connections:
class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
        """A single transformer block."""
        super().__init__()
        # Components
        self.attention = MultiHeadAttention(embed_dim, num_heads, dropout)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.feed_forward = FeedForward(embed_dim, ff_dim, dropout)
        self.dropout = nn.Dropout(dropout)
The transformer block combines all our previous components into a cohesive processing unit. It follows an elegant and effective pattern:
- Normalize the input (helps training stability)
- Apply attention (gather context)
- Add the result to the original input (residual connection)
- Normalize again
- Apply feed-forward network (process information)
- Add the result to the previous step (another residual connection)
These residual connections (the adding steps) are crucial – they create highways for information to flow through the network. Without them, deep transformers would be nearly impossible to train.
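Translating that six-step flow into code, a pre-norm forward method for the block could look like the sketch below (the post shows only the constructor; the ordering here follows the list above and may differ slightly from the full repository code):
    def forward(self, x, mask=None):
        # Steps 1-3: normalize, gather context with attention, add back the input (residual)
        x = x + self.dropout(self.attention(self.norm1(x), mask=mask))
        # Steps 4-6: normalize again, feed-forward "thinking time", another residual
        x = x + self.dropout(self.feed_forward(self.norm2(x)))
        return x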
The layer normalization (nn.LayerNorm) is like a “reset button” that prevents the values from growing too large or too small as they pass through many layers.
🔍 Beginner’s Note: The residual connections (adding the original input) help information flow through the network, making it easier to train deeper models.
🌐 Interactive Learning: For an incredible hands-on experience with transformers, explore The Transformer Playground where you can watch each component process text in real time and see exactly how information flows through the model.
6. The Complete Mini-GPT Model
Finally, let’s assemble everything into our complete model:
class MiniGPT(nn.Module):
    def __init__(
        self,
        vocab_size=50257,      # Default GPT-2 vocabulary size
        max_seq_length=128,
        embed_dim=256,
        num_heads=8,
        num_layers=4,
        ff_dim=1024,
        dropout=0.1
    ):
        """Mini-GPT language model."""
        super().__init__()
        # Token embeddings
        self.token_embed = TokenEmbedding(vocab_size, embed_dim)
        # Positional encodings
        self.pos_encoding = PositionalEncoding(embed_dim, max_seq_length)
        # Dropout
        self.dropout = nn.Dropout(dropout)
        # Transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(embed_dim, num_heads, ff_dim, dropout)
            for _ in range(num_layers)
        ])
        # Final normalization and output projection
        self.norm = nn.LayerNorm(embed_dim)
        self.output = nn.Linear(embed_dim, vocab_size)
Our complete Mini-GPT brings together all the components we’ve built. The architecture follows a logical flow:
- Embedding Layer: Convert token IDs to vectors and add positional information
- Transformer Blocks: Process the tokens through multiple layers of attention and feed-forward networks
- Output Layer: Project the final representations back to vocabulary-sized logits
The model is autoregressive – it predicts one token at a time by only looking at previous tokens. This is achieved through masking in the attention mechanism (the mask parameter), which ensures each position can only attend to positions that came before it.
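For reference, a forward pass consistent with that description could look like the following sketch; building the causal mask with torch.tril is one common way to implement the masking mentioned above (the repository version may differ):
    def forward(self, x):
        batch_size, seq_len = x.shape
        # Causal mask: position i may only attend to positions 0..i
        mask = torch.tril(torch.ones(seq_len, seq_len, device=x.device)).unsqueeze(0).unsqueeze(0)
        # Embed tokens, add positional information, apply dropout
        h = self.dropout(self.pos_encoding(self.token_embed(x)))
        # Run through each transformer block in sequence
        for block in self.blocks:
            h = block(h, mask=mask)
        # Final normalization and projection to vocabulary-sized logits
        return self.output(self.norm(h))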
Testing Our Model
Let’s make sure everything works as expected:
def test_model():
    """Test that our model processes inputs correctly."""
    # Same device we set up in Part 2 (falls back to CPU if no GPU is available)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # Create a small model
    model = MiniGPT(
        vocab_size=50257,
        max_seq_length=64,
        embed_dim=128,
        num_heads=4,
        num_layers=2,
        ff_dim=512
    ).to(device)
    # Generate random input
    batch_size = 2
    seq_len = 16
    x = torch.randint(0, 1000, (batch_size, seq_len)).to(device)
    # Forward pass
    with torch.no_grad():
        output = model(x)
    print(f"Input shape: {x.shape}")
    print(f"Output shape: {output.shape}")
    # Our output should have the vocabulary size in the last dimension
    assert output.shape == (batch_size, seq_len, 50257)
    print("Model test passed!")
This test function creates a mini version of our model and verifies that it can process inputs and produce outputs of the expected shape. While simple, it helps catch any fundamental issues with our architecture before we commit to training.
When we run this function, we’d expect to see something like:
Input shape: torch.Size([2, 16])
Output shape: torch.Size([2, 16, 50257])
Model test passed!
Think of this test as a sanity check before embarking on hours of training – similar to checking your car’s engine before a cross-country journey. It creates a miniature version of our language model with just 2 transformer layers (instead of 4) and confirms it can:
- Process batched sequences of tokens
- Transform them through the entire neural architecture
- Produce properly shaped probability distributions over our vocabulary
The beauty of this test lies in its simplicity. By creating random token inputs and verifying the output dimensions, we can catch architectural bugs that might otherwise waste hours of precious training time. It’s like having a safety net before the high-wire act of model training begins!
Key Insights: How Our Model Works
Let’s understand some important characteristics of our Mini-GPT:
- Model Size: ~29 million parameters with the default configuration above, most of them in the embedding and output layers (compared to GPT-2 Small’s 124M)
- Structure: 4 transformer layers with 8 attention heads each
- Context Window: Can handle sequences up to 128 tokens long
- Prediction: Autoregressive (predicts one token at a time)
The causal masking ensures each position can only attend to previous positions – this is critical for autoregressive generation where we predict one token at a time.
Coming Next in LLM Training: Training and Generation
In the next part, we’ll:
- Implement the training loop
- Create text generation functions
- Train our model on real data
- Generate our first AI-created text!
Exercise for the Reader
Before continuing to Part 4 of LLM Training, try these exercises:
- Calculate how many parameters are in each component of our model (a starter helper is sketched after this list)
- Experiment with different model sizes (embed_dim, num_layers, etc.)
- Add a function to save and load model checkpoints
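If you want a starting point for the first exercise, here is one possible helper (not part of the original code) that breaks the parameter count down by top-level component:
def count_parameters(model):
    """Print trainable parameters per top-level component (hypothetical helper)."""
    total = 0
    for name, module in model.named_children():
        n = sum(p.numel() for p in module.parameters() if p.requires_grad)
        total += n
        print(f"{name:15s} {n:>12,}")
    print(f"{'total':15s} {total:>12,}")

count_parameters(MiniGPT())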
Want a deeper understanding? For an accessible yet comprehensive explanation of the transformer architecture, check out The Illustrated Transformer – it’s widely considered the gold standard for intuitively visualizing these complex models!
Happy coding! For the complete implementation of the model architecture, check out our GitHub repository.