Setting Up the Foundation: Environment and Data for Mini-GPT
Welcome back to our Mini-GPT LLM training journey! In Part 1, we explored what we’re building and why. Now, we’ll set up our environment and prepare the data – the essential foundation before diving into model architecture.
Other parts in this series:
- LLM Training Simplified: Building Your First Language Model – 1
- LLM Training Simplified: Building Your First Language Model – 3
- LLM Training Simplified: Building Your First Language Model – 4
Setting Up Our Development Environment
Before diving into model building, let’s set up a proper environment. We’ll need a few key packages:
Note: Python and pip installation is a pre-requisite
# Let's install our requirements
!pip install torch transformers datasets wandb tqdm matplotlib
# Import necessary libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import os
import matplotlib.pyplot as plt
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2Tokenizer
# Check if we have GPU support (highly recommended!)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
This setup creates the foundation for our language model journey. Think of these packages as the essential toolkit for an AI craftsperson:
- PyTorch provides the neural network foundation
- Transformers gives us access to state-of-the-art model architectures
- Datasets help us efficiently manage training data
- WandB enables experiment tracking
- Matplotlib lets us visualize our results
Pro Tip: If you’re running this on your local machine without a powerful GPU, don’t worry! Our Mini-GPT is designed to be trainable on modest hardware. That’s the beauty of starting small.
Exploring and Downloading Dataset Options For LLM Training
The quality of our training data directly impacts our model’s performance. Let’s explore some options:
def explore_datasets():
"""Show popular text datasets for language modeling."""
print("Popular datasets for training language models:")
print("1. wikitext-2 (~30MB): A small, clean Wikipedia dataset")
print("2. wikitext-103 (~500MB): Larger Wikipedia dataset")
print("3. bookcorpus (~4GB): Books dataset (takes longer to download)")
print("4. oscar (~1TB full, but has subsets): Web text in many languages")
print("5. c4 (Common Crawl): Massive web corpus with filtering")
Just as a chef needs quality ingredients, our language model needs quality data! This function gives us an overview of popular datasets, helping us make an informed choice. The key insight here is that dataset selection involves balancing several factors:
- Size: Larger datasets provide more examples but require more training time
- Quality: Clean, well-formatted text leads to better language understanding
- Domain: The content type shapes what kind of language patterns the model learns
- Practicality: For learning purposes, smaller datasets let us iterate faster
For our Mini-GPT, we’ll use WikiText-2 – a dataset that’s neither too big nor too small, and that contains clean, well-edited Wikipedia text.
Downloading and Preparing Our Dataset
With our dataset selected, let’s download and prepare it:
def get_training_data(dataset_name="wikitext", config_name="wikitext-2-raw-v1", split="train"):
    """Load and prepare our training data."""
    from datasets import load_dataset
    print(f"Downloading and preparing the {config_name} dataset...")
    # WikiText-2 lives under the "wikitext" dataset on the Hugging Face hub and is
    # selected via a config name. This call downloads it if it's not already cached.
    dataset = load_dataset(dataset_name, config_name, split=split)
# Preview what our data looks like
print(f"Dataset size: {len(dataset)} documents")
print(f"Sample text:\n{dataset[0]['text'][:200]}...")
return dataset
# Load our dataset (this will download it if needed)
print("\nDownloading WikiText-2 dataset...")
train_data = get_training_data()
This function is our data procurement system. It connects to the Hugging Face datasets hub, downloads our chosen dataset, and gives us a quick preview. Behind the scenes, it’s handling network requests, file extraction, and data organization – tasks that would be tedious to implement manually.
You might want to save the dataset locally to avoid downloading it again:
def save_dataset_to_disk(dataset, output_dir="./data"):
"""Save a dataset to disk for faster loading in future."""
if not os.path.exists(output_dir):
os.makedirs(output_dir)
print(f"Saving dataset to {output_dir}...")
dataset.save_to_disk(output_dir)
print(f"Dataset saved successfully. Load it later with:")
print(f"from datasets import load_from_disk")
print(f"dataset = load_from_disk('{output_dir}')")
Understanding Tokenization and Vectorisation: From Text to Numbers
Computers don’t understand words – they understand numbers.
Curious about how tokenization shapes modern language understanding? Check out Hugging Face’s excellent Tokenizers documentation, which walks through the mechanics of subword tokenization and shows how tokenizers balance vocabulary size with semantic flexibility – transforming raw text into numerical patterns that machines can process efficiently.
To learn more about how words are transformed into numbers to train models, read our blog on Text Vectorisation.
def initialize_tokenizer():
"""Initialize our tokenizer with a smaller vocabulary."""
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# Let's see what tokenization looks like
example = "Building a language model is fun and educational!"
tokens = tokenizer.encode(example)
print(f"Original text: {example}")
print(f"Tokenized: {tokens}")
print(f"Decoded back: {tokenizer.decode(tokens)}")
return tokenizer
tokenizer = initialize_tokenizer()
Tokenization is the magic translator between human language and machine-readable format. The function above shows this translation in action, converting readable text into numerical tokens and back again.
Think of tokenization as breaking a sentence into puzzle pieces. Some pieces might be whole words, while others might be word fragments. The beauty of modern tokenizers is that they can handle words they’ve never seen before by breaking them into familiar subword units – similar to how humans might understand a new word by recognizing its familiar parts.
Example
Let’s explore how tokenization works with an example:
def explore_tokenization(tokenizer):
"""Explore how different text gets tokenized."""
examples = [
"Hello world!",
"artificial intelligence",
"transformers are powerful",
"GPT-3 generates text",
"a b c d e f g",
"antidisestablishmentarianism" # Long word
]
for example in examples:
tokens = tokenizer.encode(example)
token_words = [tokenizer.decode([t]) for t in tokens]
print(f"\nText: \"{example}\"")
print(f"Token IDs: {tokens}")
print(f"Individual tokens: {token_words}")
When we run this function, we see fascinating results:
Text: "Hello world!"
Token IDs: [15496, 995, 0]
Individual tokens: ['Hello', ' world', '!']
Text: "antidisestablishmentarianism"
Token IDs: [34794, 3125, 423, 7867]
Individual tokens: ['ant', 'idi', 'sestablish', 'mentarianism']
Notice how simple words remain intact while complex words get broken into meaningful subparts. This clever approach helps language models handle unfamiliar words by recognizing patterns within them, like how we might decipher an unfamiliar word by recognizing its roots!
This exercise reveals how language models “see” text – not as whole words, but as sequences of meaningful sub-word units.
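If you’re curious what the tokenizer is working with, a couple of its properties are worth printing – a quick sketch:
# GPT-2 uses a byte-pair-encoding vocabulary of roughly 50k tokens
print(f"Vocabulary size: {tokenizer.vocab_size}")
# GPT-2 marks document boundaries with a special end-of-text token
print(f"End-of-text token: {tokenizer.eos_token} (id {tokenizer.eos_token_id})")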
Creating Our Dataset Class
Now, let’s build a custom PyTorch Dataset that prepares our text for training:
class MiniGPTDataset(Dataset):
def __init__(self, texts, tokenizer, max_length=128):
"""
Prepare texts for Mini-GPT training.
Args:
texts: List of text documents
tokenizer: Our initialized tokenizer
max_length: Maximum sequence length
"""
self.tokenizer = tokenizer
self.max_length = max_length
# Tokenize all texts and join them
print("Tokenizing dataset...")
self.all_tokens = []
for text in texts:
tokens = tokenizer.encode(text)
self.all_tokens.extend(tokens)
# Create examples of length max_length
self.examples = []
for i in range(0, len(self.all_tokens) - max_length, max_length):
self.examples.append(self.all_tokens[i:i + max_length])
print(f"Created {len(self.examples)} training examples")
def __len__(self):
return len(self.examples)
def __getitem__(self, idx):
# Get tokens for this example
tokens = self.examples[idx]
# Input: all tokens except last one
# Target: all tokens except first one (shifted by 1)
x = torch.tensor(tokens[:-1], dtype=torch.long)
y = torch.tensor(tokens[1:], dtype=torch.long)
return x, y
This dataset class is more than just a container – it’s an intelligent data preparer. It performs several critical tasks:
- Tokenization: Converting all text to numerical token IDs
- Chunking: Breaking long documents into fixed-length training sequences
- Input/Target Creation: Setting up the “predict the next token” task
- Uniform lengths: Producing same-sized examples so the DataLoader can later batch them efficiently for the GPU
Notice the clever way we set up our inputs and targets:
for a sequence like [A, B, C, D], the input becomes [A, B, C] and the target becomes [B, C, D]. This teaches the model to predict the next token given the previous ones – the core of language modelling.
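To make the shift concrete, here’s a tiny sketch with made-up token IDs (the real IDs come from the tokenizer):
# Toy illustration of the shift performed in __getitem__ (IDs are hypothetical)
tokens = [464, 3797, 3332, 319, 262, 2603]
x = torch.tensor(tokens[:-1], dtype=torch.long)  # [464, 3797, 3332, 319, 262]
y = torch.tensor(tokens[1:], dtype=torch.long)   # [3797, 3332, 319, 262, 2603]
print(x)
print(y)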
Creating Data Loaders for Efficient LLM Training
To efficiently feed data to our model during training, we’ll use PyTorch’s DataLoader:
def prepare_dataset_from_huggingface(dataset, tokenizer, max_length=128, column_name='text'):
"""
Convert a Hugging Face dataset into our custom MiniGPT format.
"""
# Extract all texts from the dataset
all_texts = dataset[column_name]
# Create our custom dataset
return MiniGPTDataset(all_texts, tokenizer, max_length)
def create_dataloader(dataset, batch_size=16, shuffle=True, num_workers=2):
"""Create a DataLoader for efficient batch processing."""
return DataLoader(
dataset,
batch_size=batch_size,
shuffle=shuffle,
num_workers=num_workers,
pin_memory=True # This helps speed up data transfer to GPU
)
These functions create an optimized data pipeline. Think of DataLoaders as conveyor belts in a factory, delivering perfectly sized batches of data to our model. They handle:
- Shuffling data for better learning (like shuffling flashcards)
- Parallelizing data loading across CPU cores
- Transferring data efficiently between CPU and GPU
- Maintaining consistent batch sizes
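Putting these pieces together – a minimal sketch, assuming the train_data and tokenizer objects we created earlier:
# Turn the raw Hugging Face dataset into batches our model can consume
train_dataset = prepare_dataset_from_huggingface(train_data, tokenizer, max_length=128)
train_loader = create_dataloader(train_dataset, batch_size=16)

# Grab one batch to sanity-check shapes: (batch_size, max_length - 1) for both
x_batch, y_batch = next(iter(train_loader))
print(f"Input batch shape: {x_batch.shape}, target batch shape: {y_batch.shape}")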
Visualizing Our Dataset
Let’s visualize some examples from our dataset to better understand what our model will learn from:
def visualize_dataset_examples(dataset, tokenizer, num_examples=3):
"""Visualize some examples from our dataset."""
import random
indices = random.sample(range(len(dataset)), num_examples)
for i, idx in enumerate(indices):
x, y = dataset[idx]
# Decode tokens back to text
input_text = tokenizer.decode(x)
target_text = tokenizer.decode(y)
print(f"\n--- Example {i+1} ---")
print(f"Input tokens shape: {x.shape}")
print(f"Target tokens shape: {y.shape}")
print(f"\nInput text snippet: \"{input_text[:100]}...\"")
print(f"Target text snippet: \"{target_text[:100]}...\"")
# Show how they overlap
print("\nNotice how the target is shifted by one token:")
for j in range(min(5, len(x))):
print(f"Input token {j}: '{tokenizer.decode([x[j]])}'")
print(f"Target token {j}: '{tokenizer.decode([y[j]])}'")
Visualization helps us confirm our data preparation is working correctly. It shows:
- How text is broken into token sequences
- The relationship between input and target sequences
- The shifting pattern that creates our next-token prediction task
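Assuming the train_dataset we built above, running the check is a one-liner:
# Inspect a few random examples to confirm the input/target shift
visualize_dataset_examples(train_dataset, tokenizer, num_examples=3)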
Understanding Next-Token Prediction
To better grasp how our model learns, let’s visualize the next token prediction task:
def visualize_next_token_prediction():
"""Visualize how next-token prediction works."""
plt.figure(figsize=(12, 6))
input_text = "The cat sat on the"
possible_next = ["mat", "chair", "dog", "roof", "floor"]
probabilities = [0.45, 0.3, 0.05, 0.1, 0.1]
# Top: input sequence
for i, word in enumerate(input_text.split()):
plt.text(i, 1.0, word, ha='center', va='center',
bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.5))
# Draw arrows between words
if i < len(input_text.split()) - 1:
plt.arrow(i + 0.25, 1.0, 0.5, 0, head_width=0.05, head_length=0.1,
fc='black', ec='black')
# Bottom: possible next tokens with probabilities
for i, (word, prob) in enumerate(zip(possible_next, probabilities)):
plt.text(len(input_text.split()) + i/len(possible_next), 0.5, word,
ha='center', va='center', rotation=45,
bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.5))
# Draw probability arrows
    for i, prob in enumerate(probabilities):
        arrow_width = prob * 2
        plt.arrow(len(input_text.split()) - 0.25, 0.95,
                  i / len(possible_next), -0.35,
                  head_width=0.05, head_length=0.1,
                  width=arrow_width / 40,
                  fc='red', ec='red', alpha=prob)
    # Tidy up and display the figure
    plt.xlim(-0.5, len(input_text.split()) + 1.5)
    plt.ylim(0.2, 1.3)
    plt.axis('off')
    plt.title("Next-token prediction: what follows 'The cat sat on the'?")
    plt.show()
This visualization brings our core training objective to life. Imagine the model reading “The cat sat on the” and calculating probabilities for what might come next: “mat” (45%), “chair” (30%), “floor” (10%), etc.
Next-token prediction is elegant in its simplicity yet powerful in its results. By learning to predict what comes next in a sequence, the model naturally develops:
- Understanding of grammar and syntax
- Grasp of factual relationships
- Awareness of narrative patterns
- Sense of logical flow
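To see how this objective becomes something we can optimize, here’s a small sketch – not the real model, just random logits standing in for an untrained network, using the train_dataset built earlier:
# With random "predictions", the cross-entropy loss sits near ln(vocab_size)
vocab_size = tokenizer.vocab_size
x, y = train_dataset[0]                          # shapes: (127,) and (127,)
fake_logits = torch.randn(len(x), vocab_size)    # one score per vocab entry, per position
loss = F.cross_entropy(fake_logits, y)
print(f"Random-guess loss: {loss.item():.2f} vs ln({vocab_size}) = {math.log(vocab_size):.2f}")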
What We’ve Learned So Far in LLM Training
In this article, we’ve:
- Set up our development environment
- Explored and downloaded a suitable dataset
- Learned how tokenization converts text to numbers
- Built a custom dataset class for efficient training
- Created utilities for data processing and visualization
We now have everything we need to start building our model! In the next part, we’ll implement each component of the Mini-GPT architecture, understanding how they work together to process and generate text.
Coming Next in LLM Training – Part 3
We’ll dive into the core components of our Mini-GPT:
- Token embeddings: Giving meaning to numbers
- Positional encodings: Adding location-awareness
- Attention mechanisms: The “secret sauce” of transformers
- Feed-forward networks: Processing information
- Assembling a complete transformer model
Stay tuned as we bring these building blocks together to create our language model! 🚀
Exercise for the Reader
Before moving to LLM training Part 3, try these exercises:
- Experiment with tokenizing different types of text (code, poetry, technical writing)
- Create a custom dataset from your favourite book or article
- Visualize the distribution of token lengths in your dataset
Happy coding! Check out our GitHub repository for the complete implementation of these functions and classes.
For readers interested in the efficiency and practicality of smaller AI models, check out our article on Small Language Models: A Breakthrough in AI Sustainability. It explores how these lightweight alternatives deliver impressive results with significantly fewer resources — perfect for understanding the future direction of sustainable AI development!