Setting Up the Foundation: Environment and Data for Mini-GPT
Welcome back to our Mini-GPT LLM training journey! In Part 1, we explored what we’re building and why. Now, we’ll set up our environment and prepare the data – the essential foundation before diving into model architecture.
Other parts in this series:
- LLM Training Simplified: Building Your First Language Model – 1
- LLM Training Simplified: Building Your First Language Model – 3
- LLM Training Simplified: Building Your First Language Model – 4
Setting Up Our Development Environment
Before diving into model building, let’s set up a proper environment. We’ll need a few key packages:
Note: Python and pip installation is a pre-requisite
# Let's install our requirements
!pip install torch transformers datasets wandb tqdm matplotlib
# Import necessary libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import os
import matplotlib.pyplot as plt
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2Tokenizer
# Check if we have GPU support (highly recommended!)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
This setup creates the foundation for our language model journey. Think of these packages as the essential toolkit for an AI craftsperson:
- PyTorch provides the neural network foundation
- Transformers gives us access to state-of-the-art model architectures
- Datasets help us efficiently manage training data
- WandB enables experiment tracking
- Matplotlib lets us visualize our results
Pro Tip: If you’re running this on your local machine without a powerful GPU, don’t worry! Our Mini-GPT is designed to be trainable on modest hardware. That’s the beauty of starting small.
Exploring and Downloading Dataset Options For LLM Training
The quality of our training data directly impacts our model’s performance. Let’s explore some options:
def explore_datasets():
"""Show popular text datasets for language modeling."""
print("Popular datasets for training language models:")
print("1. wikitext-2 (~30MB): A small, clean Wikipedia dataset")
print("2. wikitext-103 (~500MB): Larger Wikipedia dataset")
print("3. bookcorpus (~4GB): Books dataset (takes longer to download)")
print("4. oscar (~1TB full, but has subsets): Web text in many languages")
print("5. c4 (Common Crawl): Massive web corpus with filtering")
Just as a chef needs quality ingredients, our language model needs quality data! This function gives us an overview of popular datasets, helping us make an informed choice. The key insight here is that dataset selection involves balancing several factors:
- Size: Larger datasets provide more examples but require more training time
- Quality: Clean, well-formatted text leads to better language understanding
- Domain: The content type shapes what kind of language patterns the model learns
- Practicality: For learning purposes, smaller datasets let us iterate faster
For our Mini-GPT, we’ll use WikiText-2 – a dataset that’s neither too big nor too small, and that contains clean, well-edited Wikipedia text.
Downloading and Preparing Our Dataset
With our dataset selected, let’s download and prepare it:
def get_training_data(dataset_name="wikitext", config_name="wikitext-2-raw-v1", split="train"):
    """Load and prepare our training data."""
    from datasets import load_dataset
    print(f"Downloading and preparing the {config_name} dataset...")
    # WikiText-2 lives under the "wikitext" dataset on the Hugging Face hub and is
    # selected via a config name. This call downloads it if it's not already cached.
    dataset = load_dataset(dataset_name, config_name, split=split)
# Preview what our data looks like
print(f"Dataset size: {len(dataset)} documents")
print(f"Sample text:\n{dataset[0]['text'][:200]}...")
return dataset
# Load our dataset (this will download it if needed)
print("\nDownloading WikiText-2 dataset...")
train_data = get_training_data()
This function is our data procurement system. It connects to the Hugging Face datasets hub, downloads our chosen dataset, and gives us a quick preview. Behind the scenes, it’s handling network requests, file extraction, and data organization – tasks that would be tedious to implement manually.
You might want to save the dataset locally to avoid downloading it again:
def save_dataset_to_disk(dataset, output_dir="./data"):
"""Save a dataset to disk for faster loading in future."""
if not os.path.exists(output_dir):
os.makedirs(output_dir)
print(f"Saving dataset to {output_dir}...")
dataset.save_to_disk(output_dir)
print(f"Dataset saved successfully. Load it later with:")
print(f"from datasets import load_from_disk")
print(f"dataset = load_from_disk('{output_dir}')")
Understanding Tokenization and Vectorisation: From Text to Numbers
Computers don’t understand words – they understand numbers.
Curious about how tokenization shapes modern language understanding? Check out Hugging Face’s excellent Tokenizers documentation, which walks through the mechanics of subword tokenization and shows how tokenizers balance vocabulary size with semantic flexibility – transforming raw text into numerical patterns that machines can process efficiently.
To learn more about how words are transformed into numbers to train models, read our blog on Text Vectorisation.
def initialize_tokenizer():
"""Initialize our tokenizer with a smaller vocabulary."""
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# Let's see what tokenization looks like
example = "Building a language model is fun and educational!"
tokens = tokenizer.encode(example)
print(f"Original text: {example}")
print(f"Tokenized: {tokens}")
print(f"Decoded back: {tokenizer.decode(tokens)}")
return tokenizer
tokenizer = initialize_tokenizer()
Tokenization is the magic translator between human language and machine-readable format. The function above shows this translation in action, converting readable text into numerical tokens and back again.
Think of tokenization as breaking a sentence into puzzle pieces. Some pieces might be whole words, while others might be word fragments. The beauty of modern tokenizers is that they can handle words they’ve never seen before by breaking them into familiar subword units – similar to how humans might understand a new word by recognizing its familiar parts.
Example
Let’s explore how tokenization works with an example:
def explore_tokenization(tokenizer):
"""Explore how different text gets tokenized."""
examples = [
"Hello world!",
"artificial intelligence",
"transformers are powerful",
"GPT-3 generates text",
"a b c d e f g",
"antidisestablishmentarianism" # Long word
]
for example in examples:
tokens = tokenizer.encode(example)
token_words = [tokenizer.decode([t]) for t in tokens]
print(f"\nText: \"{example}\"")
print(f"Token IDs: {tokens}")
print(f"Individual tokens: {token_words}")
When we run this function, we see fascinating results:
Text: "Hello world!"
Token IDs: [15496, 995, 0]
Individual tokens: ['Hello', ' world', '!']
Text: "antidisestablishmentarianism"
Token IDs: [34794, 3125, 423, 7867]
Individual tokens: ['ant', 'idi', 'sestablish', 'mentarianism']
Notice how simple words remain intact while complex words get broken into meaningful subparts. This clever approach helps language models handle unfamiliar words by recognizing patterns within them, like how we might decipher an unfamiliar word by recognizing its roots!
This exercise reveals how language models “see” text – not as whole words, but as sequences of meaningful sub-word units.
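If you’re curious what the tokenizer is working with, a couple of its properties are worth printing – a quick sketch:
# GPT-2 uses a byte-pair-encoding vocabulary of roughly 50k tokens
print(f"Vocabulary size: {tokenizer.vocab_size}")
# GPT-2 marks document boundaries with a special end-of-text token
print(f"End-of-text token: {tokenizer.eos_token} (id {tokenizer.eos_token_id})")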
Creating Our Dataset Class
Now, let’s build a custom PyTorch Dataset that prepares our text for training:
class MiniGPTDataset(Dataset):
def __init__(self, texts, tokenizer, max_length=128):
"""
Prepare texts for Mini-GPT training.
Args:
texts: List of text documents
tokenizer: Our initialized tokenizer
max_length: Maximum sequence length
"""
self.tokenizer = tokenizer
self.max_length = max_length
# Tokenize all texts and join them
print("Tokenizing dataset...")
self.all_tokens = []
for text in texts:
tokens = tokenizer.encode(text)
self.all_tokens.extend(tokens)
# Create examples of length max_length
self.examples = []
for i in range(0, len(self.all_tokens) - max_length, max_length):
self.examples.append(self.all_tokens[i:i + max_length])
print(f"Created {len(self.examples)} training examples")
def __len__(self):
return len(self.examples)
def __getitem__(self, idx):
# Get tokens for this example
tokens = self.examples[idx]
# Input: all tokens except last one
# Target: all tokens except first one (shifted by 1)
x = torch.tensor(tokens[:-1], dtype=torch.long)
y = torch.tensor(tokens[1:], dtype=torch.long)
return x, y
This dataset class is more than just a container – it’s an intelligent data preparer. It performs several critical tasks:
- Tokenization: Converting all text to numerical token IDs
- Chunking: Breaking long documents into fixed-length training sequences
- Input/Target Creation: Setting up the “predict the next token” task
- Uniform lengths: Producing same-sized examples so the DataLoader can later batch them efficiently for the GPU
Notice the clever way we set up our inputs and targets:
for a sequence like [A, B, C, D], the input becomes [A, B, C] and the target becomes [B, C, D]. This teaches the model to predict the next token given the previous ones – the core of language modelling.
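To make the shift concrete, here’s a tiny sketch with made-up token IDs (the real IDs come from the tokenizer):
# Toy illustration of the shift performed in __getitem__ (IDs are hypothetical)
tokens = [464, 3797, 3332, 319, 262, 2603]
x = torch.tensor(tokens[:-1], dtype=torch.long)  # [464, 3797, 3332, 319, 262]
y = torch.tensor(tokens[1:], dtype=torch.long)   # [3797, 3332, 319, 262, 2603]
print(x)
print(y)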
Creating Data Loaders for Efficient LLM Training
To efficiently feed data to our model during training, we’ll use PyTorch’s DataLoader:
def prepare_dataset_from_huggingface(dataset, tokenizer, max_length=128, column_name='text'):
"""
Convert a Hugging Face dataset into our custom MiniGPT format.
"""
# Extract all texts from the dataset
all_texts = dataset[column_name]
# Create our custom dataset
return MiniGPTDataset(all_texts, tokenizer, max_length)
def create_dataloader(dataset, batch_size=16, shuffle=True, num_workers=2):
"""Create a DataLoader for efficient batch processing."""
return DataLoader(
dataset,
batch_size=batch_size,
shuffle=shuffle,
num_workers=num_workers,
pin_memory=True # This helps speed up data transfer to GPU
)
These functions create an optimized data pipeline. Think of DataLoaders as conveyor belts in a factory, delivering perfectly sized batches of data to our model. They handle:
- Shuffling data for better learning (like shuffling flashcards)
- Parallelizing data loading across CPU cores
- Transferring data efficiently between CPU and GPU
- Maintaining consistent batch sizes
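Putting these pieces together – a minimal sketch, assuming the train_data and tokenizer objects we created earlier:
# Turn the raw Hugging Face dataset into batches our model can consume
train_dataset = prepare_dataset_from_huggingface(train_data, tokenizer, max_length=128)
train_loader = create_dataloader(train_dataset, batch_size=16)

# Grab one batch to sanity-check shapes: (batch_size, max_length - 1) for both
x_batch, y_batch = next(iter(train_loader))
print(f"Input batch shape: {x_batch.shape}, target batch shape: {y_batch.shape}")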
Visualizing Our Dataset
Let’s visualize some examples from our dataset to better understand what our model will learn from:
def visualize_dataset_examples(dataset, tokenizer, num_examples=3):
"""Visualize some examples from our dataset."""
import random
indices = random.sample(range(len(dataset)), num_examples)
for i, idx in enumerate(indices):
x, y = dataset[idx]
# Decode tokens back to text
input_text = tokenizer.decode(x)
target_text = tokenizer.decode(y)
print(f"\n--- Example {i+1} ---")
print(f"Input tokens shape: {x.shape}")
print(f"Target tokens shape: {y.shape}")
print(f"\nInput text snippet: \"{input_text[:100]}...\"")
print(f"Target text snippet: \"{target_text[:100]}...\"")
# Show how they overlap
print("\nNotice how the target is shifted by one token:")
for j in range(min(5, len(x))):
print(f"Input token {j}: '{tokenizer.decode([x[j]])}'")
print(f"Target token {j}: '{tokenizer.decode([y[j]])}'")
Visualization helps us confirm our data preparation is working correctly. It shows:
- How text is broken into token sequences
- The relationship between input and target sequences
- The shifting pattern that creates our next-token prediction task
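Assuming the train_dataset we built above, running the check is a one-liner:
# Inspect a few random examples to confirm the input/target shift
visualize_dataset_examples(train_dataset, tokenizer, num_examples=3)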
Understanding Next-Token Prediction
To better grasp how our model learns, let’s visualize the next token prediction task:
def visualize_next_token_prediction():
"""Visualize how next-token prediction works."""
plt.figure(figsize=(12, 6))
input_text = "The cat sat on the"
possible_next = ["mat", "chair", "dog", "roof", "floor"]
probabilities = [0.45, 0.3, 0.05, 0.1, 0.1]
# Top: input sequence
for i, word in enumerate(input_text.split()):
plt.text(i, 1.0, word, ha='center', va='center',
bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.5))
# Draw arrows between words
if i < len(input_text.split()) - 1:
plt.arrow(i + 0.25, 1.0, 0.5, 0, head_width=0.05, head_length=0.1,
fc='black', ec='black')
# Bottom: possible next tokens with probabilities
for i, (word, prob) in enumerate(zip(possible_next, probabilities)):
plt.text(len(input_text.split()) + i/len(possible_next), 0.5, word,
ha='center', va='center', rotation=45,
bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.5))
# Draw probability arrows
    for i, prob in enumerate(probabilities):
        arrow_width = prob * 2
        plt.arrow(len(input_text.split()) - 0.25, 0.95,
                  i / len(possible_next), -0.35,
                  head_width=0.05, head_length=0.1,
                  width=arrow_width / 40,
                  fc='red', ec='red', alpha=prob)
    # Tidy up and display the figure
    plt.xlim(-0.5, len(input_text.split()) + 1.5)
    plt.ylim(0.2, 1.3)
    plt.axis('off')
    plt.title("Next-token prediction: what follows 'The cat sat on the'?")
    plt.show()
This visualization brings our core training objective to life. Imagine the model reading “The cat sat on the” and calculating probabilities for what might come next: “mat” (45%), “chair” (30%), “floor” (10%), etc.
Next-token prediction is elegant in its simplicity yet powerful in its results. By learning to predict what comes next in a sequence, the model naturally develops:
- Understanding of grammar and syntax
- Grasp of factual relationships
- Awareness of narrative patterns
- Sense of logical flow
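To see how this objective becomes something we can optimize, here’s a small sketch – not the real model, just random logits standing in for an untrained network, using the train_dataset built earlier:
# With random "predictions", the cross-entropy loss sits near ln(vocab_size)
vocab_size = tokenizer.vocab_size
x, y = train_dataset[0]                          # shapes: (127,) and (127,)
fake_logits = torch.randn(len(x), vocab_size)    # one score per vocab entry, per position
loss = F.cross_entropy(fake_logits, y)
print(f"Random-guess loss: {loss.item():.2f} vs ln({vocab_size}) = {math.log(vocab_size):.2f}")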
What We’ve Learned So Far in LLM Training
In this article, we’ve:
- Set up our development environment
- Explored and downloaded a suitable dataset
- Learned how tokenization converts text to numbers
- Built a custom dataset class for efficient training
- Created utilities for data processing and visualization
We now have everything we need to start building our model! In the next part, we’ll implement each component of the Mini-GPT architecture, understanding how they work together to process and generate text.
Coming Next in LLM Training – Part 3
We’ll dive into the core components of our Mini-GPT:
- Token embeddings: Giving meaning to numbers
- Positional encodings: Adding location-awareness
- Attention mechanisms: The “secret sauce” of transformers
- Feed-forward networks: Processing information
- Assembling a complete transformer model
Stay tuned as we bring these building blocks together to create our language model! 🚀
Exercise for the Reader
Before moving to LLM training Part 3, try these exercises:
- Experiment with tokenizing different types of text (code, poetry, technical writing)
- Create a custom dataset from your favourite book or article
- Visualize the distribution of token lengths in your dataset
Happy coding! Check out our GitHub repository for the complete implementation of these functions and classes.
For readers interested in the efficiency and practicality of smaller AI models, check out our article on Small Language Models: A Breakthrough in AI Sustainability. It explores how these lightweight alternatives deliver impressive results with significantly fewer resources — perfect for understanding the future direction of sustainable AI development!