Transformers, the tech behind LLMs | Deep Learning Chapter 5
A visual, ground-up explanation of how data flows through a Transformer: from tokenization and word embeddings, through attention blocks and MLP layers, to final probability prediction. Uses GPT-3's specific architecture numbers to make the abstract concrete, and lays the conceptual groundwork needed to understand the attention mechanism.
Key Concepts
Notes
§How LLMs Generate Text
- Model takes in a text snippet, outputs a probability distribution over possible next tokens
- To generate: sample from that distribution → append to text → repeat
- This is exactly what happens when ChatGPT produces one word at a time
- GPT-2 (small) produces incoherent stories; GPT-3 (same architecture, ~100× larger) produces coherent ones
- Chatbot behavior is achieved by prepending a system prompt that establishes the "helpful AI assistant" context, then letting the model autocomplete dialogue
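The sample, append, repeat loop can be sketched in a few lines. This is only a toy under stated assumptions: `toy_next_token_distribution` is a hypothetical stand-in for a trained model (a hand-written lookup table that sees only the last token), and whole words are treated as tokens.

```python
import random

def toy_next_token_distribution(tokens):
    """Hypothetical stand-in for a trained model: returns {token: probability}.

    A real model would use the whole context; this toy looks only at the last token.
    """
    table = {
        "the": {"cat": 0.6, "dog": 0.4},
        "cat": {"sat": 0.7, "ran": 0.3},
        "dog": {"sat": 0.5, "ran": 0.5},
        "sat": {"the": 1.0},
        "ran": {"the": 1.0},
    }
    return table[tokens[-1]]

def generate(prompt_tokens, n_new, rng):
    """Repeatedly sample a next token from the distribution and append it."""
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        dist = toy_next_token_distribution(tokens)
        candidates = list(dist.keys())
        weights = list(dist.values())
        # Sample one token in proportion to its probability.
        next_token = rng.choices(candidates, weights=weights, k=1)[0]
        tokens.append(next_token)
    return tokens

rng = random.Random(0)  # seeded so a run is repeatable
print(" ".join(generate(["the"], 6, rng)))
```

Swapping the lookup table for a real model changes only the function that produces the distribution; the outer loop stays the same.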
§High-Level Data Flow Through a Transformer
§Deep Learning Background
- Input: always an array (tensor) of real numbers
- Layers: data is progressively transformed through many intermediate arrays
- Parameters (weights): learned during training; interact with data only via weighted sums (i.e., matrix-vector multiplication)
- Backpropagation: the training algorithm; for it to work well at scale, models must follow a specific format (arrays of real numbers transformed by weighted sums)
- GPT-3 has 175 billion weights organized into ~28,000 distinct matrices across 8 categories
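To make "weights interact with data only via weighted sums" concrete, here is a minimal sketch of a matrix-vector product in plain Python. The matrix and vector values are made up for illustration; `matvec` is a hypothetical helper name.

```python
def matvec(weights, data):
    """Multiply a matrix (a list of rows) by a vector.

    Each output entry is a weighted sum of the data entries,
    with the weights taken from one row of the matrix.
    """
    return [sum(w * x for w, x in zip(row, data)) for row in weights]

weights = [[1.0, 0.0, -1.0],   # learned parameters (fixed after training)
           [0.5, 0.5, 0.5]]
data = [2.0, 4.0, 6.0]         # the input being processed on this run

print(matvec(weights, data))   # [-4.0, 6.0]
```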
§Tokenization and Embedding
- GPT-3 vocabulary: 50,257 tokens
- Embedding dimension: 12,288
- Embedding matrix size: 50,257 × 12,288 ≈ 617 million weights
- Each token is initially just its column from the embedding matrix — no context yet
- The network's job is to progressively enrich each vector with contextual meaning
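A rough sketch of the embedding lookup, assuming a four-word toy vocabulary and made-up 3-dimensional vectors (GPT-3 uses 12,288 dimensions and subword tokens, not whole words). The names `token_to_id`, `embedding_columns`, and `embed` are illustrative only.

```python
vocab = ["the", "king", "queen", "sat"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

# embedding_columns[i] is the column of the embedding matrix for token i.
# Values are invented; in a real model they are learned during training.
embedding_columns = [
    [0.1, 0.0, 0.2],   # "the"
    [0.9, 0.8, 0.1],   # "king"
    [0.9, -0.8, 0.1],  # "queen"
    [0.0, 0.1, 0.7],   # "sat"
]

def embed(tokens):
    """Look up one vector per token; no context is mixed in at this stage."""
    return [embedding_columns[token_to_id[t]] for t in tokens]

print(embed(["the", "king", "sat"]))

# Size of the real embedding matrix, using the GPT-3 numbers from the notes.
n_vocab, d_embed = 50_257, 12_288
print(n_vocab * d_embed)  # 617558016
```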
§Semantic Structure of the Embedding Space
- Similar words land near each other in the high-dimensional space
- Directions carry semantic meaning (not just individual points)
- Classic example: king − man + woman ≈ queen
- Other examples: Italy − Germany + Hitler ≈ Mussolini; Germany − Japan + Sushi ≈ Bratwurst
- Dot product measures alignment between vectors:
- Positive → vectors point in similar directions
- Zero → perpendicular
- Negative → opposite directions
- Useful for testing semantic directions (e.g., a "plurality direction": cats − cat)
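The vector arithmetic and the dot-product sign test can be tried on toy data. The 2-dimensional embeddings below are invented so that the first coordinate loosely means "royalty" and the second "gender"; real embeddings are learned and far less tidy, so this shows only the mechanics.

```python
def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hand-made toy embeddings: [royalty, gender]
emb = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
}

# king - man + woman, then find the closest word in the toy vocabulary.
result = add(sub(emb["king"], emb["man"]), emb["woman"])
nearest = min(emb, key=lambda w: squared_distance(result, emb[w]))
print(result, nearest)  # [1.0, -1.0] queen

# A direction as a difference of two embeddings, tested with dot products.
gender_direction = sub(emb["man"], emb["woman"])  # [0.0, 2.0]
print(dot(gender_direction, emb["king"]))   # 2.0  (positive: aligned)
print(dot(gender_direction, [1.0, 0.0]))    # 0.0  (perpendicular)
print(dot(gender_direction, emb["queen"]))  # -2.0 (negative: opposed)
```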
§Context Size
- GPT-3 context size: 2,048 tokens
- Data flowing through the network: an array of 2,048 columns, each a 12,288-dimensional vector
- Exceeding context size = model "forgets" earlier parts of conversation
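A small sketch of why a fixed context size leads to "forgetting". It assumes the simplest possible policy, dropping the oldest tokens; the notes do not say how any particular chatbot handles overflow, so `clip_to_context` is purely illustrative.

```python
CONTEXT_SIZE = 2_048   # GPT-3 context size from the notes
D_EMBED = 12_288       # GPT-3 embedding dimension from the notes

def clip_to_context(token_ids, context_size=CONTEXT_SIZE):
    """Keep only the most recent tokens (assumed policy: drop the oldest)."""
    return token_ids[-context_size:]

conversation = list(range(3_000))      # stand-in for 3,000 token ids
visible = clip_to_context(conversation)

print(len(visible), visible[0])        # 2048 952
# Number of real values in the array flowing through the network:
print(CONTEXT_SIZE * D_EMBED)          # 25165824
```

Tokens 0 through 951 never reach the model, so nothing in its output can depend on them.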
§The Unembedding Matrix and Final Prediction
- A second matrix maps the last vector in the context to 50,257 logits (one per token)
- During training, every vector in the final layer simultaneously predicts the token that comes after its own position, so one pass over a text yields many predictions instead of one, which is more efficient
- Unembedding matrix: 50,257 × 12,288 ≈ 617 million more weights
- Running total so far: ~1.2 billion (out of 175 billion total)
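A toy sketch of the unembedding step, followed by the running parameter count. The 3-token vocabulary, the 2-dimensional vectors, and the matrix entries are all invented for illustration; only the GPT-3 sizes at the bottom come from the notes.

```python
vocab = ["cat", "dog", "sat"]

# One row per vocabulary token, one column per embedding dimension (made-up values).
unembedding = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]

# Final-layer vectors, one per position in the context (made-up values).
context_vectors = [[0.2, 0.1], [0.5, -0.5], [2.0, 1.0]]

# For next-token prediction, only the last vector is used.
last = context_vectors[-1]
logits = [sum(w * x for w, x in zip(row, last)) for row in unembedding]
print(logits)                            # [2.0, 1.0, 3.0]
print(vocab[logits.index(max(logits))])  # sat

# Running parameter count with the GPT-3 sizes from the notes.
n_vocab, d_embed = 50_257, 12_288
embedding_params = n_vocab * d_embed
unembedding_params = n_vocab * d_embed
print(embedding_params + unembedding_params)  # 1235116032
```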
§Softmax
- Converts arbitrary real numbers (logits) into a valid probability distribution (values ∈ [0,1], sum = 1)
- Mechanics: exponentiate each value → divide each by the sum
- Largest logit dominates but smaller values still get weight — "softer than argmax"
- Temperature T: inserted as a divisor in the exponent
- Higher T → more uniform distribution → more creative / risky outputs
- Lower T → distribution more peaked → more predictable outputs
- T = 0 (understood as the limit T → 0, since dividing by zero is undefined) → always picks the single most probable token (deterministic)
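A minimal sketch of softmax with temperature, using only the standard library. Two choices here are assumptions rather than anything stated in the notes: the maximum is subtracted before exponentiating (a common numerical-stability trick that does not change the result), and T = 0 is handled as a special case that puts all the weight on the largest logit.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities: exponentiate (after dividing by T), then normalize."""
    if temperature == 0:
        # Limit case: all probability on the largest logit (first one wins ties).
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 2.0, 3.0]
for t in (5.0, 1.0, 0.2, 0):
    print(t, [round(p, 3) for p in softmax(logits, t)])
```

Running the loop shows the distribution flattening as T grows and sharpening toward a single token as T shrinks.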
Actionable Takeaways
- When reading about Transformers, mentally separate weights (learned, static during inference) from data (the changing vectors flowing through) — they play fundamentally different roles
- To build intuition for attention, first get comfortable with dot products as similarity measures and softmax as a normalization tool — both appear repeatedly inside attention blocks
- Use the GPT-3 parameter count as a sanity-check scaffold: embedding matrix (~617M) + unembedding matrix (~617M) account for ~1.2B of 175B total; the remaining ~174B live in attention and MLP layers
- Think of each token vector not as a fixed word representation but as a mutable container that accumulates contextual meaning as it passes through the network
Quotes Worth Keeping
You should think of them as having the capacity to soak in context. A vector that started its life as the embedding of the word King might progressively get tugged and pulled by various blocks in this network so that by the end it points in a much more specific and nuanced direction.
You should draw a very sharp distinction in your mind between the weights of the model — the actual brains, the things learned during training — and the data being processed, which simply encodes whatever specific input is fed into the model for a given run.
It really doesn't feel like this should actually work.