Understanding Transformers — The Architecture Behind Modern LLMs
The Transformer architecture is the backbone of modern Large Language Models (LLMs) like GPT, BERT, and T5.
Introduced in the landmark paper “Attention Is All You Need” (Vaswani et al., 2017), it replaced RNNs and CNNs in sequence modeling with a highly parallelizable self-attention mechanism.
This post breaks down the architecture and all its nitty-gritty components.
🚀 Motivation
Before Transformers, we had RNNs and LSTMs for sequence modeling.
But they suffered from:
- Sequential processing (slow, hard to parallelize)
- Vanishing/exploding gradients
- Difficulty handling long-range dependencies
Transformers solved these with attention, allowing models to look at all tokens simultaneously.
🧩 Transformer Components
At a high level, the Transformer consists of an Encoder and a Decoder.
Encoder
- Takes the input sequence (e.g., words, tokens).
- Builds contextualized embeddings.
Decoder
- Generates the output sequence step by step.
- Used in tasks like machine translation or text generation.
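As a quick orientation before digging into the individual pieces, here is a minimal sketch using PyTorch's built-in `torch.nn.Transformer` module; the hyperparameters and tensor shapes below are purely illustrative, not those of any particular model:

```python
import torch

# A tiny encoder-decoder Transformer (illustrative sizes, not a real model).
model = torch.nn.Transformer(
    d_model=128,            # embedding dimension
    nhead=4,                # attention heads
    num_encoder_layers=2,
    num_decoder_layers=2,
    batch_first=True,       # tensors are (batch, seq_len, d_model)
)

src = torch.randn(1, 10, 128)   # encoder input: already-embedded source tokens
tgt = torch.randn(1, 7, 128)    # decoder input: already-embedded target prefix
out = model(src, tgt)           # (1, 7, 128): one contextual vector per target position
print(out.shape)
```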
🔑 Input Embeddings & Positional Encoding
Since Transformers don’t have recurrence like RNNs, they need a way to encode word order.
They use positional encodings (sinusoidal or learned vectors).
```python
import torch

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings from Vaswani et al. (2017)."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(d_model, dtype=torch.float32).unsqueeze(0)     # (1, d_model)
    # Each pair of dimensions shares a frequency: 1 / 10000^(2i / d_model)
    angle_rates = 1 / torch.pow(10000, (2 * (i // 2)) / d_model)
    angle_rads = pos * angle_rates                                  # (seq_len, d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle_rads[:, 0::2])  # even indices: sine
    pe[:, 1::2] = torch.cos(angle_rads[:, 1::2])  # odd indices: cosine
    return pe
```
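In practice these vectors are simply added to the token embeddings before the first layer, for example (the shapes below are arbitrary):

```python
token_embeddings = torch.randn(10, 64)               # (seq_len, d_model), stand-in for real embeddings
x = token_embeddings + positional_encoding(10, 64)   # order information is now baked into x
```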
🌀 Self-Attention
Self-attention is the Transformer's core innovation: each token attends to every other token (including itself) to build a contextualized representation of the sequence.
The formula:
\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V\]

Where:
- Q = Queries
- K = Keys
- V = Values
- d_k = dimension of the key (and query) vectors; dividing by √d_k keeps the dot products from growing so large that the softmax saturates
Example:
If the sentence is “The cat sat on the mat”,
- The word “cat” can attend strongly to “sat”, “mat”, etc.
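The formula translates almost line-for-line into code. Here is a minimal sketch of scaled dot-product attention for a single sequence; the random tensors stand in for learned projections of real token embeddings:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (..., seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ v, weights

# Toy example: 6 tokens ("The cat sat on the mat"), 16-dimensional vectors.
q = k = v = torch.randn(6, 16)
out, weights = scaled_dot_product_attention(q, k, v)
print(out.shape, weights.shape)   # torch.Size([6, 16]) torch.Size([6, 6])
```

Row i of `weights` tells you how strongly token i attends to each of the other tokens.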
🎭 Multi-Head Attention
Instead of one attention mechanism, we use multiple in parallel (heads).
This allows the model to learn different types of relationships.
```python
class MultiHeadAttention(torch.nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.qkv_proj = torch.nn.Linear(d_model, 3 * d_model)  # fused Q, K, V projection
        self.o_proj = torch.nn.Linear(d_model, d_model)        # output projection

    def forward(self, x):
        B, T, C = x.size()                                       # batch, seq_len, d_model
        qkv = self.qkv_proj(x).reshape(B, T, self.num_heads, 3 * self.d_head)
        q, k, v = qkv.chunk(3, dim=-1)                           # each (B, T, heads, d_head)
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))         # (B, heads, T, d_head)
        attn = (q @ k.transpose(-2, -1)) / (self.d_head ** 0.5)  # (B, heads, T, T)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, C)        # concatenate the heads
        return self.o_proj(out)
```
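Here is a quick sanity check of the module above on random data; the sizes are arbitrary:

```python
mha = MultiHeadAttention(d_model=128, num_heads=4)
x = torch.randn(2, 10, 128)   # (batch, seq_len, d_model)
print(mha(x).shape)           # torch.Size([2, 10, 128]): same shape in and out
```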
🔄 Feed Forward Networks (FFN)
Each layer also contains a position-wise feed-forward network (FFN): two linear layers with a ReLU or GELU activation in between, applied independently at every position.
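A minimal sketch of such an FFN, reusing the `torch` import from above; the 4x hidden expansion matches the original paper (d_ff = 2048 for d_model = 512), but other ratios are common:

```python
class FeedForward(torch.nn.Module):
    """Position-wise FFN: Linear -> activation -> Linear, applied at every position."""
    def __init__(self, d_model, d_ff=None):
        super().__init__()
        d_ff = d_ff or 4 * d_model               # 4x expansion by default (an assumption)
        self.net = torch.nn.Sequential(
            torch.nn.Linear(d_model, d_ff),
            torch.nn.GELU(),                     # ReLU in the original paper; GELU in many LLMs
            torch.nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        return self.net(x)
```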
⚖️ Layer Normalization & Residuals
- Residual (skip) connections around each sub-layer give gradients a direct path through deep stacks, which stabilizes training.
- Layer normalization keeps activations in a consistent range, which smooths optimization.
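Putting the pieces together, here is a sketch of one encoder block that reuses the `MultiHeadAttention` and `FeedForward` modules defined above. It uses the pre-norm arrangement (LayerNorm before each sub-layer), which is common in modern LLMs; the original paper applied LayerNorm after the residual addition (post-norm):

```python
class TransformerBlock(torch.nn.Module):
    """One encoder block: attention and FFN sub-layers, each wrapped in LayerNorm + residual."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.ln1 = torch.nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, num_heads)
        self.ln2 = torch.nn.LayerNorm(d_model)
        self.ffn = FeedForward(d_model)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # residual around the (pre-norm) attention sub-layer
        x = x + self.ffn(self.ln2(x))    # residual around the (pre-norm) FFN sub-layer
        return x
```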
📚 Scaling Laws & Training
Recent research shows that Transformer performance improves predictably with:
- Model size (parameters)
- Dataset size
- Compute
This observation led to the empirical scaling laws that have guided GPT-3, PaLM, and other frontier models.
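As a rough illustration, the loss was found to fall off as a power law in model size and data size (the schematic form below follows Kaplan et al., 2020; the exact constants depend on the setup and are omitted here):

\[L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}\]

where N is the number of parameters, D the number of training tokens, and N_c, D_c, α_N, α_D are empirically fitted constants.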
✅ Checklist for Understanding
- Why RNNs struggled
- Transformer Encoder-Decoder overview
- Positional Encoding
- Self-Attention
- Multi-Head Attention
- Feed Forward Networks
- Residuals + LayerNorm
“Attention is not just a mechanism, it’s the foundation of intelligence in sequence models.”
— Inspired by Vaswani et al., 2017
🖼️ Diagram
The classic encoder-decoder architecture diagram from Vaswani et al. (2017) fits here.
⚡ Final Thoughts
The Transformer revolutionized deep learning.
From machine translation to LLMs like GPT-4, Claude, and LLaMA, its parallelizable, attention-first design is a core reason behind today's AI boom.