LLMs: Architecture & Applications

📅 May 2025 ✍️ Sudhir Mishra ⏱️ 22 min read

Executive Summary

Large Language Models (LLMs) are the defining technology of this era of AI. In under a decade, language models went from generating plausible-but-meaningless text to passing professional exams, writing production code, and engaging in sophisticated multi-step reasoning.

How did this happen? The answer is three interlocking breakthroughs: the transformer architecture, massive scale (of data, compute, and parameters), and alignment techniques that make raw capability usable.

This paper builds up the full picture — from the attention mechanism to scaling laws to real-world deployment — so you understand not just what LLMs can do, but why they work.

1. A Brief History: From RNNs to Transformers

The Recurrent Era

Before transformers, sequence modeling was dominated by Recurrent Neural Networks (RNNs) and their gated variants — LSTMs and GRUs. RNNs processed text token by token, maintaining a hidden state that summarized everything seen so far.

The fundamental problem: vanishing gradients. When backpropagating through hundreds or thousands of time steps, gradients tend to shrink exponentially (or, less often, explode). Gated variants such as LSTMs eased this but did not remove it, so in practice RNNs struggled to learn long-range dependencies: what happened 500 tokens ago might as well not exist.
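To see how quickly this compounds, consider a deliberately simplified scalar picture in which every backward step multiplies the gradient by the same factor. The factors 0.9 and 1.1 are assumptions chosen for illustration; in a real RNN the per-step factor is a matrix that changes at every step.

```python
def gradient_scale(per_step_factor: float, steps: int) -> float:
    """Magnitude left after multiplying by the same factor at every time step."""
    return per_step_factor ** steps


# A factor slightly below 1 vanishes; a factor slightly above 1 explodes.
for steps in (10, 100, 500):
    print(steps, gradient_scale(0.9, steps), gradient_scale(1.1, steps))
```

After 500 steps the shrinking case is far below anything a float32 optimizer update can use, which is the practical meaning of "might as well not exist".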

Attention mechanisms were introduced as a patch: instead of relying entirely on the hidden state, the model could "attend" back to any previous position. This worked well, but the recurrent backbone remained a bottleneck: each step depends on the previous hidden state, so computation couldn't be parallelized across the sequence.

The Transformer Breakthrough (2017)

In 2017, Vaswani et al. published "Attention Is All You Need", introducing the Transformer — a model that replaced recurrence entirely with attention.

The key insight: if attention lets you relate any two positions in a sequence, you don't need recurrence at all. Every position can attend to every other position in parallel, which makes transformers well suited to GPU acceleration.

This single architectural change unlocked the scaling that made modern LLMs possible.

2. Transformer Architecture: How It Works

The Big Picture

A standard transformer consists of:

An embedding layer that converts tokens to dense vectors
Positional encodings that inject sequence order
A stack of transformer blocks, each containing:
- Multi-head self-attention
- Feed-forward network (FFN)
- Layer normalization and residual connections
A final output projection to vocabulary logits (for language models)

Modern LLMs typically use a decoder-only architecture (no encoder), trained autoregressively: predict the next token given all previous tokens.
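Autoregressive generation can be sketched as a loop that repeatedly asks the model for next-token scores and appends the winner. This is a minimal sketch under stated assumptions: `toy_model` is a hypothetical stand-in for a real decoder, and greedy selection is only one of several decoding strategies.

```python
def greedy_generate(next_token_logits, prompt_ids, max_new_tokens, eos_id=None):
    """Append the highest-scoring next token until the budget or EOS is reached."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)  # one score per vocabulary entry
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids


def toy_model(ids, vocab_size=5):
    """Hypothetical model: always favors (last token + 1) mod vocab_size."""
    logits = [0.0] * vocab_size
    logits[(ids[-1] + 1) % vocab_size] = 1.0
    return logits


print(greedy_generate(toy_model, [0], max_new_tokens=4))
```

Each new token is conditioned on everything generated so far, which is why generation is sequential even though training is parallel across positions.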

2.1 Tokenization

Before any neural network computation, text is converted to tokens — sub-word units typically generated by Byte Pair Encoding (BPE) or SentencePiece.

Common tokenizers might split "unbelievable" into ["un", "believ", "able"]. English text averages roughly 1.3 tokens per word, so a 4,000-word essay becomes ~5,000 tokens.

Why sub-word? Whole-word vocabularies are too large; character vocabularies are too granular. Sub-words balance vocabulary size with sequence length.
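The core of BPE training is a loop that repeatedly merges the most frequent adjacent pair of symbols. The sketch below is a toy version under stated assumptions: the word counts in `toy_counts` are made up, ties are broken by first occurrence, and production tokenizers add byte-level handling and pre-tokenization that are omitted here.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus_counts, num_merges):
    """Start from characters and apply `num_merges` greedy merges."""
    words = {tuple(word): count for word, count in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_counts, num_merges=4)
print(merges)
print(segmented)
```

Frequent fragments such as "est" and "low" become single symbols after a few merges, while rarer words stay split into smaller pieces. That is the vocabulary-size versus sequence-length trade-off in miniature.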

2.2 Embeddings and Positional Encoding

Each token is mapped to a high-dimensional embedding vector (e.g., 4096 dimensions in LLaMA-3 8B). This learned representation captures semantic meaning: words with similar meanings cluster in embedding space.

Since self-attention on its own is permutation-equivariant (reordering the input tokens just reorders the outputs, so the model cannot tell which token came first), we must explicitly inject positional information. Two approaches dominate:

Absolute Positional Encoding: Fixed sinusoidal functions added to embeddings. Simple but struggles to generalize to sequences longer than those seen in training.

Rotary Position Embedding (RoPE): Encodes position relative to other tokens by rotating query and key vectors. Used by LLaMA, Mistral, and most modern open-source models. Generalizes better to longer contexts.
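The "relative" property of rotary embeddings can be checked numerically: after rotation, the query-key dot product depends only on the offset between the two positions. The sketch below assumes the interleaved pairing of dimensions and a base of 10000; real implementations differ in how they pair dimensions, and the vectors `q` and `k` are arbitrary illustrative values.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

score_near = dot(rope(q, 5), rope(k, 2))      # positions 5 and 2, offset 3
score_far = dot(rope(q, 105), rope(k, 102))   # same offset, shifted by 100
print(score_near, score_far)
```

The two scores agree up to floating-point error, because rotating both vectors by the same extra angle leaves their dot product unchanged.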

2.3 Self-Attention: The Core Mechanism

Self-attention is the heart of the transformer. For each token, it asks: "which other tokens in this sequence are most relevant to understanding me?"

Formally, given input vectors, we compute three projections:

Q (Query): What am I looking for?
K (Key): What do I contain?
V (Value): What information do I provide?

The attention score between two positions is the dot product of one position's query with the other's key, scaled by √d_k and normalized with a softmax across positions:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

This produces a weighted sum of values, where positions most relevant to the current token get the most weight.
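The formula can be written out directly for a single head. This is a minimal sketch using plain Python lists rather than a tensor library; it takes Q, K, and V as already projected, and the `causal` flag is an added assumption that shows how a decoder-only model hides future positions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V, one output row per query."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Position i may only look at positions 0..i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out


Q = K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
full = attention(Q, K, V)
masked = attention(Q, K, V, causal=True)
print(full)
print(masked)
```

With the causal mask, the first token can only attend to itself, so its output is exactly its own value vector.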

Multi-head attention runs this process in parallel with different learned Q/K/V projections (say, 32 "heads"), allowing the model to attend to different aspects simultaneously — syntax in one head, coreference in another, semantics in a third.

Quadratic Complexity

Standard self-attention is O(n²) in sequence length — attending from every token to every other token. For a 128K-token context, this means ~16 billion attention pairs. Techniques like FlashAttention make this computationally feasible by fusing operations and avoiding materializing the full attention matrix in memory.
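The arithmetic behind that figure is worth making explicit. The sketch below assumes 2 bytes per score (16-bit floats) and counts a single head in a single layer; both are illustrative assumptions.

```python
def attention_pairs(n_tokens: int) -> int:
    """Number of query-key pairs in full self-attention."""
    return n_tokens * n_tokens


def score_matrix_bytes(n_tokens: int, bytes_per_score: int = 2) -> int:
    """Memory for one full n x n score matrix (one head, one layer)."""
    return attention_pairs(n_tokens) * bytes_per_score


n = 128_000
pairs = attention_pairs(n)
gib = score_matrix_bytes(n) / 2**30
print(f"{pairs:,} pairs, about {gib:.1f} GiB per head per layer")
```

Storing that matrix for every head and layer is clearly infeasible, which is why avoiding its materialization matters so much.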

2.4 Feed-Forward Network

After attention, each position passes through a position-wise FFN — a simple two-layer MLP applied independently to each token vector:

FFN(x) = max(0, xW₁ + b₁)W₂ + b₂

Despite its simplicity, the FFN contains the majority of a transformer's parameters. Research suggests FFNs act as key-value memories — storing factual associations learned during pre-training.
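A small sketch makes both points concrete: the FFN formula itself, and a rough parameter count per block. The count assumes a hidden width of four times the model width (the ratio used in the original Transformer) and ignores attention biases and normalization parameters; models with gated FFN variants use different ratios.

```python
def matvec(x, W):
    """Row vector times matrix: (x W)[j] = sum_i x[i] * W[i][j]."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to one token vector."""
    hidden = [max(0.0, h + b) for h, b in zip(matvec(x, W1), b1)]
    return [o + b for o, b in zip(matvec(hidden, W2), b2)]


def block_param_counts(d_model, d_ff):
    """Approximate parameters in the attention and FFN parts of one block."""
    attn = 4 * d_model * d_model                      # Q, K, V, output projections
    ffn_params = 2 * d_model * d_ff + d_ff + d_model  # W1, W2, b1, b2
    return attn, ffn_params


W1 = [[1.0, -1.0], [0.0, 2.0]]
W2 = [[1.0], [1.0]]
print(ffn([1.0, -1.0], W1, [0.0, 0.0], W2, [0.5]))

attn, ffn_params = block_param_counts(4096, 4 * 4096)
print(attn, ffn_params, ffn_params / (attn + ffn_params))
```

Under these assumptions the FFN holds about two thirds of each block's parameters.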

2.5 Layer Norm and Residual Connections

Each sub-layer (attention and FFN) is wrapped with:

Residual connections: The input is added back to the output, creating a "highway" for gradient flow
Layer normalization: Stabilizes training by normalizing activations

Modern models use Pre-LN (normalize before the sub-layer) rather than the original Post-LN, which improves training stability at large scales.
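The difference between the two orderings is easiest to see side by side. This sketch omits the learned gain and bias of layer normalization, and `sublayer` stands for either attention or the FFN; both simplifications are assumptions made to keep the example short.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def pre_ln(x, sublayer):
    """Pre-LN: normalize the input to the sub-layer, keep the residual path raw."""
    return add(x, sublayer(layer_norm(x)))


def post_ln(x, sublayer):
    """Post-LN: add the residual first, then normalize the sum."""
    return layer_norm(add(x, sublayer(x)))


def zero_sublayer(v):
    return [0.0] * len(v)


x = [1.0, 2.0, 3.0]
print(pre_ln(x, zero_sublayer))   # input passes through untouched
print(post_ln(x, zero_sublayer))  # input is rescaled even when the sub-layer adds nothing
```

In the Pre-LN form the residual path is a pure identity, so signals and gradients can travel through many layers without being rescaled at each one.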

3. Pre-Training: Learning from the Internet

Next-Token Prediction

LLMs are trained with a deceptively simple objective: predict the next token. Given a sequence, the model learns to assign high probability to the actual next token and low probability to everything else.

This is self-supervised: no human labels are required, because the "label" for any token is just the next token in the document. This is why almost any text on the internet can serve as training data.
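In code, the objective is a cross-entropy loss where the targets are the input sequence shifted by one position. This is a minimal sketch: `logits_per_position` is assumed to hold one list of vocabulary scores for each position that has a successor, and batching is omitted.

```python
import math


def next_token_loss(logits_per_position, token_ids):
    """Average cross-entropy of predicting token t+1 from the logits at position t."""
    targets = token_ids[1:]  # the label for each position is simply the next token
    total = 0.0
    for logits, target in zip(logits_per_position, targets):
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[target]  # equals -log softmax(logits)[target]
    return total / len(targets)


# A model with no preference over a 4-token vocabulary scores log(4) per token.
uniform = [[0.0, 0.0, 0.0, 0.0]] * 2
print(next_token_loss(uniform, [0, 1, 2]))
```

Any increase in the probability assigned to the true next token lowers this loss, so every regularity in the text that helps prediction is rewarded.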

Despite its simplicity, next-token prediction forces the model to learn:

Grammar and syntax (to predict the next word)
Factual knowledge (to predict the right noun/name)
Reasoning (to predict the logical next step)
Style and tone (to match the document's register)

Language modeling is perhaps the richest self-supervised objective ever devised.

Training Data and Scale

Modern frontier models train on trillions of tokens from:

Web crawls (CommonCrawl, C4)
Books and academic papers
Code repositories (GitHub)
Wikipedia, news, forums

Data quality matters enormously. A smaller dataset of cleaner, more diverse text often outperforms a larger noisy one. Deduplication, quality filtering, and careful domain mixing are active research areas.

4. Scaling Laws: Why Bigger Works

One of the most striking empirical findings in deep learning is that performance scales predictably with compute, data, and parameters.

The Chinchilla Result

Hoffmann et al. (2022) showed that most models at the time were under-trained — they used too many parameters relative to training data. The "Chinchilla" scaling laws state:

For a given compute budget, optimal performance is achieved with roughly 20 tokens of training data per parameter.

This means a 70B-parameter model should train on ~1.4 trillion tokens — much more than GPT-3 (which trained a 175B model on 300B tokens). Chinchilla-optimal models are smaller but train longer, giving better performance per parameter.
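The numbers above follow from simple arithmetic. The sketch below also estimates training compute with the common approximation of 6 FLOPs per parameter per token; that approximation is an assumption brought in for illustration and is not stated in the text above.

```python
TOKENS_PER_PARAM = 20  # rule of thumb quoted above


def chinchilla_tokens(n_params: int) -> int:
    """Roughly compute-optimal number of training tokens for a model size."""
    return TOKENS_PER_PARAM * n_params


def training_flops(n_params: int, n_tokens: int) -> int:
    """Approximate training compute, assuming ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens


n_70b = 70_000_000_000
tokens_70b = chinchilla_tokens(n_70b)
gpt3_tokens_per_param = 300_000_000_000 / 175_000_000_000

print(f"70B model: {tokens_70b:,} tokens")
print(f"GPT-3: {gpt3_tokens_per_param:.2f} tokens per parameter")
print(f"70B training compute: {training_flops(n_70b, tokens_70b):.2e} FLOPs")
```

GPT-3 saw fewer than 2 tokens per parameter, about a tenth of the ratio the rule of thumb suggests.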

Power Laws

LLM loss follows a power law relationship with scale:

L(N, D) = E + A/N^α + B/D^β

Where N = parameters, D = training tokens, E is the irreducible loss, and A, B, α, β are constants fitted to experiments (this is the parametric form fitted by Hoffmann et al.). This means performance improvements are predictable: you can extrapolate how much better a 10× larger model will be before you train it. This property has been transformative for lab planning and investment decisions.
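The extrapolation step can be sketched as a straight-line fit in log-log space. To keep the example short it uses a single-variable power law in N, ignoring the data term and any constant offset, and the three "measurements" are synthetic values generated from made-up constants rather than real training results.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of log L = log a - alpha * log N."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    a = math.exp(y_mean - slope * x_mean)
    return a, -slope


def predict(a, alpha, n):
    """Loss predicted by L(N) = a * N^(-alpha)."""
    return a * n ** (-alpha)


# Synthetic small-scale runs (illustrative constants a=10, alpha=0.1).
sizes = [1e7, 1e8, 1e9]
losses = [predict(10.0, 0.1, n) for n in sizes]

a, alpha = fit_power_law(sizes, losses)
predicted_10x = predict(a, alpha, 1e10)  # a model 10x larger than any run so far
print(alpha, predicted_10x)
```

With real measurements the fit is noisy and the curve eventually flattens toward an irreducible loss, so extrapolations are usually trusted only within an order of magnitude or two.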

5. Alignment: From Raw Capability to Useful Assistant

A pre-trained LLM is a text completion engine. Ask it a question and it might generate more questions — because the training data is full of Q&A forums where questions follow questions.

Making the model helpful, harmless, and honest requires alignment training.

Instruction Fine-Tuning (SFT)

The first step is Supervised Fine-Tuning on instruction-response pairs. Curated datasets of (instruction, good response) examples teach the model the expected format for following instructions.

Even a small amount of high-quality SFT data dramatically improves instruction following. The LIMA paper showed that 1,000 carefully curated examples could match models trained on much larger SFT datasets.

RLHF: Reinforcement Learning from Human Feedback

SFT teaches format; RLHF teaches quality. The process:

Collect preference data: Show human raters pairs of model outputs; they label which is better
Train a reward model: A separate model learns to predict human preference scores
RL fine-tuning: Use PPO (or similar) to optimize the LLM to maximize reward model score, while staying close to the SFT model, typically through a KL-divergence penalty, to limit reward hacking
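Two small formulas sit at the heart of these steps. The sketch below assumes the commonly used pairwise (Bradley-Terry) loss for the reward model and a per-sample penalty based on the log-probability gap between the policy and the SFT model; the function names and the value of `beta` are illustrative, and the PPO update itself is omitted.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen, r_rejected):
    """Pairwise loss: small when the preferred output gets the higher score."""
    return -log_sigmoid(r_chosen - r_rejected)


def kl_shaped_reward(reward, logp_policy, logp_sft, beta=0.1):
    """Reward minus a penalty for drifting away from the SFT model on this sample."""
    return reward - beta * (logp_policy - logp_sft)


print(reward_model_loss(2.0, 0.0))   # rater's choice scored higher: low loss
print(reward_model_loss(0.0, 2.0))   # rater's choice scored lower: high loss
print(kl_shaped_reward(1.0, logp_policy=-2.0, logp_sft=-5.0))
```

The penalty term is what keeps the policy from wandering into regions where the reward model's scores are no longer trustworthy.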

RLHF is what made ChatGPT feel like a qualitatively different product from GPT-3. It's also expensive and finicky — the reward model can be gamed, and PPO at scale is unstable.

Direct Preference Optimization (DPO)

DPO (Rafailov et al., 2023) skips the explicit reward model, directly optimizing the LLM on preference pairs. It's simpler and more stable to train, and has become one of the most widely used alignment techniques for open-source models.
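For a single preference pair, the DPO objective reduces to a logistic loss on how much more the policy prefers the chosen response than a frozen reference model does. This is a minimal sketch: the inputs are assumed to be summed log-probabilities of whole responses, and `beta=0.1` is an illustrative value.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * margin), where margin compares policy and reference."""
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    z = beta * margin
    # Stable evaluation of -log(sigmoid(z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Policy identical to the reference: no preference learned yet.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has shifted probability toward the chosen response: lower loss.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

Because the reference log-probabilities appear inside the loss, the closeness constraint is built into the objective rather than added as a separate penalty.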

6. Key Models: A Timeline

2018

GPT-1 (OpenAI): 117M parameters. Demonstrated that pre-training + fine-tuning works.

2019

GPT-2 (OpenAI): 1.5B parameters. Fluent enough at text generation that OpenAI initially withheld the full model over misuse concerns, releasing it in stages.

2020

GPT-3 (OpenAI): 175B parameters. Few-shot prompting emerged as a capability. Launched the API economy.

2022

ChatGPT (OpenAI): GPT-3.5 + RLHF. An estimated 100M users in 2 months. Changed the world.

2023

GPT-4, Claude 2, LLaMA-2: Multimodality, longer contexts, open-source proliferation.

2024

Claude 3, GPT-4o, Claude 3.5, Gemini 1.5, LLaMA-3: 1M+ token contexts, native multimodality, frontier open-source models, reasoning benchmarks.

2025

Claude 3.7, Gemini 2: Extended thinking, agentic integration.

7. Applications

Modern LLMs power a remarkable breadth of applications:

Category	Examples	Key Capability Used
Coding	GitHub Copilot, Cursor	Code generation, completion, debugging
Writing	Jasper, Notion AI	Long-form generation, editing, summarization
Search	Perplexity, Bing AI	Information retrieval + synthesis
Customer Support	Intercom, Zendesk	Intent understanding, knowledge retrieval
Healthcare	Nuance DAX	Clinical note generation, summarization
Legal	Harvey, Casetext	Document review, contract analysis
Education	Khan Academy Khanmigo	Tutoring, explanation, Socratic dialogue
Finance	BloombergGPT	Earnings analysis, financial Q&A

The pattern: LLMs provide language understanding + generation on top of domain-specific data.

8. Limitations and Failure Modes

Understanding limitations is as important as understanding capabilities:

Hallucination

LLMs generate plausible-sounding but false information. They optimize for token probability, not factual accuracy. Always verify critical facts from LLM outputs.

Knowledge Cutoff

Pre-training data has a cutoff date. LLMs don't know about events after their training. RAG and web search are standard mitigations.

Context Window Constraints

Even with 1M-token windows, performance degrades on information buried deep in long contexts (the "lost in the middle" phenomenon). Don't assume uniform attention across a long context.

Reasoning Brittleness

LLMs can fail on basic arithmetic, logic puzzles, or spatial reasoning — tasks trivially handled by rule-based systems. Chain-of-thought helps but doesn't eliminate the problem.

Prompt Sensitivity

Small changes in phrasing can produce dramatically different outputs. LLMs are not robust to prompt variation in the way that well-engineered software should be.

9. What's Next

The field is moving fast. Near-term directions:

Longer contexts: Pushing beyond 1M tokens with better attention mechanisms (sparse, linear)
Better reasoning: Models trained explicitly to "think" before answering (o1/o3, DeepSeek-R1)
Efficiency: Smaller models matching larger ones via better data curation and distillation
Multimodality: Native video, audio, and image understanding baked in, not bolted on
Continuous learning: Models that update from user interactions without full retraining

Key Takeaways

Attention is the core primitive: Self-attention lets every token relate to every other token, which is why transformers can capture long-range dependencies that RNNs couldn't.

Scale has been the driver: Scaling parameters, data, and compute together, following predictable power laws, has been the main engine of LLM progress.

Next-token prediction learns everything: The simplest possible self-supervised objective (predict the next word) turns out to be rich enough to learn grammar, facts, reasoning, and style.

Alignment makes it usable: RLHF and DPO transform a raw text predictor into a helpful assistant. Without alignment, capability doesn't translate to usefulness.