Part I · ML Foundations
Chapter 2

The forward pass: a neural network as a pure function

"*"All models are wrong, but some are differentiable."* — paraphrased, with apologies to George Box"

In Chapter 1 we built up the data structure: a tensor is an n-dimensional array with (shape, dtype, device). In this chapter we use that data structure to build the simplest interesting object in machine learning: a neural network, viewed as a pure mathematical function from a tensor to a tensor.

The framing matters. Once you can hold a neural network in your head as a single composed function — no magic, no black box, just f(g(h(x))) with learned weights — every later chapter becomes easier. The transformer is a function. Backprop is the mechanical chain rule applied to that function. Training is finding parameters that minimize a scalar over that function. Inference is one evaluation of that function. Quantization, distillation, and serving are all just transformations of this function. There is nothing else.

Outline:

  1. The linear layer is just y = Wx + b.
  2. What “pure function” means in this context, and why it matters.
  3. The activation function and the necessity of nonlinearity.
  4. The activation zoo — ReLU, GELU, SiLU/Swish, Sigmoid, Tanh.
  5. Stacking layers as function composition: the MLP.
  6. Hand-tracing a tiny MLP.
  7. Initialization: why bad init silently kills training.
  8. Normalization preview — BatchNorm, LayerNorm, RMSNorm.
  9. Residual connections preview.
  10. The forward pass mental model.

2.1 The linear layer is y = Wx + b

The most fundamental building block of every neural network is the linear layer, also called the fully connected layer or dense layer. It is exactly this:

y = W x + b

where:

  • x is an input vector of shape (in_features,)
  • W is a weight matrix of shape (out_features, in_features)
  • b is a bias vector of shape (out_features,)
  • y is an output vector of shape (out_features,)

It is one matrix-vector product plus an additive bias. That is the entire layer. There is no other content.

In PyTorch:

import torch
import torch.nn as nn

layer = nn.Linear(in_features=512, out_features=2048)
x = torch.randn(512)
y = layer(x)              # shape (2048,)

# under the hood, this is exactly:
y_manual = layer.weight @ x + layer.bias
torch.allclose(y, y_manual)   # True

Two non-obvious points:

[Figure: Linear layer as a matrix-vector multiply — input x of shape (in_features,), weight W of shape (out_features, in_features), output y = W @ x + b of shape (out_features,). The weight is stored as (out_features, in_features) precisely so that W @ x works without a transpose.]

(1) PyTorch stores W as (out, in), not (in, out). This is a convention choice — it means W @ x works directly, with no transpose. NumPy people who’ve internalized “rows are samples” find this jarring at first.

(2) The “linear layer” is actually an affine map, not a linear one in the strict mathematical sense, because of the bias. A truly linear map maps the zero vector to the zero vector. Wx + b doesn’t (unless b = 0). The ML community calls it linear anyway. Mathematicians grimace; you ignore them.
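A two-line check makes the affine point concrete (toy sizes, purely for illustration):

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 4)
zero = torch.zeros(4)

# a truly linear map would send 0 to 0; Wx + b sends it to b
print(torch.allclose(layer(zero), layer.bias))  # True
```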

Batched linear

In practice, you never apply a linear layer to a single vector. You apply it to a batch:

x = torch.randn(32, 512)        # batch of 32 vectors
y = layer(x)                    # shape (32, 2048)

PyTorch handles this by treating the last dimension as the “feature” dimension and broadcasting the linear over all leading dimensions:

x = torch.randn(8, 16, 512)     # batch of 8 sequences of 16 vectors each
y = layer(x)                    # shape (8, 16, 2048)

This is exactly what you want for transformers, where x typically has shape (N, S, D) and you want to apply the same linear independently to every position in every sequence in the batch. The linear layer is rank-polymorphic: it operates on the last dim and leaves all leading dims alone.

The math is the same: for each combination of leading indices, you compute y[..., :] = W @ x[..., :] + b. The implementation uses a single batched matmul kernel, which is one of the most heavily optimized operations on modern GPUs.
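A quick sketch confirming that the broadcast really is just a per-position matmul (sizes are illustrative):

```python
import torch
import torch.nn as nn

layer = nn.Linear(512, 2048)
x = torch.randn(8, 16, 512)
y = layer(x)                                   # shape (8, 16, 2048)

# the broadcasted linear equals W @ x[..., :] + b at every position,
# which in batched form is x @ W.T + b
y_manual = x @ layer.weight.T + layer.bias
print(torch.allclose(y, y_manual, atol=1e-5))  # True
```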

Counting parameters

A linear layer of shape (in, out) has in * out + out parameters: the weight matrix plus the bias. For a 4096→16384 layer, that’s about 67 million parameters in just one layer. A 70B-parameter language model is mostly a stack of linear layers; the count is dominated by them.

layer = nn.Linear(4096, 16384)
sum(p.numel() for p in layer.parameters())   # 67_125_248

2.2 What “pure function” means

A pure function in mathematics is one that:

  1. Always returns the same output for the same input.
  2. Has no side effects.

A neural network — at inference time, with frozen weights — is exactly this. Given the same input tensor and the same weights, the forward pass always produces the same output (modulo some hardware-level nondeterminism we’ll address later). It does not modify global state, write to disk, or make network calls. It is a pure function from (input, weights) to output.

This framing is the bedrock of the entire field for two reasons:

(1) Differentiability. Pure functions made of differentiable primitives are themselves differentiable. The chain rule of calculus tells us how to compute derivatives of compositions of differentiable functions. That is the entire content of backpropagation, which we’ll cover in Chapter 3.

(2) Composability. If f, g, and h are pure functions, then so is f(g(h(x))). There is no hidden state between them. You can wire them together however you like, swap one out for another, or fuse them into a single optimized kernel — and the meaning is preserved. This is why every modern ML framework (PyTorch, JAX, TensorFlow) treats models as compositions of operations on tensors. It’s also why JAX leans into the functional purity hard with jit, grad, vmap — those transformations only work on pure functions.

The thing that breaks purity is state. Two examples that show up everywhere:

  • BatchNorm in training mode updates running statistics on every forward pass. It is not pure. This is why BatchNorm has a “training” mode and an “eval” mode that behave differently, and why forgetting model.eval() before inference is one of the most common bugs in PyTorch code.
  • Dropout uses a random number generator, so it is not deterministic across calls. You can fix the seed, but if you don’t, dropout breaks reproducibility.

These are the exceptions. The vast majority of operations in a transformer are pure, which is why we can reason about them as functions.
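Dropout makes the purity break easy to see: in train mode, two calls on the same input generally disagree; in eval mode, dropout is the identity and purity is restored. A small sketch:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()
a, b = drop(x), drop(x)   # stochastic: two calls generally disagree

drop.eval()
c, d = drop(x), drop(x)   # pure again: both are exactly x
print(torch.equal(c, x) and torch.equal(d, x))  # True
```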

2.3 Why nonlinearity matters

If you stack two linear layers without anything between them:

[Figure: Two linear layers without an activation collapse to one — W₂(W₁x + b₁) + b₂ ≡ W′x + b′. With a nonlinearity σ between them, the composition cannot collapse and the depth is genuine. The activation is the only thing that makes depth meaningful.]
y = W_2 (W_1 x + b_1) + b_2
  = (W_2 W_1) x + (W_2 b_1 + b_2)
  = W' x + b'

…you get another linear layer. Algebraically equivalent. The “depth” was a lie. You could have replaced the two-layer network with a single layer W' = W_2 W_1 and gotten the same function. No matter how many linear layers you stack, the result is always one big linear function — and linear functions can only fit linear data.
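The collapse is easy to verify numerically (toy sizes, illustrative only):

```python
import torch
import torch.nn as nn

f1 = nn.Linear(4, 8)
f2 = nn.Linear(8, 3)
x = torch.randn(5, 4)

# fold the two layers into one: W' = W2 W1, b' = W2 b1 + b2
W_prime = f2.weight @ f1.weight
b_prime = f2.weight @ f1.bias + f2.bias

# the two-layer stack and the single folded layer agree
print(torch.allclose(f2(f1(x)), x @ W_prime.T + b_prime, atol=1e-5))  # True
```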

This is why every neural network has a nonlinear activation function between linear layers. The activation breaks the algebraic collapse: you can’t push the activation outside the matrix multiply or fold it into the next weight matrix. The composition σ(Wx + b) is genuinely more expressive than Wx + b, and σ(W_2 σ(W_1 x + b_1) + b_2) is genuinely more expressive than either.

The famous result that makes this work is the universal approximation theorem (Cybenko 1989, Hornik 1991): a feedforward network with a single hidden layer and any “squashing” nonlinearity can approximate any continuous function on a compact domain to arbitrary precision, given enough hidden units. The theorem doesn’t tell you how many hidden units you need (the answer is “exponentially many” for many functions), and it doesn’t tell you the parameters can be learned — it just tells you they exist. But it tells you that the architecture is not the bottleneck.

In practice, depth helps far more than width. A deep narrow network represents many functions far more parameter-efficiently than a shallow wide one — this is the empirical result that justifies the entire field.

2.4 The activation zoo

The activation function is applied element-wise to its input. Same shape in, same shape out, every entry transformed independently. The choice of activation matters less than people think (they’re all reasonable), but enough that you should know the menu.

ReLU — max(0, x)

The Rectified Linear Unit. The simplest possible nonlinearity that isn’t trivial. Negative inputs become zero; positive inputs pass through unchanged.

relu = lambda x: x.clamp(min=0)

ReLU’s killer feature is that its gradient is either 0 or 1. There’s no exponential, no division, no transcendental. The forward pass is one comparison and one move; the backward pass is one comparison. It’s very fast.

Its killer flaw is the dying ReLU problem: if a neuron’s input is consistently negative, its gradient is consistently zero, and it stops learning. The neuron is “dead.” This was one of the motivations for the variants below.

ReLU is the default for convnets and old-school MLPs. It is not the default in transformers anymore.

GELU — Gaussian Error Linear Unit

GELU is x * Φ(x), where Φ is the standard normal cumulative distribution function. Loosely, it’s “ReLU but with a smooth transition through zero.”

import torch.nn.functional as F
y = F.gelu(x)   # exact erf form by default; F.gelu(x, approximate="tanh") is the faster tanh variant

GELU is the activation used in BERT, GPT-2, GPT-3, and most pre-2023 transformers. It is smooth (so its gradient is smooth too), it doesn’t have the dying ReLU problem, and it has a small “negative bump” near zero that gives it some empirical advantages. It is slower than ReLU but the cost is dwarfed by the matrix multiplies around it.

SiLU / Swish — x * sigmoid(x)

x * σ(x), where σ(x) = 1 / (1 + e^(-x)). Also called Swish (the original Google paper) and SiLU (the PyTorch name). Mathematically very close to GELU — both are smooth, both have a small negative bump, both are non-monotonic.
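You can confirm the identity directly — F.silu is exactly x times the sigmoid:

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-4, 4, steps=9)
print(torch.allclose(F.silu(x), x * torch.sigmoid(x)))  # True

# GELU and SiLU track each other closely on moderate inputs
print((F.gelu(x) - F.silu(x)).abs().max())
```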

SiLU is the activation in Llama, Mistral, and most modern open LLMs. The reason isn’t a deep theoretical superiority over GELU; it’s mostly that the Llama paper used it and the open-source community copied them.

SwiGLU — the gated variant

Modern transformers don’t actually use a plain SiLU in the FFN. They use SwiGLU, which is a gated linear unit with SiLU as the gate:

SwiGLU(x) = silu(W_gate x) * (W_up x)

The intuition: instead of silu(W x) (one matmul, one activation), you compute two parallel projections, run silu on one, and use the result as a gate that multiplies the other. It costs an extra matmul (so it’s 1.5× the parameters of a plain FFN for the same hidden size), but the empirical quality at fixed parameter count is better. Modern models compensate by using a smaller FFN expansion factor (from 4× down to ~2.67× for SwiGLU FFNs).

This is the FFN you’ll see in Llama, Mistral, and most current models. It’s the reason a “modern” transformer FFN has three linear layers instead of two (gate, up, down). We’ll come back to it in Chapter 7.
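A minimal sketch of a SwiGLU FFN, using the gate_proj/up_proj/down_proj names from the Llama convention (the sizes here are illustrative, not from any particular model):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Sketch of a SwiGLU feed-forward block. Three linears instead of two:
    gate, up, down."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        # silu on the gate projection, elementwise product with the up projection
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

ffn = SwiGLUFFN(dim=512, hidden_dim=1365)   # ~2.67x expansion
y = ffn(torch.randn(2, 16, 512))            # shape (2, 16, 512)
```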

Sigmoid and Tanh

σ(x) = 1 / (1 + e^(-x)) and tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)). The classical activation functions. Sigmoid squashes to (0, 1); tanh squashes to (-1, 1).

You almost never use these as the activation in a hidden layer of a modern network — they suffer from the vanishing gradient problem at the saturation regions. Where you do see them is in gating roles: sigmoid as a “how much of this signal should pass” multiplier, tanh as a “what’s the candidate update” function. LSTMs and GRUs are full of them; Transformers use sigmoid in the SwiGLU gate (well, SiLU is sigmoid times x, so it’s hiding in there).

You also see sigmoid as the output activation for binary classification, mapping a single logit to a probability.

Softmax — the activation for the output

Softmax isn’t an element-wise activation; it operates on a whole vector. Given a vector of logits z, softmax produces a vector of probabilities:

softmax(z)_i = exp(z_i) / sum_j exp(z_j)

The output is a probability distribution: every entry is in (0, 1) and the entries sum to 1. This is the activation used at the output of a classifier (and at the output of the LLM head, where the “classes” are vocabulary tokens). It’s also the heart of attention, which we’ll meet in Chapter 6.

The thing to know about softmax for now: it’s numerically delicate because of the exp. Computing exp(z_i) for large z_i overflows; for very negative z_i it underflows. The standard fix is the log-sum-exp trick: subtract the max before exponentiating.

def softmax_safe(z):
    z_shifted = z - z.max(dim=-1, keepdim=True).values
    exp_z = z_shifted.exp()
    return exp_z / exp_z.sum(dim=-1, keepdim=True)

This is mathematically identical (the max cancels in numerator and denominator) but numerically stable. Every framework’s softmax does this internally; you don’t have to. But you should know why.

2.5 Stacking layers as function composition: the MLP

A multi-layer perceptron (MLP) is the simplest neural network with depth: a stack of (linear, activation, linear, activation, ..., linear) operations. The activation between linear layers is what gives the depth meaning.

[Figure: Data flow through a three-layer MLP — input (N, in_dim) → Linear W₁, b₁ → GELU → (N, hidden) → Linear W₂, b₂ → GELU → (N, hidden) → Linear W₃, b₃ → logits (N, out_dim). Three matrix multiplies and two nonlinearities chained: f(x) = W₃ σ(W₂ σ(W₁x + b₁) + b₂) + b₃. The output layer has no activation, so its range is unconstrained (logits, not probabilities).]
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, out_dim)
        self.act = nn.GELU()

    def forward(self, x):
        x = self.act(self.fc1(x))     # (N, hidden_dim)
        x = self.act(self.fc2(x))     # (N, hidden_dim)
        x = self.fc3(x)               # (N, out_dim)
        return x

There it is — a complete neural network in 12 lines. As a mathematical function:

f(x) = W_3 σ(W_2 σ(W_1 x + b_1) + b_2) + b_3

That’s it. Read that expression. Internalize it. Every later model in this book is going to feel intimidating until you remember it’s just a more elaborate version of this same function-composition pattern. The transformer is f(g(h(x))) where each step happens to be more interesting than a single linear; ResNet-50 is the same; even the largest LLMs are this expression with a longer chain.

Two stylistic notes:

  • The output layer typically has no activation, especially for regression and classification. A classification model emits logits (which softmax will turn into probabilities); a regression model emits a real number directly. Putting an activation on the output layer constrains the range and is almost always a mistake unless you have a specific reason.
  • The hidden dim being larger than the input/output dim is the conventional choice. The width gives the network capacity to represent intermediate features. In transformers, the FFN typically expands by 4× (or 2.67× for SwiGLU) before projecting back down.

2.6 Hand-tracing a tiny MLP

Let’s trace one forward pass through a network so small you can do it on paper. Two-input, three-hidden, one-output, with ReLU.

Set the weights to:

W_1 = [[ 1, -1],     b_1 = [0, 0, 0]
       [ 0,  1],
       [ 1,  1]]                      # shape (3, 2)

W_2 = [[1, 1, 1]]    b_2 = [0]        # shape (1, 3)

And the input x = [2, -1].

Step 1. Hidden pre-activation: h_pre = W_1 x + b_1.

h_pre[0] = 1*2 + (-1)*(-1) + 0 = 2 + 1 = 3
h_pre[1] = 0*2 +    1*(-1) + 0 = 0 - 1 = -1
h_pre[2] = 1*2 +    1*(-1) + 0 = 2 - 1 = 1

So h_pre = [3, -1, 1].

Step 2. Activation: h = ReLU(h_pre) = [max(0, 3), max(0, -1), max(0, 1)] = [3, 0, 1].

The middle neuron is dead for this input. Its gradient through the ReLU will also be zero, which is the dying ReLU phenomenon in action.

Step 3. Output: y = W_2 h + b_2 = 1*3 + 1*0 + 1*1 + 0 = 4.

That’s it. A complete forward pass of a two-layer network (two linears, one ReLU) in three lines of arithmetic. Every neural network you’ll ever work with is this, but with millions of neurons and a more interesting connectivity pattern.
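The same trace in a few lines of PyTorch, as a check on the arithmetic:

```python
import torch

W1 = torch.tensor([[1., -1.], [0., 1.], [1., 1.]])   # shape (3, 2)
b1 = torch.zeros(3)
W2 = torch.tensor([[1., 1., 1.]])                    # shape (1, 3)
b2 = torch.zeros(1)
x = torch.tensor([2., -1.])

h_pre = W1 @ x + b1          # tensor([ 3., -1.,  1.])
h = torch.relu(h_pre)        # tensor([3., 0., 1.]) -- middle neuron dead
y = W2 @ h + b2              # tensor([4.])
print(h_pre, h, y)
```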

The discipline: when a model misbehaves, do this trace mentally. Not on the full 70B model, but on the layer or block where you suspect the bug. Print the activations. Check the shapes. Most “the model isn’t learning” bugs surrender in 30 seconds when you actually look at one forward pass.

2.7 Initialization: why bad init silently kills training

The weights of a neural network have to start somewhere. If you initialize them all to zero, every neuron in a layer computes the same thing, every gradient is identical, and the network never breaks symmetry — it learns nothing. If you initialize them too large, activations explode; too small, they vanish.

The standard schemes:

  • Xavier / Glorot initialization (Glorot & Bengio 2010): draws from a distribution with variance 2 / (fan_in + fan_out). Designed for tanh / sigmoid networks where the variance should be preserved through the forward pass.
  • Kaiming / He initialization (He et al. 2015): draws from a distribution with variance 2 / fan_in. The factor of 2 compensates for the fact that ReLU zeros half the activations on average. This is the default for ReLU networks, and it’s what nn.Linear does in PyTorch (with a small modification) by default.

You will rarely need to think about initialization explicitly — every framework picks a sensible default per layer. But you should know:

  • Bad init makes a model train slowly or not at all. If your model isn’t learning and the loss isn’t moving, after checking your data and your loss function, the next thing to look at is whether the activations and gradients are exploding or vanishing through the layers. Print the mean and std of each layer’s output on a forward pass; healthy activations should have variance close to 1 (with appropriate normalization).
  • Transformers are surprisingly init-sensitive at large scale. “Truncated normal with std 0.02” is the conventional choice and is what GPT-2 used; many later models tweak it slightly. The Megatron paper has a careful analysis.
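The print-the-statistics check from the first bullet can be sketched like this (depth and widths are illustrative; the exact numbers you see depend on the init):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
blocks = [nn.Sequential(nn.Linear(512, 512), nn.ReLU()) for _ in range(8)]
x = torch.randn(256, 512)

# watch the mean and std of activations drift through the depth
for i, block in enumerate(blocks):
    x = block(x)
    print(f"layer {i}: mean={x.mean():+.3f}  std={x.std():.3f}")
```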

We will not spend more time on initialization. Just remember it exists, and when the model isn’t learning, it’s one of the things to check.

2.8 Normalization: a preview

A bare stack of linear → activation → linear → activation works for shallow networks but starts to break for deep ones. The activations drift in mean and variance as you go through the layers, gradients explode or vanish, and training becomes brittle.

Normalization layers fix this by re-centering and re-scaling the activations at strategic points in the network. The three you’ll meet:

  • BatchNorm (Ioffe & Szegedy 2015): normalize across the batch dimension. The first hugely successful normalization. Killer feature: faster training, less init sensitivity, regularization side effect. Killer flaw: depends on batch statistics, so it behaves differently in train vs eval, breaks for batch size 1, and breaks for tiny batches that don’t have meaningful statistics.
  • LayerNorm (Ba, Kiros & Hinton 2016): normalize across the feature dimension, independently per sample. Fixes BatchNorm’s batch-dependence. This is the normalization in the original transformer.
  • RMSNorm (Zhang & Sennrich 2019): like LayerNorm but only re-scales (no re-centering). Cheaper, surprisingly works just as well, and is used in Llama and most modern open models.

We’ll cover the differences in detail in Chapter 7. For now, just know:

  • Modern transformers use RMSNorm in two places per block: before attention, before the FFN. It’s a shape-preserving operation: input shape = output shape.
  • The normalization is applied per token, across the feature dimension. Each token in each sequence in each batch gets its features renormalized independently of the others. This is critical: it means LayerNorm/RMSNorm don’t introduce any cross-token coupling, which would interfere with the autoregressive property.
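A minimal RMSNorm sketch, ahead of the full treatment in Chapter 7 (details like the eps value are assumptions here, not taken from any specific model):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Rescale each token by its root-mean-square; no re-centering,
    one learned gain per feature."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # statistics over the last (feature) dim only: no cross-token coupling
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

norm = RMSNorm(512)
y = norm(torch.randn(4, 16, 512))   # shape preserved: (4, 16, 512)
```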

2.9 Residual connections: a preview

A second technique that makes deep networks trainable is the residual connection (He et al. 2015, the ResNet paper): instead of computing

y = f(x)

…you compute

y = x + f(x)

The f is some block of layers (a bottleneck in ResNet, an attention sub-block in a transformer). The point of the +x is that the gradient flows directly through the addition, so the deeper layers can’t completely starve the earlier ones of signal. This is the single most important architectural innovation that made networks deeper than ~20 layers trainable.
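A minimal sketch of the pattern — the inner block here is an arbitrary shape-preserving stand-in, not any particular architecture:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + f(x); f must preserve shape so the addition is legal."""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):
        # gradient flows through the + with factor 1, bypassing f
        return x + self.f(x)

block = ResidualBlock(512)
x = torch.randn(4, 512)
y = block(x)                         # shape (4, 512)
```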

In transformers, residuals are everywhere. Every attention sub-block and every FFN sub-block is wrapped in a residual:

x = x + attention(norm(x))
x = x + ffn(norm(x))
[Figure: Residual connection — the skip (identity) path carries a gradient of exactly 1 through the addition node: ∂(x + f(x))/∂x = 1 + ∂f/∂x. The +1 means that even if a block's gradient vanishes (∂f/∂x ≈ 0), the earlier layers still receive a usable signal.]

This forms what people call the residual stream: a single tensor x of shape (N, S, D) that flows through the entire network, with each layer adding a learned update to it. The residual stream is one of the most important mental models for understanding transformers, and we’ll come back to it in Chapter 7.

2.10 The forward pass mental model

The seven points to take into Chapter 3:

  1. A neural network is a function: y = f(x; θ). Frozen weights make it pure.
  2. The atomic building block is the linear layer: y = Wx + b. Everything else is composition and nonlinearity.
  3. Without nonlinearity, depth collapses. The activation is what makes “deep” mean anything.
  4. The activation menu is small: ReLU for old conv nets, GELU for early transformers, SwiGLU (with SiLU as the gate) for modern transformers. The exact choice matters less than the matmuls around it.
  5. An MLP is f(x) = W_n σ(... σ(W_2 σ(W_1 x + b_1) + b_2) ... + b_n). Read that until it’s burned in.
  6. Initialization matters more than people think. Trust the default; check it when training stalls.
  7. Normalization and residuals are the two structural inventions that make deep networks trainable. We’ll cover both in detail in Chapter 7.

In Chapter 3, we add gradients, and the function turns into something that can be learned.


Read it yourself

  • Michael Nielsen’s Neural Networks and Deep Learning, chapters 1 and 2 — the canonical undergraduate-style introduction. Free online. Read both chapters slowly.
  • The 3Blue1Brown YouTube series Neural Networks, episodes 1 and 2 — the visual intuition for everything in this chapter. Watch both.
  • Deep Learning, Goodfellow, Bengio & Courville, chapter 6 — the rigorous version with all the proofs. Optional but rewarding.
  • The original ResNet paper (He et al., 2015), Deep Residual Learning for Image Recognition — the residual idea, with the empirical evidence.

Practice

  1. Show algebraically that two stacked linear layers without an activation are equivalent to one linear layer. Write down the resulting W' and b' in terms of W_1, b_1, W_2, b_2.
  2. Implement softmax in PyTorch yourself, including the log-sum-exp trick. Verify it matches torch.softmax on a vector containing both 1e9 and -1e9.
  3. Hand-trace a forward pass through this network, by hand on paper, with input x = [1, 1]:
    W_1 = [[2, 0], [0, -1]]   b_1 = [0, 1]
    activation = ReLU
    W_2 = [[1, 1]]            b_2 = [-1]
    What is y?
  4. Why does the SwiGLU FFN use 2.67× expansion instead of 4×? (Hint: count parameters of W_gate, W_up, W_down vs W_up, W_down.)
  5. Pick a layer in any pretrained model in HuggingFace (Llama-3.1-8B is a good choice). Print model.model.layers[0].mlp and identify each sub-layer. Match it to the SwiGLU formula in §2.4. Which linear is W_gate, which is W_up, which is W_down?
  6. Stretch: Implement an MLP from scratch in NumPy (no PyTorch). Train it on the XOR problem ([(0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0]). You can implement gradient descent by hand — derivative of MSE is straightforward. The hidden layer needs to be at least 2 wide. This exercise is annoying. Do it anyway. It cements the chapter.

Concept check

  1. For a linear layer with in_features=512 and out_features=256, how many trainable parameters does it have (including bias)?
  2. Why is a nonlinear activation function necessary between linear layers?
  3. GELU is preferred over ReLU in modern transformers primarily because it is
  4. Xavier/Glorot initialization sets initial weights to have variance proportional to 1/(fan_in + fan_out). What failure mode does this prevent?