Following our exploration of polysemanticity and monosemanticity in Part 3, we now turn to the specific architecture that dominates modern AI: the Transformer. To understand the learned algorithms within these powerful models, we need a robust mathematical framework that allows for systematic decomposition and analysis of their internal workings, particularly their attention mechanisms and information flow. This part lays out such a framework, building on the conceptual foundations from Part 1 and Part 2.

Deconstructing the Transformer

Transformers, while complex, possess a highly structured architecture that lends itself to mathematical decomposition. Key components include token embeddings, positional encodings, multiple layers of attention and MLP (Multi-Layer Perceptron) blocks, and a final unembedding step. For the development of a clear mathematical framework, we often initially simplify by focusing on attention-only models or omitting biases and layer normalization, as these can be added back later without fundamentally altering the core computational pathways.

The Residual Stream: A Central Communication Bus

The residual stream is arguably the most critical architectural feature for enabling mechanistic analysis. At each layer $l$, the output $\mathbf{x}^{(l)}$ is the sum of the input from the previous layer, $\mathbf{x}^{(l-1)}$, and the computations performed by the current layer’s components (e.g., attention head outputs, MLP outputs):

$$\mathbf{x}^{(l)} = \mathbf{x}^{(l-1)} + \sum_k \text{ComponentOutput}_k^{(l)}\left(\mathbf{x}^{(l-1)}\right)$$

where $\text{ComponentOutput}_k^{(l)}(\mathbf{x}^{(l-1)})$ is the output of the $k$-th component in layer $l$ (e.g., an attention head or an MLP block), which is itself a function of the input to that layer, $\mathbf{x}^{(l-1)}$. This additive, linear structure means the residual stream acts like a shared communication bus or a “results highway”: any component can read from the accumulated state of the stream and write its own contribution back. This has profound implications:

  1. Linearity for Analysis: The primary information pathway is linear, allowing for techniques like path expansion and virtual weight computation.
  2. Non-Privileged Basis: The residual stream itself doesn’t inherently have a privileged basis. Any global orthogonal transformation applied consistently to all interacting weight matrices would leave the model functionally unchanged. This reinforces the idea (from Part 2) that features are directions, not necessarily neuron alignments.
  3. Superposition at Scale: With many components writing to and reading from a residual stream of fixed dimensionality ($d_{\text{model}}$), it naturally becomes a place where multiple signals (feature activations) are superposed.
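
To make this additive bookkeeping concrete, here is a minimal NumPy sketch (toy dimensions and random stand-in components, not taken from any real model) showing that the final stream state decomposes exactly into the initial embedding plus every component’s write:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers, n_components = 16, 3, 2  # toy sizes, purely illustrative

# Stand-in for each component: a linear "read -> write" map on the residual stream.
weights = [[rng.normal(scale=0.1, size=(d_model, d_model))
            for _ in range(n_components)] for _ in range(n_layers)]

x = rng.normal(size=d_model)          # initial embedding written to the stream
stream = x.copy()
contributions = []                    # record what each component adds

for layer in weights:
    layer_input = stream.copy()       # every component in a layer reads the same input
    for W in layer:
        out = layer_input @ W         # component output (toy linear component)
        contributions.append(out)
        stream = stream + out         # additive write back to the stream

# The stream decomposes exactly into the embedding plus all component writes.
reconstruction = x + sum(contributions)
print(np.allclose(stream, reconstruction))  # True
```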

Virtual Weights: Unveiling Effective Connectivity

Because of the residual stream’s additive nature, a component in a later layer $L_J$ doesn’t just see the direct output of layer $L_J - 1$; it effectively sees the sum of outputs from all preceding components in layers $L_I < L_J$ that wrote to the stream. Virtual weights quantify the effective linear transformation from the input of an earlier component (or its output contribution) to an input-processing stage of a later component, taking into account all intermediate additions and transformations in the residual stream.

Let’s define some terms:

  • Let $\mathbf{M}_{out}^{(C_I)}$ be the effective output matrix of a component $C_I$ (e.g., an attention head $H_I$ or an MLP block) in layer $L_I$. If $X$ is the input to component $C_I$ from the residual stream, its output contribution to the stream is $O_{C_I} = X \mathbf{M}_{out}^{(C_I)}$. For an attention head $H_I$, $\mathbf{M}_{out}^{(H_I)}$ would be its value-output transformation $\mathbf{W}_V^{(H_I)}\mathbf{W}_O^{(H_I)}$ (a $d_{\text{model}} \times d_{\text{model}}$ matrix), assuming the attention pattern itself is fixed or we are analyzing a specific path of information flow through a value vector. For an MLP layer, if it is a simple linear transformation, $\mathbf{M}_{out}^{(MLP)}$ would be its weight matrix. If it is a non-linear MLP (e.g., with ReLU), $\mathbf{M}_{out}^{(MLP)}$ represents an effective linear matrix for a specific input or in an average sense, or the path through specific active neurons. For the purpose of linear path analysis, we often approximate non-linear components by their local linear behavior or focus on paths where the non-linearities are fixed (e.g., a specific ReLU activation pattern). A common linearized formulation for a two-layer MLP is $\mathbf{W}_{in}^{(MLP)}\mathbf{W}_{out}^{(MLP)}$ (again $d_{\text{model}} \times d_{\text{model}}$).
  • Let $\mathbf{W}_{in-proj}^{(C_J)}$ be an input projection matrix of a component $C_J$ in layer $L_J$. For an attention head $H_J$, this could be its query matrix $\mathbf{W}_Q^{(H_J)}$, key matrix $\mathbf{W}_K^{(H_J)}$, or value matrix $\mathbf{W}_V^{(H_J)}$ (each of shape $d_{\text{model}} \times d_{\text{head}}$).

1. Direct Virtual Weight (No Intermediate Layers, i.e., $L_J = L_I + 1$, or within the same layer when analyzing parallel components):

If component $C_I$ writes $O_{C_I} = X \mathbf{M}_{out}^{(C_I)}$ to the stream, and component $C_J$ (in the next layer, or a later component in the same layer reading from the updated stream) applies an input projection $\mathbf{W}_{in-proj}^{(C_J)}$, then the part of $C_J$’s projected input that comes from $X$ via $C_I$ is $(X \mathbf{M}_{out}^{(C_I)}) \mathbf{W}_{in-proj}^{(C_J)}$. The direct virtual weight matrix $\mathbf{W}_{\text{virtual}}^{(C_I \rightarrow C_J)}$ mapping the input $X$ (which fed into $C_I$) to this contribution at $C_J$’s input projection is:

$$\mathbf{W}_{\text{virtual, direct}}^{(C_I \rightarrow C_J)} = \mathbf{M}_{out}^{(C_I)} \mathbf{W}_{in-proj}^{(C_J)}$$

For example, the virtual weight from the input of head $H_a$’s OV circuit (matrix $\mathbf{W}_V^{(H_a)}\mathbf{W}_O^{(H_a)}$) to the query projection of head $H_b$ (matrix $\mathbf{W}_Q^{(H_b)}$) in an immediately subsequent processing step is $(\mathbf{W}_V^{(H_a)}\mathbf{W}_O^{(H_a)}) \mathbf{W}_Q^{(H_b)}$. The resulting matrix has dimensions $d_{\text{model}} \times d_{\text{head}}^{(H_b)}$.
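
As a concrete illustration of the direct case, the following sketch (random toy weights; the head names are hypothetical) forms the virtual weight from a head $H_a$’s OV circuit to a head $H_b$’s query projection and checks that it reproduces the step-by-step computation:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_head = 32, 8  # toy dimensions

# Toy weights for head Ha (earlier) and head Hb (later)
W_V_a = rng.normal(size=(d_model, d_head))
W_O_a = rng.normal(size=(d_head, d_model))
W_Q_b = rng.normal(size=(d_model, d_head))

M_out_a = W_V_a @ W_O_a        # Ha's OV map, d_model x d_model
W_virtual = M_out_a @ W_Q_b    # direct virtual weight, d_model x d_head

# Verify: pushing an input through Ha's OV circuit and then Hb's query
# projection matches a single multiply by the virtual weight matrix.
x = rng.normal(size=d_model)
step_by_step = (x @ W_V_a @ W_O_a) @ W_Q_b
print(np.allclose(x @ W_virtual, step_by_step))  # True
print(W_virtual.shape)                           # (32, 8) = d_model x d_head
```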

2. Virtual Weight Across Intermediate Layers:

Now consider components $C_I$ in layer $L_I$ and $C_J$ in a later layer $L_J$ (with $L_J > L_I$). The signal $O_{C_I}$ from $C_I$ passes through the intermediate layers $L_k$ (for $L_I < L_k < L_J$).

Each intermediate layer $L_k$ applies a linear transformation to the signal passing through its residual stream. If layer $L_k$ contains components $C_{k,m}$ (heads or MLPs) with effective output matrices $\mathbf{M}_{out}^{(k,m)}$ (as defined above, with the linear approximation for non-linear MLPs), then a signal $S$ entering layer $L_k$ from the previous layer’s residual stream exits layer $L_k$ as $S + \sum_m S \mathbf{M}_{out}^{(k,m)} = S \left(\mathbf{I} + \sum_m \mathbf{M}_{out}^{(k,m)}\right)$. Let $T_k = \mathbf{I} + \sum_m \mathbf{M}_{out}^{(k,m)}$ be this full linear transformation for layer $L_k$, representing the cumulative effect of all parallel components in that layer on a signal passing through the residual stream.

The output contribution $O_{C_I} = X \mathbf{M}_{out}^{(C_I)}$ from component $C_I$ (where $X$ was its input from the stream) becomes $(X \mathbf{M}_{out}^{(C_I)}) \cdot T_{L_I+1} \cdot T_{L_I+2} \cdots T_{L_J-1}$ by the time it reaches the input of layer $L_J$. This transformed signal is then processed by $C_J$’s input projection $\mathbf{W}_{in-proj}^{(C_J)}$. Thus, the full virtual weight matrix from the input $X$ of component $C_I$ to the specific projected input of component $C_J$ is:

$$\mathbf{W}_{\text{virtual}}^{(C_I \rightarrow C_J)} = \mathbf{M}_{out}^{(C_I)} \left( \prod_{k=L_I+1}^{L_J-1} T_k \right) \mathbf{W}_{in-proj}^{(C_J)}$$

If there are no intermediate layers ($L_J = L_I + 1$), the product is empty (an identity matrix), and this reduces to the direct case. This concept is crucial for understanding how non-adjacent layers and components influence each other, effectively forming long-range circuits by composing these linear transformations.
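
The sketch below (again with random toy weights) builds the layer transformations $T_k$ for two hypothetical intermediate layers, forms the full virtual weight, and verifies it against explicitly propagating a signal through the stream:

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_head = 32, 8

def toy_ov():
    """Effective output matrix W_V @ W_O for a toy head."""
    return rng.normal(scale=0.1, size=(d_model, d_head)) @ rng.normal(scale=0.1, size=(d_head, d_model))

M_out_CI = toy_ov()                           # earlier component C_I (layer L_I)
W_in_CJ = rng.normal(size=(d_model, d_head))  # later component C_J's input projection

# Intermediate layers L_I+1 .. L_J-1, each with a few parallel components
intermediate_layers = [[toy_ov() for _ in range(3)] for _ in range(2)]

# T_k = I + sum_m M_out^(k,m): the residual-stream transformation of layer k
T = np.eye(d_model)
for layer in intermediate_layers:
    T_k = np.eye(d_model) + sum(layer)
    T = T @ T_k

W_virtual = M_out_CI @ T @ W_in_CJ            # full virtual weight, d_model x d_head

# Sanity check against explicitly propagating a signal through the stream.
x = rng.normal(size=d_model)
signal = x @ M_out_CI
for layer in intermediate_layers:
    signal = signal + sum(signal @ M for M in layer)
print(np.allclose(signal @ W_in_CJ, x @ W_virtual))  # True
```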

Decomposing the Attention Head

The attention mechanism is the heart of the Transformer. It dynamically routes information based on context. An attention head computes its output by attending to various positions in the input sequence and constructing a weighted sum of their value vectors.

Mathematically, for a single attention head, given input token representations $\mathbf{x}_1, \dots, \mathbf{x}_N \in \mathbb{R}^{d_{\text{model}}}$, the head first projects these into Query ($\mathbf{q}_i$), Key ($\mathbf{k}_j$), and Value ($\mathbf{v}_j$) vectors for each token $i$ (query) and $j$ (key/value source) using weight matrices $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V \in \mathbb{R}^{d_{\text{model}} \times d_{\text{head}}}$:

$$\mathbf{q}_i = \mathbf{x}_i \mathbf{W}_Q, \quad \mathbf{k}_j = \mathbf{x}_j \mathbf{W}_K, \quad \mathbf{v}_j = \mathbf{x}_j \mathbf{W}_V$$

(Note: $\mathbf{q}_i, \mathbf{k}_j, \mathbf{v}_j$ are row vectors of dimension $d_{\text{head}}$.)

Attention scores are computed as the dot product of a query vector with a key vector, scaled by $1/\sqrt{d_{\text{head}}}$:

$$e_{ij} = \frac{\mathbf{q}_i \mathbf{k}_j^T}{\sqrt{d_{\text{head}}}}$$

These scores are then normalized via a softmax over all source positions $j$ to give attention weights $\alpha_{ij} = \text{Softmax}_j(e_{ij})$. The output for query token $i$ from this head, before the final output projection, is a weighted sum of value vectors: $\mathbf{z}_i = \sum_j \alpha_{ij} \mathbf{v}_j$.

This output $\mathbf{z}_i$ (a $d_{\text{head}}$-dimensional row vector) is then projected back to the model dimension using the output weight matrix $\mathbf{W}_O \in \mathbb{R}^{d_{\text{head}} \times d_{\text{model}}}$. The head’s final contribution to the residual stream for token $i$ is:

$$\mathbf{o}_i = \mathbf{z}_i \mathbf{W}_O$$
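
A minimal NumPy implementation of this single-head computation, assuming causal masking as in an autoregressive model and using toy dimensions, might look as follows:

```python
import numpy as np

rng = np.random.default_rng(3)
N, d_model, d_head = 5, 32, 8   # sequence length and toy dimensions

X = rng.normal(size=(N, d_model))            # residual-stream inputs x_1..x_N (rows)
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # q_i, k_j, v_j as rows

scores = Q @ K.T / np.sqrt(d_head)           # e_ij = q_i . k_j / sqrt(d_head)
causal_mask = np.tril(np.ones((N, N), dtype=bool))
scores = np.where(causal_mask, scores, -np.inf)  # only attend to j <= i

# Softmax over source positions j for each query position i
alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
alpha /= alpha.sum(axis=-1, keepdims=True)

Z = alpha @ V                                # z_i = sum_j alpha_ij v_j
O = Z @ W_O                                  # head's contribution to the residual stream
print(O.shape)                               # (5, 32) = (N, d_model)
```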

This mechanism can be decomposed into two key conceptual circuits:

  1. Query-Key (QK) Circuit: Determines where to attend. The QK circuit computes the attention scores $e_{ij}$ (before the softmax). The core of this computation is the term $\mathbf{q}_i \mathbf{k}_j^T$. Let’s derive its form in terms of the original residual stream vectors $\mathbf{x}_i$ and $\mathbf{x}_j$:

    $$\mathbf{q}_i \mathbf{k}_j^T = (\mathbf{x}_i \mathbf{W}_Q) (\mathbf{x}_j \mathbf{W}_K)^T$$

    Using the transpose identity $(AB)^T = B^T A^T$, we have $(\mathbf{x}_j \mathbf{W}_K)^T = \mathbf{W}_K^T \mathbf{x}_j^T$. Substituting this back, we get:

    $$\mathbf{q}_i \mathbf{k}_j^T = \mathbf{x}_i \mathbf{W}_Q \mathbf{W}_K^T \mathbf{x}_j^T$$

    This expression shows that the unnormalized attention score between token $i$ and token $j$ is a bilinear form $\mathbf{x}_i (\mathbf{W}_Q \mathbf{W}_K^T) \mathbf{x}_j^T$.

    The matrix $\mathbf{W}_{\text{eff-QK}} = \mathbf{W}_Q \mathbf{W}_K^T$ is an effective $d_{\text{model}} \times d_{\text{model}}$ matrix that defines how pairs of token representations in the residual stream are compared to produce attention scores. Since $\mathbf{W}_Q$ is $d_{\text{model}} \times d_{\text{head}}$ and $\mathbf{W}_K^T$ is $d_{\text{head}} \times d_{\text{model}}$, the rank of $\mathbf{W}_{\text{eff-QK}}$ is at most $d_{\text{head}}$, typically much smaller than $d_{\text{model}}$. This low-rank structure implies that the QK circuit is specialized for comparing specific types of information, effectively projecting the $d_{\text{model}}$-dimensional token representations into a shared $d_{\text{head}}$-dimensional space for comparison.

  2. Output-Value (OV) Circuit: Determines what information to move from the attended positions and how it is transformed. Once the attention weights $\alpha_{ij}$ are computed, the OV circuit processes the value vectors. The full transformation from an original token representation $\mathbf{x}_j$ (at source position $j$) to its potential contribution to the output (if fully attended, i.e., $\alpha_{ij} = 1$) is $\mathbf{x}_j \mathbf{W}_V \mathbf{W}_O$.

    The matrix $\mathbf{W}_{\text{OV}} = \mathbf{W}_V \mathbf{W}_O$ is an effective $d_{\text{model}} \times d_{\text{model}}$ matrix (since $\mathbf{W}_V$ is $d_{\text{model}} \times d_{\text{head}}$ and $\mathbf{W}_O$ is $d_{\text{head}} \times d_{\text{model}}$, their product is $d_{\text{model}} \times d_{\text{model}}$).

    This matrix describes the transformation applied to a value vector (derived from a token representation $\mathbf{x}_j$ in the residual stream) before it is written back to the residual stream at position $i$. Its rank is also at most $d_{\text{head}}$. For example, if $\mathbf{W}_{\text{OV}} \approx \mathbf{I}$ (the identity), the head primarily copies information from attended positions; otherwise, it transforms that information.

Analyzing the properties (e.g., SVD, eigenvalues) of $\mathbf{W}_Q \mathbf{W}_K^T$ and $\mathbf{W}_V \mathbf{W}_O$ reveals the specific attention patterns and information-processing strategies of individual heads.
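
As a sketch of this kind of weight-based analysis (random weights here, so the numbers themselves are meaningless; in practice one would load a trained head’s matrices), one can form the effective QK and OV matrices, confirm their low rank, and use the sign of the OV eigenvalues as a rough copying diagnostic:

```python
import numpy as np

rng = np.random.default_rng(4)
d_model, d_head = 64, 8

# In practice these would come from a trained attention head.
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))

W_QK = W_Q @ W_K.T        # effective bilinear form for attention scores
W_OV = W_V @ W_O          # effective value -> output transformation

print(np.linalg.matrix_rank(W_QK), np.linalg.matrix_rank(W_OV))  # both <= d_head

# SVD exposes which directions the QK circuit compares.
U, S, Vt = np.linalg.svd(W_QK)
print(S[:d_head + 2].round(2))   # only ~d_head singular values are non-zero

# A head whose OV circuit mostly copies tends to have predominantly positive
# eigenvalues; one rough diagnostic is the positive share of eigenvalue mass.
eigvals = np.linalg.eigvals(W_OV)
copy_score = eigvals.real.clip(min=0).sum() / np.abs(eigvals.real).sum()
print(copy_score)
```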

Path Expansion and Compositional Power

The overall computation of a Transformer can be viewed as a sum over all possible paths that information can take from the input embedding to the final output logits. Each path involves a sequence of components (attention heads, MLP layers) and their respective weight matrices. While attention introduces non-linearity via softmax, analyzing specific paths (e.g., by fixing attention patterns or looking at linear segments of the computation) is a key strategy.

For an attention-only transformer, the logit assigned to a candidate next token $c$ at position $\text{pos}$ can be written as:

$$\text{Logits}(c \mid \text{pos}) = \left( \mathbf{E}[\text{pos},:] + \sum_{l,h} \text{OutputContribution}_{l,h} \right) \mathbf{U}[:,c]$$

where $\mathbf{E}$ is the token embedding matrix (one row per token, $d_{vocab} \times d_{model}$), $\mathbf{U}$ is the unembedding matrix (often $\mathbf{E}^T$, so $d_{model} \times d_{vocab}$; $\mathbf{U}[:,c]$ is the column for token $c$), and $\text{OutputContribution}_{l,h}$ is the output vector added to the residual stream by head $h$ in layer $l$.

This can be expanded term by term. For instance, the output contribution of head $(l,h)$ acting on the stream input $\mathbf{S}^{(l-1)}$ (the output of layer $l-1$) is $\left( \sum_j \alpha_{pos,j}^{(l,h)} \, \mathbf{S}_j^{(l-1)}\mathbf{W}_V^{(l,h)} \right) \mathbf{W}_O^{(l,h)}$.

The simplest paths are:

  • Zero-Layer Path: The direct connection from embedding to unembedding. If the token at position $pos$ has embedding vector $\mathbf{E}[pos,:]$, its direct contribution to the logit for token $c$ is $\mathbf{E}[pos,:]\, \mathbf{U}[:,c]$. This path effectively captures bigram-like token co-occurrence statistics, reducing to embedding similarity when $\mathbf{U} \approx \mathbf{E}^T$.

  • One-Layer Paths: Paths passing through a single attention head. The term $\text{Head}_{l,h}(\mathbf{E})[pos,:]\, \mathbf{U}[:,c]$ describes the influence on the logit for token $c$ of head $(l,h)$ acting on the initial embeddings (if it sits in the first layer). Such paths can implement more complex statistics like skip-trigrams. A numerical sketch of this path-by-path decomposition follows the list.
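
The following sketch (a toy one-layer, attention-only model with random weights) verifies the decomposition numerically: the logits equal the zero-layer path plus one path per attention head:

```python
import numpy as np

rng = np.random.default_rng(5)
V, d_model, d_head, N, H = 50, 32, 8, 6, 2   # toy vocab, dims, sequence length, heads

E = rng.normal(size=(V, d_model))            # embedding (row per token)
U = E.T                                      # tied unembedding, d_model x V
tokens = rng.integers(0, V, size=N)
X = E[tokens]                                # residual stream after embedding

def head_output(X, W_Q, W_K, W_V, W_O):
    """Causal single-head attention output added to the stream."""
    scores = (X @ W_Q) @ (X @ W_K).T / np.sqrt(W_Q.shape[1])
    scores = np.where(np.tril(np.ones((len(X), len(X)), dtype=bool)), scores, -np.inf)
    alpha = np.exp(scores - scores.max(-1, keepdims=True))
    alpha /= alpha.sum(-1, keepdims=True)
    return alpha @ (X @ W_V) @ W_O

heads = [tuple(rng.normal(scale=0.2, size=s) for s in
               [(d_model, d_head), (d_model, d_head), (d_model, d_head), (d_head, d_model)])
         for _ in range(H)]

pos = N - 1
direct_path = X[pos] @ U                                    # zero-layer (bigram-like) path
head_paths = [head_output(X, *w)[pos] @ U for w in heads]   # one path per head

logits = (X[pos] + sum(head_output(X, *w)[pos] for w in heads)) @ U
print(np.allclose(logits, direct_path + sum(head_paths)))   # True: logits = sum over paths
```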

Composition Mechanisms: The Source of Complex Algorithms

The true power of multi-layer Transformers comes from composition, where the output of earlier components influences the computation of later components. This is where virtual weights become essential for analysis. For attention heads, this occurs in three main ways:

  1. Q-Composition (Query Composition): The output of head $H_1$ (in layer $L_1$) modifies the residual stream. When head $H_2$ (in layer $L_2 > L_1$) computes its query vectors, it reads from this modified stream; thus $H_1$ influences what $H_2$ attends to. Let $X$ be the input to $H_1$’s OV circuit (matrix $\mathbf{M}_{out}^{(H_1)} = \mathbf{W}_V^{(H_1)}\mathbf{W}_O^{(H_1)}$). Its output contribution is $X \mathbf{M}_{out}^{(H_1)}$. If intermediate layers transform this by a product of matrices $T_{inter} = \prod_{k=L_1+1}^{L_2-1} \left( \mathbf{I} + \sum_m \mathbf{M}_{out}^{(k,m)} \right)$, the signal becomes $X \mathbf{M}_{out}^{(H_1)} T_{inter}$ as it enters the layer of $H_2$.

    The query vector for $H_2$ is formed from the stream $S_{L_2-1}$ as $S_{L_2-1} \mathbf{W}_Q^{(H_2)}$. The part of this query that comes from $X$ via $H_1$ is $(X \mathbf{M}_{out}^{(H_1)} T_{inter}) \mathbf{W}_Q^{(H_2)}$.

    The virtual weight matrix for this Q-composition path is $\mathbf{W}_{\text{Q-comp}} = \mathbf{M}_{out}^{(H_1)} T_{inter} \mathbf{W}_Q^{(H_2)}$.

  2. K-Composition (Key Composition): Similarly, $H_1$ can influence the key vectors that $H_2$ uses for comparison. The output from $H_1$ ($X \mathbf{M}_{out}^{(H_1)} T_{inter}$) shapes the stream from which $H_2$ forms its key vectors $S_{L_2-1} \mathbf{W}_K^{(H_2)}$. The virtual weight matrix for this K-composition path (from the input of $H_1$’s OV circuit to $H_2$’s key projection) is $\mathbf{W}_{\text{K-comp}} = \mathbf{M}_{out}^{(H_1)} T_{inter} \mathbf{W}_K^{(H_2)}$.

  3. V-Composition (Value Composition): Finally, $H_1$ can influence the value vectors that $H_2$ aggregates. The output from $H_1$ ($X \mathbf{M}_{out}^{(H_1)} T_{inter}$) shapes the stream from which $H_2$ forms its value vectors $S_{L_2-1} \mathbf{W}_V^{(H_2)}$, which then pass through $H_2$’s output projection $\mathbf{W}_O^{(H_2)}$. The virtual weight matrix for this V-composition path (from the input of $H_1$’s OV circuit to $H_2$’s OV output) is $\mathbf{W}_{\text{V-comp}} = \mathbf{M}_{out}^{(H_1)} T_{inter} \mathbf{W}_V^{(H_2)} \mathbf{W}_O^{(H_2)}$.

These composition mechanisms, understood via virtual weights, allow for the construction of virtual attention heads, where the combined effect of multiple heads implements a more complex attention pattern or information transformation than any single head could. For instance, K-composition is fundamental to induction heads (explored in Part 9).
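
A compact sketch of these three composition pathways, using two hypothetical heads with random weights and no intermediate layers (so $T_{inter} = \mathbf{I}$), together with one simple way to quantify composition strength (a normalized Frobenius-norm ratio, similar in spirit to the metrics used by Elhage et al.):

```python
import numpy as np

rng = np.random.default_rng(6)
d_model, d_head = 64, 8

def head_weights():
    return {name: rng.normal(size=shape) for name, shape in
            [("W_Q", (d_model, d_head)), ("W_K", (d_model, d_head)),
             ("W_V", (d_model, d_head)), ("W_O", (d_head, d_model))]}

H1, H2 = head_weights(), head_weights()      # earlier head H1, later head H2
M_out_1 = H1["W_V"] @ H1["W_O"]              # H1's OV map written to the stream
T_inter = np.eye(d_model)                    # no intermediate layers in this toy

W_Q_comp = M_out_1 @ T_inter @ H2["W_Q"]                 # H1 shapes H2's queries
W_K_comp = M_out_1 @ T_inter @ H2["W_K"]                 # H1 shapes H2's keys
W_V_comp = M_out_1 @ T_inter @ H2["W_V"] @ H2["W_O"]     # H1's output is moved by H2

# One rough way to quantify how strongly H1 feeds H2: a normalized Frobenius ratio.
def strength(A, B):
    return np.linalg.norm(A @ B) / (np.linalg.norm(A) * np.linalg.norm(B))

print("K-composition strength:", strength(M_out_1, H2["W_K"]).round(3))
```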

Emergent Complexity in Two-Layer Models

While a zero-layer Transformer is limited to bigrams and a one-layer attention-only Transformer to skip-trigrams, a two-layer Transformer can already exhibit qualitatively new capabilities due to composition. For example, an induction head typically requires at least two heads working in sequence:

  • A “previous token” head (Head 1) in an earlier layer $L_1$ copies (parts of) the $(N-1)$-th token’s representation into the residual stream.
  • An “induction” head (Head 2) in a later layer $L_2$ uses this copied representation. Specifically, via K-composition, the key vectors generated by $H_2$ for previous tokens in the sequence are modulated by the output of $H_1$. If $H_2$ is looking for the token that followed previous instances of token $N-1$, its query vector (also potentially influenced by $H_1$’s output via Q-composition) will match strongly with the key vectors of tokens equal to token $N-1$, and the overall QK circuit of $H_2$ is further specialized to shift attention to the token immediately after each matched occurrence. The OV circuit of $H_2$ then copies that identified token. This is a form of in-context learning that is impossible with single heads in isolation.

Let’s derive the explicit mathematical formulations for zero-layer, one-layer, and two-layer transformers to better understand this emergent complexity:

Zero-Layer Transformer: Direct Token Mapping

In a zero-layer transformer, we have direct connections from token embeddings to output logits without any intermediate attention or MLP layers. The mathematical formulation for predicting the next token is simply:

$$\text{Logits}(c \mid \text{pos}) = \mathbf{E}[\text{pos},:] \cdot \mathbf{U}[:,c]$$

Where:

  • $\mathbf{E} \in \mathbb{R}^{|V| \times d_{\text{model}}}$ is the embedding matrix
  • $\mathbf{U} \in \mathbb{R}^{d_{\text{model}} \times |V|}$ is the unembedding matrix
  • $\text{pos}$ is the position of the input token
  • $c$ is the candidate output token (in the vocabulary)

When the unembedding matrix is (approximately) the transpose of the embedding matrix ($\mathbf{U} \approx \mathbf{E}^T$), this computation reduces to measuring token similarity:

$$\text{Logits}(c \mid \text{pos}) \approx \mathbf{E}[c,:] \cdot \mathbf{E}[\text{pos},:]^T = \text{similarity}(c, \text{pos})$$

This formulation can only capture simple bigram statistics based on embedding similarity. The zero-layer transformer effectively learns which tokens tend to follow other tokens directly, without any contextual understanding.
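
A tiny sketch of this zero-layer computation with tied embeddings (random toy embeddings, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
V, d_model = 10, 4                      # tiny toy vocabulary and width

E = rng.normal(size=(V, d_model))       # embedding, one row per token
U = E.T                                 # tied unembedding (U = E^T)

prev_token = 3
logits = E[prev_token] @ U              # zero-layer prediction: a single matmul
print(np.allclose(logits, E @ E[prev_token]))  # logit for c is E[c,:] . E[prev_token,:]
print(logits.argmax())                  # highest-similarity token under this toy init
```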

One-Layer Transformer: Attention-Based Contextual Processing

A one-layer transformer introduces attention mechanisms between the embedding and unembedding steps. For a model with $H$ attention heads, the logits are computed as:

$$\text{Logits}(c \mid \text{pos}) = \left( \mathbf{E}[\text{pos},:] + \sum_{h=1}^{H} \text{Head}_h(\mathbf{E})[\text{pos},:] \right) \mathbf{U}[:,c]$$

For each attention head $h$ processing position $\text{pos}$, the output contribution is:

$$\text{Head}_h(\mathbf{E})[\text{pos},:] = \sum_{j=1}^{\text{pos}} \alpha_{\text{pos},j}^{(h)} \cdot (\mathbf{E}[j,:] \cdot \mathbf{W}_V^{(h)}) \cdot \mathbf{W}_O^{(h)}$$

Where the attention weights $\alpha_{\text{pos},j}^{(h)}$ are calculated using a softmax over the attention scores:

$$\alpha_{\text{pos},j}^{(h)} = \frac{\exp(e_{\text{pos},j}^{(h)})}{\sum_{k=1}^{\text{pos}} \exp(e_{\text{pos},k}^{(h)})}$$

And the attention scores $e_{\text{pos},j}^{(h)}$ are:

$$e_{\text{pos},j}^{(h)} = \frac{(\mathbf{E}[\text{pos},:] \cdot \mathbf{W}_Q^{(h)}) \cdot (\mathbf{E}[j,:] \cdot \mathbf{W}_K^{(h)})^T}{\sqrt{d_{\text{head}}}}$$

This can be rewritten using the effective QK matrix as described earlier:

$$e_{\text{pos},j}^{(h)} = \frac{\mathbf{E}[\text{pos},:] \cdot \mathbf{W}_{\text{eff-QK}}^{(h)} \cdot \mathbf{E}[j,:]^T}{\sqrt{d_{\text{head}}}}$$

Where $\mathbf{W}_{\text{eff-QK}}^{(h)} = \mathbf{W}_Q^{(h)} \cdot \mathbf{W}_K^{(h)T}$.
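
The following few lines (random toy weights) check numerically that the explicit query-key form and the effective-QK form give the same score:

```python
import numpy as np

rng = np.random.default_rng(8)
V, d_model, d_head = 20, 16, 4

E = rng.normal(size=(V, d_model))
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_effQK = W_Q @ W_K.T                      # effective d_model x d_model comparison matrix

pos_tok, j_tok = 5, 11                     # toy token ids at the query and key positions
score_explicit = (E[pos_tok] @ W_Q) @ (E[j_tok] @ W_K) / np.sqrt(d_head)
score_effective = E[pos_tok] @ W_effQK @ E[j_tok] / np.sqrt(d_head)
print(np.allclose(score_explicit, score_effective))   # True: the two forms agree
```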

The one-layer transformer can learn to selectively attend to previous tokens based on their relevance to the current position. This allows it to implement skip-trigram patterns by, for example, having position $\text{pos}$ attend strongly to positions $\text{pos}-2$ and $\text{pos}-1$ to predict the next token.

However, a one-layer transformer cannot implement the induction pattern (copying a token that followed a similar context elsewhere in the sequence) because each head operates independently on the original token embeddings.

Two-Layer Transformer: Composition and Emergent Capabilities

In a two-layer transformer, the output of the first layer’s attention heads becomes part of the input to the second layer’s heads, enabling composition. For a model with $H_1$ heads in layer 1 and $H_2$ heads in layer 2, the logits are:

$$\text{Logits}(c \mid \text{pos}) = \mathbf{S}^{(2)}[\text{pos},:] \cdot \mathbf{U}[:,c]$$

Where $\mathbf{S}^{(2)}$ is the residual stream after layer 2:

$$\mathbf{S}^{(2)}[\text{pos},:] = \mathbf{S}^{(1)}[\text{pos},:] + \sum_{h=1}^{H_2} \text{Head}_{2,h}(\mathbf{S}^{(1)})[\text{pos},:]$$

And $\mathbf{S}^{(1)}$ is the residual stream after layer 1:

$$\mathbf{S}^{(1)}[\text{pos},:] = \mathbf{E}[\text{pos},:] + \sum_{h=1}^{H_1} \text{Head}_{1,h}(\mathbf{E})[\text{pos},:]$$

Let’s consider the induction head mechanism in detail. Suppose we are at position $N$ in the sequence, and the pattern “A B” has appeared somewhere earlier in the sequence. Now at position $N-1$ we see token “A” again, and we want the model to predict “B” at position $N$. This requires:

  1. A “previous token” head ($H_1$) in layer 1 that copies token $N-1$’s representation (the new occurrence of “A”) to position $N$:

    $$\text{Head}_{1,h_1}(\mathbf{E})[N,:] \approx \mathbf{E}[N-1,:] \cdot \mathbf{W}_V^{(1,h_1)} \cdot \mathbf{W}_O^{(1,h_1)}$$

    This is achieved by having the OV circuit of $H_1$ approximate the identity ($\mathbf{W}_V^{(1,h_1)} \cdot \mathbf{W}_O^{(1,h_1)} \approx \mathbf{I}$) and having its QK circuit attend to the previous token.

  2. An “induction” head ($H_2$) in layer 2 that:

    a. Forms query vectors from the updated stream at position $N$, which now contains information about token $N-1$ (i.e., “A”):

    $$\mathbf{q}_N^{(2,h_2)} = \mathbf{S}^{(1)}[N,:] \cdot \mathbf{W}_Q^{(2,h_2)}$$

    This query includes contributions from $H_1$:

    $$\mathbf{q}_N^{(2,h_2)} = (\mathbf{E}[N,:] + \text{Head}_{1,h_1}(\mathbf{E})[N,:]) \cdot \mathbf{W}_Q^{(2,h_2)}$$
    $$\mathbf{q}_N^{(2,h_2)} \approx (\mathbf{E}[N,:] + \mathbf{E}[N-1,:] \cdot \mathbf{W}_V^{(1,h_1)} \cdot \mathbf{W}_O^{(1,h_1)}) \cdot \mathbf{W}_Q^{(2,h_2)}$$

    b. Forms key vectors for previous positions:

    $$\mathbf{k}_j^{(2,h_2)} = \mathbf{S}^{(1)}[j,:] \cdot \mathbf{W}_K^{(2,h_2)}$$
    $$\mathbf{k}_j^{(2,h_2)} = \left(\mathbf{E}[j,:] + \sum_{h=1}^{H_1} \text{Head}_{1,h}(\mathbf{E})[j,:]\right) \cdot \mathbf{W}_K^{(2,h_2)}$$

    c. Computes attention scores:

    $$e_{N,j}^{(2,h_2)} = \frac{\mathbf{q}_N^{(2,h_2)} \cdot \mathbf{k}_j^{(2,h_2)T}}{\sqrt{d_{\text{head}}}}$$

    The virtual weight matrix for this K-composition path is:

    $$\mathbf{W}_{\text{K-comp}} = \mathbf{W}_V^{(1,h_1)} \cdot \mathbf{W}_O^{(1,h_1)} \cdot \mathbf{W}_K^{(2,h_2)}$$

    If $\mathbf{W}_{\text{K-comp}}$ is structured appropriately, $H_2$ will attend strongly to positions where the token is the same as token $N-1$ (i.e., other occurrences of “A”).

    d. Once $H_2$ attends to previous occurrences of “A”, it then needs to shift attention to the tokens that followed them (i.e., “B”). This can be implemented through appropriate training of the QK circuit so that a match with the token at position $j$ focuses attention on position $j+1$.

    e. Finally, the OV circuit of $H_2$ copies the attended token (“B”) to position $N$:

    $$\text{Head}_{2,h_2}(\mathbf{S}^{(1)})[N,:] \approx \mathbf{S}^{(1)}[j+1,:] \cdot \mathbf{W}_V^{(2,h_2)} \cdot \mathbf{W}_O^{(2,h_2)}$$

    Where $j$ is the position of a previous occurrence of “A”.

This complex interaction between heads across layers enables the two-layer transformer to implement a form of in-context learning: predicting “B” after seeing “A” based on previously observed “A B” patterns. This capability emerges from the composition of simpler operations and cannot be achieved in models with fewer layers.
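
One common diagnostic for whether a trained head actually behaves this way is to feed the model a repeated random token sequence and measure how much attention it places on “induction targets”: positions whose preceding token matches the current token. The sketch below implements that bookkeeping; the attention pattern here is a random placeholder standing in for a real head’s pattern:

```python
import numpy as np

rng = np.random.default_rng(9)

# A repeated random sequence "A B C D ... A B C D ..." is the standard test input:
# an induction head at the second occurrence of a token should attend to the token
# that followed its first occurrence.
base = rng.integers(0, 50, size=20)
tokens = np.concatenate([base, base])
N = len(tokens)

# Placeholder attention pattern (rows = destination, cols = source). In practice this
# would be alpha from a real trained head; random values stand in here purely so the
# bookkeeping below runs.
alpha = rng.random((N, N)) * np.tril(np.ones((N, N)))
alpha /= alpha.sum(-1, keepdims=True)

def induction_score(alpha, tokens):
    """Average attention mass on positions whose *previous* token matches the current one."""
    scores = []
    for t in range(1, len(tokens)):
        targets = [j for j in range(1, t + 1) if tokens[j - 1] == tokens[t]]
        if targets:
            scores.append(alpha[t, targets].sum())
    return float(np.mean(scores)) if scores else 0.0

print(induction_score(alpha, tokens))   # near-chance for this random placeholder pattern
```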

The key insight is that the output of the first layer’s heads modifies the residual stream in a way that influences the attention patterns of the second layer’s heads. This composition enables the emergence of algorithmic capabilities that transcend what each individual head can do in isolation.

Conclusion

This mathematical framework provides the tools to dissect Transformers into their constituent computational parts: the residual stream as a communication bus, attention heads decomposed into QK and OV circuits, and the powerful concept of composition that allows simple components to build complex algorithms. By analyzing virtual weights, path expansions, and composition strengths, we can start to reverse-engineer the specific computations learned by these models.

This foundation is crucial for understanding phenomena like superposition within these architectures and for developing techniques to extract and validate the features and circuits that implement their remarkable capabilities. In Part 5, we will explore the validation of learned features and circuits, building on the mathematical framework established here.


References and Further Reading

This framework is primarily based on the work by Elhage et al. and Olsson et al. in the Transformer Circuits Thread: