Having explored how features might be represented (Part 2, Part 4), and how to analyze information flow (Part 3), we now turn to how these features and information pathways combine to form neural network circuits. A circuit is a subgraph of the neural network—comprising features (neurons or dictionary elements) as nodes and weights (or effective connections) as edges—that implements a specific, identifiable algorithm or computation. This part introduces the concept of circuits, their hierarchical nature, a taxonomy of common circuit motifs, and then explores how these concepts apply specifically to attention head circuits in Transformers.

Defining Neural Network Circuits: The Building Blocks of Learned Algorithms

In the pursuit of mechanistic interpretability, a “circuit” is more than an arbitrary grouping of network components. It refers to a specific, delineated computational mechanism within the broader neural network, hypothesized to execute a particular function or algorithm. The identification and rigorous understanding of these circuits are fundamental to reverse-engineering the processes by which models perform complex tasks.

A circuit is conceptually defined by three primary elements:

  • Nodes: These are the fundamental processing units of the circuit. Nodes can be individual neurons within the network’s architecture. More abstractly, and often more fruitfully for interpretability, nodes can represent monosemantic features—directions in activation space that correspond to single, coherent concepts, ideally derived via techniques like dictionary learning as discussed in Part 2. The aspiration is for these nodes to be interpretable units of meaning or computation.

  • Edges: These represent the pathways of influence and information flow between nodes, characterized by weighted connections. In intricate, multi-layer architectures such as Transformers, these edges often correspond to effective or virtual weights. As detailed in Part 3, such virtual weights summarize the net linear influence of one node (e.g., the output of a feature or component) on another, even if that influence traverses multiple layers and incorporates additive contributions from residual stream skip connections.

  • Algorithm: Each circuit is associated with a hypothesized algorithm—a description of the computation it is believed to perform. This could range from simple operations like “copy the representation of the previous token” or “detect occurrences of negated sentiment,” to more complex tasks such as “identify and complete a repeated sequence based on in-context examples.” The validity of this hypothesized algorithm is subject to rigorous empirical validation as outlined in Part 5.

The Hierarchical Nature of Neural Circuits

Neural circuits rarely exist in isolation; rather, they often exhibit a hierarchical organization, where simpler circuits combine to form more complex computational structures. This hierarchy allows for the construction of sophisticated algorithms from elementary operations, mirroring the compositional nature of traditional software or even biological neural systems. We can typically distinguish circuits at multiple scales of abstraction:

  1. Micro-circuits: These represent the most elementary computational units, the smallest identifiable building blocks performing fundamental operations. Their scope is typically highly localized, involving a few features or neurons and their direct connections.
    • Conceptual Examples: Consider a feature $f_{in}$ represented by a normalized direction vector $\mathbf{d}_{in}$ in an activation space. A micro-circuit that “copies” this feature would involve a transformation (e.g., by a weight matrix $\mathbf{W}$ from an OV circuit, or a segment of an MLP) such that an output feature direction $\mathbf{d}_{out}$ preserves information about $f_{in}$. If $f_{out}$ is the activation projected onto $\mathbf{d}_{out}$ and $f_{in}$ is the input activation, we might find $f_{out} \approx c \cdot f_{in}$. Another example is a single MLP neuron, whose activation $m = \text{ReLU}(\mathbf{w}^T \mathbf{x} + b)$ can act as a micro-circuit detecting a specific pattern if its weight vector $\mathbf{w}$ aligns with a particular combination of input features in $\mathbf{x}$ (a numerical sketch of both appears after this list).
  2. Meso-circuits (or Component Circuits): These are formed by the composition of several micro-circuits to execute a more substantial, yet still relatively localized, sub-task. Frequently, an entire architectural component, such as a single attention head in a Transformer, or a small, functionally cohesive group of MLP neurons, might constitute a meso-circuit.
    • Conceptual Examples: A full attention head, with its QK and OV sub-components, implementing a specific, interpretable attention pattern (e.g., a “previous token head” that consistently attends to and copies the representation of the token at position $t-1$ to position $t$) is a canonical meso-circuit. Similarly, imagine an ensemble of dictionary features $f_a$ and $f_b$. A meso-circuit for a logical AND operation might involve these features feeding into a downstream feature $f_c$ (perhaps via an intermediate MLP neuron) such that $f_c$ activates only when both $f_a$ and $f_b$ are simultaneously active at sufficient strength.
  3. Macro-circuits: These represent larger-scale computational structures, often spanning multiple layers and composing multiple meso-circuits. Macro-circuits are responsible for implementing significant, human-identifiable capabilities or high-level algorithms that contribute directly to the model’s overall task performance.
    • Conceptual Examples: The induction circuit in Transformers, which will be detailed in Part 7, is a prime example of a macro-circuit. It enables a basic form of in-context learning by composing the actions of multiple attention heads across different layers. Other macro-circuits might be responsible for complex reasoning steps or for specific stylistic controls in generative models. Research continues to identify and categorize new circuit motifs and specialized head functions, such as investigations into “successor heads” (Anthropic Circuits Update, September 2024) which may represent further specialized computational patterns.
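The following is a minimal NumPy sketch of the two micro-circuits described under item 1: a toy weight matrix that approximately copies a feature direction, and a single ReLU neuron acting as a detector. The dimensions, weights, and variable names are invented for illustration and are not taken from any real model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

# Copying micro-circuit: a toy W that maps the feature direction (mostly) onto itself.
d_in = rng.normal(size=d_model)
d_in /= np.linalg.norm(d_in)                       # normalized feature direction
W = 0.9 * np.outer(d_in, d_in) + 0.02 * rng.normal(size=(d_model, d_model))

d_out = d_in @ W                                   # image of the feature under W
c = d_out @ d_in                                   # scalar "copy coefficient"
alignment = c / np.linalg.norm(d_out)              # cosine similarity with d_in
print(f"copy coefficient ≈ {c:.2f}, alignment ≈ {alignment:.2f}")

# Detector micro-circuit: one ReLU neuron whose weight vector aligns with d_in.
relu = lambda z: np.maximum(z, 0.0)
w, b = d_in, -0.5
x_present = d_in + 0.05 * rng.normal(size=d_model)    # input containing the feature
x_absent = 0.05 * rng.normal(size=d_model)            # input without the feature
print("neuron activation with feature   :", relu(w @ x_present + b))
print("neuron activation without feature:", relu(w @ x_absent + b))
```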

The process of understanding often proceeds by identifying macro-level behaviors and then recursively decomposing them into their constituent meso- and, ultimately, micro-circuits, tracing the flow of information and computation down to its elementary steps.

A Taxonomy of Common Circuit Motifs: Recurring Computational Patterns

Through the analysis of various neural network architectures across different tasks, certain computational motifs—recurring patterns of feature interaction and information processing—have been observed. Developing a taxonomy of these canonical circuit types provides a valuable vocabulary and conceptual toolkit for dissecting and understanding network behavior. These motifs are not always perfectly clean or isolated, but they represent common strategies learned by networks.

1. Copying/Memory Circuits: Transmitting Information Faithfully

  • Function: These circuits are specialized in preserving and transmitting information largely unchanged from one part of the network to another, or across different token positions in a sequence. They act as conduits or short-term memory buffers.
  • Mathematical Signature & Elaboration: A key example is found in Transformer attention heads. The Output-Value (OV) circuit, characterized by the effective matrix $\mathbf{W}_{OV} = \mathbf{W}_V \mathbf{W}_O$, transforms the attended value vector $\mathbf{v}_j = \mathbf{x}_j \mathbf{W}_V$. If, for a specific input feature direction $\mathbf{d}$ (representing a concept to be copied), the transformation satisfies $(\mathbf{d}\mathbf{W}_V)\mathbf{W}_O \approx c\,\mathbf{d}$ for some scalar $c \neq 0$, then the feature direction $\mathbf{d}$ is effectively copied (potentially scaled) from the attended source token $j$ to the current token $i$’s residual stream. The Query-Key (QK) circuit determines which source token $j$ (and thus which specific instantiation of the feature $\mathbf{d}$) is selected for this copying operation (a small numerical check appears after this list).
  • Example: “Previous token heads” in language models frequently employ such OV circuits to copy the complete representation (or salient features) of the token at position $t-1$ to the residual stream at position $t$, providing immediate local context for the next prediction.
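As a rough illustration of this signature, the sketch below builds a toy OV matrix that copies a chosen feature direction and compares it against a generic random matrix, using a simple "copy score". All matrices are random stand-ins, not weights from a trained model.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 16

d = rng.normal(size=d_model)
d /= np.linalg.norm(d)

# A toy OV matrix that copies d (scaled by 0.8) plus a little noise, vs. a generic random one.
W_OV_copy = 0.8 * np.outer(d, d) + 0.02 * rng.normal(size=(d_model, d_model))
W_OV_rand = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)

def copy_score(d, W_OV):
    """Cosine similarity between a feature direction and its image under W_OV."""
    out = d @ W_OV
    return float(out @ d / np.linalg.norm(out))

print("copying OV circuit :", round(copy_score(d, W_OV_copy), 3))   # close to 1
print("generic OV circuit :", round(copy_score(d, W_OV_rand), 3))   # near 0 on average
```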

2. Inhibition Circuits: Implementing Conditional Suppression

  • Function: These circuits serve to suppress or negate the influence of certain features or the output of other circuits, often implementing a form of conditional logic or control.
  • Mathematical Signature & Elaboration: The most direct signature is the presence of negative weights or negative effective (virtual) weights in the connections between features. If a feature $g$ computes its activation (for simplicity, before the non-linearity) as $act_g = w_A f_A + w_B f_B + \dots + b_g$, and the weight $w_B$ is significantly negative, then a high activation of feature $f_B$ will reduce $act_g$, thereby inhibiting $g$. In a multi-layer circuit, a virtual weight between an upstream feature $f_U$ and a downstream feature $f_D$ might be negative, indicating that high $f_U$ tends to suppress $f_D$, even if the direct path involves multiple steps (see the sketch after this list).
  • Example: In a model designed to assess safety, a feature detecting toxic language might be inhibited by another feature that identifies the context as fictional (e.g., quoting a villain in a novel), preventing a false positive safety flag.
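A toy numerical version of the fictional-context example. The feature names, weights, and bias below are invented solely to show how a negative weight implements suppression.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

# Hypothetical upstream features (1.0 = active, 0.0 = inactive).
f_toxic_language = 1.0
f_fictional_context = 1.0     # e.g., the text is quoting a villain in a novel

# Downstream "flag for safety review" feature with an inhibitory weight on fiction.
w_toxic, w_fiction, bias = 2.0, -2.5, -0.5
flag = relu(w_toxic * f_toxic_language + w_fiction * f_fictional_context + bias)
print("flag (toxic, fictional)    :", flag)        # 0.0 -> suppressed by the fiction feature

flag = relu(w_toxic * f_toxic_language + w_fiction * 0.0 + bias)
print("flag (toxic, non-fictional):", flag)        # 1.5 -> fires
```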

3. Equivariant Circuits: Responding Consistently to Transformations

  • Function: Equivariant circuits are characterized by their consistent response patterns when their inputs undergo specific transformations. For instance, translation equivariance means that if the input is shifted, the output is correspondingly shifted. This is a powerful inductive bias for tasks where patterns can occur at different positions or orientations.
  • Mathematical Signature & Elaboration: The structure of the circuit’s weight matrices must exhibit particular symmetries. For a 1D sequence (like text), if $T_k$ is an operator that shifts an input sequence $\mathbf{x}$ by $k$ positions, a circuit $C$ is translation equivariant if $C(T_k(\mathbf{x})) \approx T_k(C(\mathbf{x}))$ (or is identical up to boundary effects). In Transformers, this can be approximated if the QK and OV matrices interact with positional encodings in a way that prioritizes relative positions, or if attention scores are primarily content-based and relative positions are handled by other mechanisms. Convolutional Neural Networks (CNNs) are a classical example where weight sharing explicitly builds in translation equivariance (as in the sketch after this list).
  • Example: An attention head in a vision transformer might learn to detect a specific local texture (e.g., fur) regardless of where that texture appears in the image (approximate translation equivariance).
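The sketch below checks the equivariance identity $C(T_k(\mathbf{x})) \approx T_k(C(\mathbf{x}))$ for a 1-D convolution with a shared kernel, which is the classical weight-sharing case mentioned above. The signal and kernel are arbitrary; the comparison is restricted to the interior because `np.roll` wraps around while the convolution does not.

```python
import numpy as np

# Translation equivariance of a 1-D convolution: shifting the input shifts the output.
rng = np.random.default_rng(2)
x = rng.normal(size=32)
kernel = np.array([0.25, 0.5, 0.25])              # a shared, position-independent filter

def circuit(x):
    return np.convolve(x, kernel, mode="same")

def shift(x, k):
    return np.roll(x, k)

k = 5
lhs = circuit(shift(x, k))                         # C(T_k(x))
rhs = shift(circuit(x), k)                         # T_k(C(x))
interior = slice(k + 1, -k - 1)                    # compare away from the boundaries
print("max interior difference:", np.max(np.abs(lhs[interior] - rhs[interior])))
```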

4. Union Circuits (OR-like Behavior): Detecting Presence of Any

  • Function: These circuits activate if any one or more of a predefined set of input features are present or sufficiently active. They implement a logical OR-like operation.
  • Mathematical Signature & Elaboration: Consider a downstream feature (or neuron) $g$ whose activation before the non-linearity is $act_g = \sum_i w_i f_i - b$, where $f_i$ are upstream features and $w_i > 0$. If the features $f_i$ are approximately binary (e.g., 0 for absent, 1 for present after some normalization, or if they are dictionary features from a ReLU autoencoder), an OR-like behavior is achieved if the bias $b$ is set to be less than the smallest effective contribution from any single active feature (i.e., $b < \min_i(w_i)$ if $f_i = 1$ is the activation level). In this case, if any $f_i$ becomes active, $w_i f_i - b > 0$, causing $g$ to activate (assuming a ReLU or similar threshold non-linearity); see the sketch after this list.
  • Example: A feature representing “mentions a pet animal” might activate whenever any one of several upstream features (e.g., “dog”, “cat”, or “hamster”) is active, regardless of which specific animal is present.
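A minimal sketch of the OR-like threshold condition, with invented binary features and weights chosen so that $b < \min_i(w_i)$:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

w = np.array([1.0, 1.2, 0.9])          # positive weights onto the downstream feature
b = 0.5                                 # b < min(w): any single active input clears the bias

def g(features):
    return relu(w @ features - b)

print(g(np.array([1.0, 0.0, 0.0])))     # 0.5 -> active (one input suffices)
print(g(np.array([0.0, 0.0, 1.0])))     # 0.4 -> active
print(g(np.array([0.0, 0.0, 0.0])))     # 0.0 -> inactive
```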

5. Intersection Circuits (AND-like Behavior): Detecting Joint Presence

  • Function: These circuits activate only if all features within a specific set (or a critical combination of them) are simultaneously present or active. They implement a logical AND-like operation.
  • Mathematical Signature & Elaboration: This is often a role for MLP neurons, or can be constructed with dictionary features. Consider an MLP neuron with activation $m = \text{ReLU}(w_1 f_1 + w_2 f_2 - b)$, where $f_1, f_2$ are binary-like input features (active = 1, inactive = 0) and $w_1, w_2 > 0$. For AND behavior, the bias $b$ must be calibrated such that the neuron only activates if both $f_1$ and $f_2$ are active. Specifically, we need the sum of weighted inputs to be greater than the bias only when both are active. This means: $w_1 - b \le 0$ (or $< 0$ for strict AND), $w_2 - b \le 0$ (or $< 0$), but $w_1 + w_2 - b > 0$. This implies that the bias $b$ must lie in the range $[\max(w_1, w_2),\, w_1 + w_2)$. Such a configuration ensures that the individual presence of either $f_1$ or $f_2$ is insufficient to overcome the bias, but their combined weighted sum is (see the sketch after this list).
  • Example: A feature intended to identify “a capital city located in the European continent” would require the co-activation of an upstream feature for “is a capital city” AND another feature for “is located in Europe.”
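The corresponding sketch for the AND-like case, with invented weights chosen so that the bias falls in the required range $[\max(w_1, w_2),\, w_1 + w_2)$:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

w1, w2 = 1.0, 1.0
b = 1.5                                 # max(w1, w2) <= b < w1 + w2

def m(f1, f2):
    return relu(w1 * f1 + w2 * f2 - b)

print(m(1.0, 1.0))                      # 0.5 -> active only when both features are present
print(m(1.0, 0.0))                      # 0.0
print(m(0.0, 1.0))                      # 0.0
```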

6. Compositional Circuits (Sequential Algorithmic Processing)

  • Function: These circuits are fundamental to building multi-step algorithms. They combine the outputs of earlier-stage circuits or features to compute more complex functions, where the operation of a later stage is contingent upon the results from an earlier stage.
  • Mathematical Signature & Elaboration: The concept of composition was extensively detailed in Part 3 with respect to Transformer attention heads. For instance, Q-composition occurs when the output of a head $H_1$ (denoted $O_1$) is added to the residual stream $\mathbf{r}$, resulting in a modified stream $\mathbf{r}' = \mathbf{r} + O_1$. A subsequent head $H_2$ then forms its query vectors using this modified stream: $\mathbf{q}^{(H_2)} = \mathbf{r}' \mathbf{W}_Q^{(H_2)} = (\mathbf{r} + O_1) \mathbf{W}_Q^{(H_2)}$. This equation explicitly demonstrates how the output of $H_1$ influences the attention pattern of $H_2$ by altering its query vectors. Similar logic applies to K-composition (influencing key vectors) and V-composition (influencing value vectors, and thus the information retrieved and processed); a sketch follows this list.
  • Example: Induction heads, which enable basic in-context learning in Transformers (discussed in Part 7), are a powerful and well-studied instance of compositional circuits, typically involving the sequential operation of at least two attention heads.
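A numerical sketch of this composition under the row-vector convention used above. The matrices are random placeholders; the point is only that adding $O_1$ to the residual stream changes the downstream head's queries and keys, and hence its attention scores.

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, d_head, T = 16, 4, 5

W_Q2 = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)   # head H2's query matrix
W_K2 = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)   # head H2's key matrix

r = rng.normal(size=(T, d_model))         # residual stream before head H1 writes
O1 = rng.normal(size=(T, d_model)) * 0.5  # head H1's output contribution

def attention_scores(stream):
    q = stream @ W_Q2
    k = stream @ W_K2
    return (q @ k.T) / np.sqrt(d_head)

scores_without = attention_scores(r)
scores_with = attention_scores(r + O1)    # Q- and K-composition both occur here
print("max score change from H1's output:", np.max(np.abs(scores_with - scores_without)))
```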

Attention Head Circuits: Specialized Meso-Circuits in Transformers

Attention heads in Transformer models serve as particularly well-studied examples of meso-circuits. Each head can be decomposed into Query-Key (QK) and Output-Value (OV) circuits that work together to implement specific attention patterns and information processing functions. Understanding these patterns is key to deciphering how Transformers process sequential information. The study of these patterns also extends to understanding their consistency and potential universality across different layers, models, or training runs. Techniques such as “Crosscoder Model Diffing” and the use of “Sparse Crosscoders” (Anthropic Circuits Updates, Feb 2025, Oct 2024) are being explored to identify features and attention mechanisms that are conserved, shedding light on common learned strategies.

Recall that for an input token representation $\mathbf{x} \in \mathbb{R}^{d_{\text{model}}}$, an attention head, with weight matrices $\mathbf{W}_Q, \mathbf{W}_K \in \mathbb{R}^{d_{\text{model}} \times d_{\text{head}}}$ and $\mathbf{W}_V \in \mathbb{R}^{d_{\text{model}} \times d_{\text{head}}}$, $\mathbf{W}_O \in \mathbb{R}^{d_{\text{head}} \times d_{\text{model}}}$, computes:

  • Queries: $\mathbf{q} = \mathbf{x} \mathbf{W}_Q \in \mathbb{R}^{d_{\text{head}}}$
  • Keys: $\mathbf{k} = \mathbf{x} \mathbf{W}_K \in \mathbb{R}^{d_{\text{head}}}$
  • Values: $\mathbf{v} = \mathbf{x} \mathbf{W}_V \in \mathbb{R}^{d_{\text{head}}}$

Attention scores $\alpha_{ij} = \text{Softmax}_j\!\left(\frac{\mathbf{q}_i \mathbf{k}_j^T}{\sqrt{d_{\text{head}}}}\right)$ determine how much the query token $i$ attends to the key token $j$. The head’s output contribution to the residual stream for token $i$ is $\mathbf{o}_i = \left(\sum_j \alpha_{ij} \mathbf{v}_j\right) \mathbf{W}_O \in \mathbb{R}^{d_{\text{model}}}$. A compact sketch of this computation appears below.
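For concreteness, here is a minimal single-head forward pass in NumPy following the definitions above (single sequence, no causal mask, no layer norm). Shapes follow the row-vector convention and all weights are random placeholders rather than trained parameters.

```python
import numpy as np

rng = np.random.default_rng(4)
T, d_model, d_head = 6, 16, 4

X = rng.normal(size=(T, d_model))                        # token representations x_1 ... x_T
W_Q = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)
W_K = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)
W_V = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)
W_O = rng.normal(size=(d_head, d_model)) / np.sqrt(d_head)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V                      # (T, d_head) each
alpha = softmax(Q @ K.T / np.sqrt(d_head))               # (T, T) attention pattern
O = (alpha @ V) @ W_O                                    # (T, d_model) residual-stream writes

print("attention rows sum to 1:", np.allclose(alpha.sum(axis=1), 1.0))
print("output shape:", O.shape)
```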

The QK Circuit: Generating Diverse Attention Patterns

The QK circuit, primarily characterized by the effective weight matrix $\mathbf{W}_{QK} = \mathbf{W}_Q \mathbf{W}_K^T \in \mathbb{R}^{d_{\text{model}} \times d_{\text{model}}}$, is responsible for determining where a head directs its attention. The unnormalized attention score $e_{ij} = \mathbf{x}_i \mathbf{W}_{QK} \mathbf{x}_j^T$ between a query token $i$ (with representation $\mathbf{x}_i$) and a key token $j$ (with representation $\mathbf{x}_j$) dictates the strength of this connection. The diverse structures learned within different $\mathbf{W}_{QK}$ matrices give rise to a taxonomy of common attention patterns, each serving distinct computational roles:

Previous Token Heads are fundamental for capturing immediate local context. They specialize in attending strongly to the token immediately preceding the current token (e.g., the token at position $t$ queries information from the token at position $t-1$). Their mathematical signature often involves $\mathbf{W}_{QK}$ being structured to leverage differences in positional encodings. For instance, the query projection $\mathbf{x}_t \mathbf{W}_Q$ might become sensitive to aspects of the positional encoding unique to position $t$, while the key projection $\mathbf{x}_{t-1} \mathbf{W}_K$ similarly captures aspects of position $t-1$. The matrix $\mathbf{W}_Q \mathbf{W}_K^T$ would then be such that the dot product $(\mathbf{x}_t \mathbf{W}_Q)(\mathbf{x}_{t-1} \mathbf{W}_K)^T$ is maximized. These heads are crucial for tasks like autoregressive language modeling, enabling predictions based on the most recent token and forming rudimentary n-gram-like statistics (e.g., bigrams). A toy construction of such a positional $\mathbf{W}_{QK}$ is sketched below.
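The sketch below constructs a toy $\mathbf{W}_{QK}$ that produces a previous-token attention pattern, using one-hot positional encodings as a stand-in for real token representations. This is an idealized illustration rather than a learned head.

```python
import numpy as np

T = 6
X = np.eye(T)                                  # toy inputs: one-hot positional encodings

# e_ij = x_i W_QK x_j^T is large exactly when j = i - 1: a sub-diagonal score matrix.
W_QK = 8.0 * np.eye(T, k=-1)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

scores = X @ W_QK @ X.T                        # (T, T) unnormalized attention scores
mask = np.tril(np.ones((T, T)))                # causal mask: attend only to j <= i
scores = np.where(mask > 0, scores, -np.inf)
alpha = softmax(scores)

print(np.round(alpha, 2))                      # each row t places nearly all mass on t-1
```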

Positional or Fixed Offset Heads generalize this concept by attending to tokens at various fixed relative offsets from the current token (e.g., $t-k$ for some small integer $k$), or to specific absolute positions within the sequence, most notably the beginning-of-sequence (BOS/CLS) token. The $\mathbf{W}_{QK}$ matrix in such heads interacts strongly with specific components of the positional encodings. For example, a BOS head might have its $\mathbf{W}_Q$ matrix tuned to respond to the unique positional encoding of the BOS token, or its $\mathbf{W}_K$ matrix might be structured to make the BOS token’s key vector highly attractive to many query positions. BOS/CLS heads often serve to aggregate global information from across the sequence, while other fixed offset heads can capture skip-gram-like information or detect structural properties related to fixed distances in text (e.g., attending two positions back to check for a specific grammatical pattern).

Same Token Heads, also known as diagonal attention heads, primarily focus attention on the current token itself (i.e., position $t$ attends to position $t$). Mathematically, their $\mathbf{W}_{QK}$ matrix might be structured to produce high scores when the query and key positions are identical. This could occur if, for certain feature subspaces represented by a direction $\mathbf{d}$, the score $\mathbf{d}\,\mathbf{W}_{QK}\,\mathbf{d}^T$ is significantly larger than $\mathbf{d}\,\mathbf{W}_{QK}\,\mathbf{d}'^T$, where $\mathbf{d}'$ is orthogonal to $\mathbf{d}$ (and comes from a different position). Such heads can be instrumental in gathering and potentially amplifying or transforming features from the current token’s own representation before further processing by the OV circuit or subsequent layers.

Content-Based or Pattern-Matching Heads offer more dynamic and flexible attention mechanisms. Instead of relying on fixed positions, they attend to tokens whose content (as encoded in their vector representations $\mathbf{x}_j$) matches a pattern or concept being sought by the query vector $\mathbf{x}_i \mathbf{W}_Q$. The matrix $\mathbf{W}_{QK}$ in these heads is structured to compute high similarity scores (e.g., via dot products) between query vectors that represent a certain type of semantic or syntactic information and key vectors from tokens that embody that information. Often, such $\mathbf{W}_{QK}$ matrices are found to be low-rank. If $\mathbf{W}_{QK} = \sum_k \sigma_k \mathbf{u}_k \mathbf{v}_k^T$ is its singular value decomposition (where $\mathbf{u}_k, \mathbf{v}_k$ are left and right singular vectors, respectively, and $\sigma_k$ are singular values), then the unnormalized attention score contribution from the $k$-th component is $\sigma_k (\mathbf{x}_i \mathbf{u}_k)(\mathbf{x}_j \mathbf{v}_k)$. This formulation explicitly shows how the alignment of the query token’s representation $\mathbf{x}_i$ with a “query pattern” direction $\mathbf{u}_k$ and the key token’s representation $\mathbf{x}_j$ with a corresponding “key pattern” direction $\mathbf{v}_k$ determines the attention strength for that component (the decomposition is sketched numerically below). These heads are functionally diverse, enabling the model to, for instance, locate all mentions of a previously introduced entity, find words semantically related to a query concept, or identify tokens matching a learned syntactic role, irrespective of their absolute or relative positions.
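The sketch below verifies this decomposition numerically on a synthetic low-rank $\mathbf{W}_{QK}$: the full bilinear score equals the sum of per-component contributions $\sigma_k (\mathbf{x}_i \mathbf{u}_k)(\mathbf{x}_j \mathbf{v}_k)$. All quantities are made up; with a real head one would use its actual $\mathbf{W}_Q \mathbf{W}_K^T$.

```python
import numpy as np

rng = np.random.default_rng(5)
d_model, rank = 16, 3

# A synthetic low-rank W_QK, as might arise because d_head << d_model.
W_QK = sum(np.outer(rng.normal(size=d_model), rng.normal(size=d_model)) for _ in range(rank))

x_i = rng.normal(size=d_model)                     # query token representation
x_j = rng.normal(size=d_model)                     # key token representation

U, S, Vt = np.linalg.svd(W_QK)
per_component = S * (x_i @ U) * (x_j @ Vt.T)       # sigma_k (x_i . u_k)(x_j . v_k)

print("full score        :", x_i @ W_QK @ x_j)
print("sum of components :", per_component.sum())
print("effective rank    :", int((S > 1e-8 * S[0]).sum()))
```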

Finally, Broadcast or Diffuse Heads exhibit attention patterns that are spread very broadly, often with near-uniform attention weights distributed across many tokens in the context, or they may attend to no specific tokens (e.g., focusing only on padding tokens if present). Their mathematical signature might be a $\mathbf{W}_{QK}$ matrix that is close to a zero matrix, or one structured such that the pre-softmax scores $e_{ij} / \sqrt{d_{\text{head}}}$ are nearly constant for many pairs $(i, j)$, leading to $\alpha_{ij} \approx 1/N_{\text{context}}$, where $N_{\text{context}}$ is the number of attended tokens. Alternatively, the query vectors $\mathbf{x}_i \mathbf{W}_Q$ produced by such heads might simply have very small magnitudes. These heads can effectively act as a passthrough for the residual stream if their OV circuit also performs minimal processing (i.e., is close to an identity transformation). They are sometimes interpreted as being effectively “turned off” or perhaps waiting for a very specific, rare trigger pattern. In other cases, they might contribute to a default aggregation or smoothing of information from the context.

The OV Circuit: Processing and Routing Attended Information

Once the QK circuit has determined where the head should attend by producing attention weights $\alpha_{ij}$, the Output-Value (OV) circuit takes over. Defined by the effective matrix $\mathbf{W}_{OV} = \mathbf{W}_V \mathbf{W}_O \in \mathbb{R}^{d_{\text{model}} \times d_{\text{model}}}$, this circuit dictates what information is moved from the attended value vectors ($\mathbf{v}_j = \mathbf{x}_j \mathbf{W}_V$) and, crucially, how this information is transformed before being written back to the residual stream. The final output contribution is

$$\mathbf{o}_i = \Big(\sum_j \alpha_{ij}\, \mathbf{x}_j \mathbf{W}_V\Big) \mathbf{W}_O$$

Common functional roles implemented by the OV circuit include:

Copying Information: This is perhaps the most straightforward function. The OV circuit aims to transmit the attended information largely unchanged. Mathematically, this implies that for the subspace of features being attended to, the matrix $\mathbf{W}_{OV}$ acts approximately as a scaled identity matrix, i.e., $\mathbf{W}_{OV} \approx c\mathbf{I}$ for some scalar $c$. If a vector $\mathbf{z}$ lies in this feature subspace, then $\mathbf{z} \mathbf{W}_{OV} \approx c \mathbf{z}$. When combined with a specific attention pattern, this allows for targeted information routing. For instance, a previous token head (determined by the QK circuit) paired with a copying OV circuit will effectively copy the representation of token $t-1$ to the residual stream at position $t$.

Feature Transformation or Extraction: More sophisticated OV circuits perform specific linear transformations on the attended value vectors, rather than just copying them. In such cases, $\mathbf{W}_{OV}$ is not an identity matrix. It might project the aggregated, attended information $\sum_j \alpha_{ij} \mathbf{v}_j$ onto a specific subspace, effectively extracting or emphasizing certain features while diminishing others. If $\mathbf{W}_{OV}$ has a low effective rank, its operation can be seen as projecting the attended information onto a lower-dimensional subspace defined by its principal left singular vectors (if thinking of $\mathbf{W}_{OV}$ as transforming column vectors) or right singular vectors (for row vectors, as used here). For example, a content-based head might attend to all textual mentions of “dates,” and its OV circuit could then transform these varied date representations into a standardized canonical format or extract a specific abstract feature like “is this date in the past?”

Information Suppression or No-Op Output: Some OV circuits may learn to effectively suppress the flow of information. This can occur if $\mathbf{W}_{OV} \approx \mathbf{0}$ (the zero matrix), or if the value vectors $\mathbf{v}_j = \mathbf{x}_j \mathbf{W}_V$ produced by the value projection matrix $\mathbf{W}_V$ are themselves consistently near zero for the types of inputs the head attends to. In such cases, even if the QK circuit generates a clear and specific attention pattern, the head writes little or nothing back to the residual stream, rendering it a near no-op in terms of its output contribution. However, the attention scores themselves could, in principle, still be read or utilized by other components in more complex interpretability scenarios, though this is less common for standard model operations.

Analyzing Head Behavior: Connecting Weights to Algorithmic Function

To gain a comprehensive understanding of an individual attention head’s algorithmic role, it is essential to analyze both its QK and OV circuits in conjunction. Several analytical techniques aid this process:

  • Singular Value Decomposition (SVD) of $\mathbf{W}_{QK}$ and $\mathbf{W}_{OV}$: Performing SVD on these effective matrices can reveal their principal directions of transformation and their effective rank. The magnitudes of the singular values indicate the importance or strength of transformation along each corresponding singular vector direction. A rapid decay in singular values suggests that the matrix is low-rank and thus specializes in processing information within a smaller subspace. The singular vectors themselves identify the specific input and output feature directions that the matrix primarily operates on. For $\mathbf{W}_{QK}$, these directions in $\mathbb{R}^{d_{\text{model}}}$ show what types of query-information and key-information are being compared. For $\mathbf{W}_{OV}$, they show what types of information from attended tokens are read into the head (via $\mathbf{W}_V$) and what types of information are written out (via $\mathbf{W}_O$).

  • Maximal Activating Examples for Query and Key Directions: Once SVD (or other feature discovery methods) identifies important operational directions (e.g., singular vectors of $\mathbf{W}_Q$ or $\mathbf{W}_K$), examining dataset examples that maximally activate these directions helps to assign semantic labels to them. This provides insight into what kinds of input features or concepts cause high activations in the query and key vectors along these critical processing pathways.

  • Probing the $\mathbf{W}_{OV}$ Transformation: To understand how $\mathbf{W}_{OV}$ transforms information, one can test its effect on known or hypothesized input feature directions. If $\mathbf{d}_{\text{in}}$ represents an interpretable input feature (e.g., a dictionary feature from an autoencoder), one can compute $\mathbf{d}_{\text{out}} = (\mathbf{d}_{\text{in}}\mathbf{W}_V)\mathbf{W}_O$ and then analyze $\mathbf{d}_{\text{out}}$ to see if it corresponds to a meaningful or predictable transformation of $\mathbf{d}_{\text{in}}$. This helps to characterize the specific computation performed by the OV circuit (a combined sketch of these analyses follows this list).
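A small sketch combining the first and third techniques on synthetic matrices: inspecting the singular value spectrum of $\mathbf{W}_{OV}$ for effective rank, and probing how a chosen input direction is transformed. With weights from a real model, the same code would be applied to the head's actual $\mathbf{W}_V$ and $\mathbf{W}_O$.

```python
import numpy as np

rng = np.random.default_rng(6)
d_model, d_head = 32, 4

# Stand-ins for a head's value and output projections (rank is at most d_head).
W_V = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)
W_O = rng.normal(size=(d_head, d_model)) / np.sqrt(d_head)
W_OV = W_V @ W_O

# 1. Effective rank from the singular value spectrum.
S = np.linalg.svd(W_OV, compute_uv=False)
effective_rank = int((S > 1e-6 * S[0]).sum())
print("top singular values:", np.round(S[:6], 3), "-> effective rank", effective_rank)

# 2. Probe a hypothesized input feature direction through the OV circuit.
d_in = rng.normal(size=d_model)
d_in /= np.linalg.norm(d_in)
d_out = (d_in @ W_V) @ W_O
cos = d_out @ d_in / np.linalg.norm(d_out)
print("cosine(d_in, d_out):", round(float(cos), 3))   # near 1 would indicate copying
```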

Mathematical Formalisms for Describing Circuits: A Language for Analysis

To analyze and communicate the structure and function of neural circuits, several mathematical formalisms are employed, providing a precise language for their description.

  • Graph Theory: Circuits are inherently graph-like structures. Features (or neurons) form the nodes of the graph, and the (effective) weights or connections between them constitute the directed, weighted edges. Graph-theoretic concepts such as paths, path lengths, connectivity, and motifs can then be applied to analyze circuit properties. For example, identifying strongly connected components or critical paths can offer insights into information flow and processing bottlenecks.

  • Path Expansion: The overall computational effect of a circuit, particularly for linear or piece-wise linear segments, can often be understood by summing or composing the contributions of all information-carrying paths through it. For a purely linear chain of transformations where an input vector $\mathbf{x}$ is sequentially multiplied by weight matrices $\mathbf{W}_1, \mathbf{W}_2, \dots, \mathbf{W}_k$, the final output vector $\mathbf{y}$ is given by $\mathbf{y} = \mathbf{x} \mathbf{W}_1 \mathbf{W}_2 \dots \mathbf{W}_k = \mathbf{x} \left(\prod_{i=1}^k \mathbf{W}_i\right)$. When non-linearities like ReLU or Softmax are present, a full path expansion becomes more complex, often involving a sum over many linearized segments corresponding to specific activation patterns (e.g., which ReLUs are active or inactive). This allows for attributing the final output back to initial inputs through specific pathways (see the sketch after this list).

  • Virtual Weights: As elaborated in Part 3, virtual weights are an indispensable tool for analyzing circuits that span multiple layers within residual architectures like Transformers. The virtual weight matrix $\mathbf{W}_{I \rightarrow J}$ from an upstream component $I$ to a downstream component $J$ encapsulates the total effective linear transformation that a signal from $I$ undergoes before influencing $J$, accounting for all intermediate additive contributions from the residual stream. This abstraction simplifies the analysis of long-range interactions by collapsing complex multi-step pathways into single effective linear maps.
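A compact sketch of both ideas on toy matrices: the product of a chain of linear maps reproduces the sequential computation exactly, and in a residual setup the "virtual weight" between two components is simply the product of the upstream write matrix and the downstream read matrix. The component names and shapes below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
d = 8

# Path expansion: a purely linear chain collapses to a single matrix product.
W1, W2, W3 = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
x = rng.normal(size=d)
sequential = ((x @ W1) @ W2) @ W3
collapsed = x @ (W1 @ W2 @ W3)
print("path expansion exact:", np.allclose(sequential, collapsed))

# Virtual weights: component I writes into the residual stream with W_out_I;
# component J reads from it with W_in_J. The direct linear effect of I on J
# is the single effective map W_out_I @ W_in_J.
d_head = 4
W_out_I = rng.normal(size=(d_head, d)) / np.sqrt(d_head)   # I's write to the stream
W_in_J = rng.normal(size=(d, d_head)) / np.sqrt(d)         # J's read from the stream
W_virtual = W_out_I @ W_in_J                               # (d_head, d_head) virtual weight
signal = rng.normal(size=d_head)
print("virtual weight matches two-step path:",
      np.allclose((signal @ W_out_I) @ W_in_J, signal @ W_virtual))
```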

Attention Heads as Composable Meso-Circuits

Each attention head, with its distinct QK-defined attention pattern and its OV-defined information processing function, acts as a self-contained meso-circuit. These meso-circuits are the fundamental building blocks that, through mechanisms of composition (such as Q-composition, K-composition, and V-composition, detailed in Part 3), combine to form more complex, multi-layered macro-circuits. A prime example of such emergent complexity is the induction head circuit (which we will explore in detail in Part 7), typically formed by the synergistic composition of at least two simpler attention heads, such as a previous token head and a pattern-matching/copying head, to achieve basic in-context learning.

The Scientific Process of Discovering and Validating Circuits

The identification and understanding of neural circuits is rarely a one-shot discovery but rather an iterative scientific endeavor, mirroring the classical cycle of hypothesis, experimentation, and refinement. This process typically involves:

  1. Observation and Hypothesis Generation: The process often begins with observing a specific behavior, capability, or failure mode of the model. Based on this observation and an understanding of the model architecture, researchers formulate hypotheses about which components (neurons, features, attention heads) and pathways constitute a circuit responsible for the observed phenomenon.

  2. Experimental Validation: The hypothesized circuit and its proposed function are then subjected to rigorous empirical testing. This involves employing a suite of validation techniques as detailed in Part 5, such as causal interventions (e.g., activation patching, ablation studies), examining maximal activating examples for key nodes, analyzing behavior on synthetic stimuli, and tracing information flow.

  3. Refinement and Iteration: The results of these experiments provide evidence that may support, contradict, or necessitate refinement of the initial hypothesis. If predictions based on the hypothesized circuit are not borne out, the circuit model must be revised or rejected. This iterative loop of proposing a circuit, predicting its behavior under various conditions, experimentally testing these predictions, and updating the circuit model based on the outcomes is central to advancing mechanistic understanding.

Conclusion: Circuits as the Key to Algorithmic Understanding

Conceptualizing neural networks as collections of interacting computational circuits offers a powerful and structured paradigm for demystifying their internal mechanisms. By identifying these circuits, categorizing them based on recurring computational motifs, and understanding their hierarchical organization from micro- to macro-scales, we can begin to unravel the complex, learned algorithms that underpin their capabilities.

Attention heads in Transformers serve as particularly well-studied examples of meso-circuits, demonstrating how the general circuit taxonomy applies to specific architectural components. Their decomposition into QK and OV circuits reveals diverse attention patterns—from fixed positional biases to dynamic content-based matching—and various information processing functions, from copying to feature transformation.

The mathematical formalisms of graph theory, path expansion, and virtual weights provide the analytical language, while a rigorous, iterative process of hypothesis and validation provides the empirical grounding. This approach allows the field to move beyond treating models as opaque black boxes towards achieving detailed, mechanistic explanations of their behavior.

Subsequent parts of this series will delve into the analysis of specific, well-characterized circuits found within Transformer models, focusing particularly on the remarkable phenomenon of in-context learning through induction heads.


References and Further Reading