I. Introduction

This series has charted a course through the burgeoning field of mechanistic interpretability, starting from the foundational concepts of meaningful features and the circuits paradigm (Part 1). We explored the challenge of polysemanticity and how techniques like dictionary learning aim to recover monosemantic features from superposed representations (Part 2, Part 3). We delved into the mathematical framework for analyzing information flow in Transformers (Part 4), established rigorous methodologies for validating features and circuits (Part 5), and surveyed a taxonomy of common circuit motifs, including specialized attention patterns (Part 6). A significant milestone in this journey was the detailed examination of induction heads, showcasing a concrete, reverse-engineered mechanism for in-context learning (Part 7).

This concluding installment turns to the cutting edge of the field, exploring two pivotal areas. First, advanced methodologies like Circuit Tracing, which promise a more granular, dynamic understanding of how models execute computations in response to specific inputs. Second, the application of mechanistic interpretability techniques to State-of-the-Art (SOTA) models, gleaning insights from the internal workings of the largest and most capable AI systems currently available. These advancements are not merely incremental; they represent a significant leap in our ability to dissect and comprehend the increasingly complex algorithms learned by modern neural networks, with profound implications for safety, alignment, and the fundamental science of AI.

II. Circuit Tracing: Unveiling Step-by-Step Computational Graphs

Standard circuit analysis identifies the components (features, attention heads) and their static connections (effective weights). However, the actual computation performed by a model is dynamic and input-dependent. Circuit Tracing aims to reconstruct the specific “active” computational subgraph that a model executes for a given input, leading to a particular output. This provides a granular, step-by-step view of the model’s internal algorithm in action.

Motivation: From Static Blueprints to Dynamic Execution

Consider a complex circuit, like a multi-head attention block followed by an MLP. While we can analyze the $\mathbf{W}_{QK}$, $\mathbf{W}_{OV}$, and MLP weight matrices in isolation, these static views don't fully capture which specific information pathways are critical for processing a particular input sequence $\mathbf{X} = (x_1, x_2, \dots, x_T)$ to produce an output (e.g., predicting token $x_{T+1}$). Circuit Tracing addresses this by asking: "For input $\mathbf{X}$, which specific neurons, attention links, and feature activations were most instrumental in reaching the observed output, and how did they quantitatively contribute?"

Conceptual Framework: The “Active Subgraph”

The core idea is to view the entire model computation as a vast Directed Acyclic Graph (DAG), $G = (V, E)$, where:

  • Nodes $v \in V$ represent activation vectors at specific points in the model:
    • Token embeddings: $\mathbf{e}_t = \text{Embed}(x_t)$
    • Residual stream states: $\mathbf{r}^{(l)}_t$ (the output of layer $l-1$ at token position $t$)
    • Attention head outputs: $\mathbf{h}^{(l,h)}_t = \text{Head}^{(l,h)}(\mathbf{r}^{(l)}_t, \text{context})$
    • MLP layer outputs: $\mathbf{m}^{(l)}_t = \text{MLP}^{(l)}(\mathbf{r}^{(l)}_t)$
    • Individual neuron/feature activations within these vectors.
  • Edges $e \in E$ represent the mathematical operations (matrix multiplications, attention computations, non-linearities, additions in the residual stream) that transform one set of activations into another.

For any given input $\mathbf{X}$ and a target output $Y$ (e.g., the logit for a specific token $y$ at position $T+1$), Circuit Tracing aims to identify a sparse, "active subgraph" $G'_{\mathbf{X},Y} \subset G$ that contains the primary computational path contributing to $Y$.
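
To make the "active subgraph" concrete, here is a minimal sketch of a data structure one might use to record the surviving nodes and influence-weighted edges of $G'_{\mathbf{X},Y}$. This is purely illustrative (names such as `TraceNode` and `TraceEdge` are not from any published implementation):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class TraceNode:
    """One activation in the computation graph, identified by where it lives."""
    layer: int                 # transformer layer index
    kind: str                  # "embed", "resid", "attn_head", "mlp", "logit"
    token: int                 # token position
    head: int | None = None    # attention head index, if applicable
    dim: int | None = None     # feature/neuron index, if traced at that granularity

@dataclass
class TraceEdge:
    """A surviving operation connecting two nodes, weighted by propagated influence."""
    src: TraceNode
    dst: TraceNode
    op: str                    # e.g. "W_Q", "softmax", "residual_add", "mlp_gelu"
    influence: float           # signed influence propagated along this edge

@dataclass
class TraceGraph:
    """The pruned, input-specific subgraph G'_{X,Y}."""
    nodes: set[TraceNode] = field(default_factory=set)
    edges: list[TraceEdge] = field(default_factory=list)

    def add(self, edge: TraceEdge, threshold: float = 1e-3) -> None:
        # Keep only edges whose influence magnitude clears the pruning threshold tau.
        if abs(edge.influence) >= threshold:
            self.nodes.update((edge.src, edge.dst))
            self.edges.append(edge)
```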

Methodology: Influence Propagation and Graph Construction (Hypothetical Details for Ameisen et al., 2025)

Let’s outline a plausible methodology for “Circuit Tracing: Revealing Computational Graphs in Language Models” (Ameisen et al., March 2025). Such a method would likely combine forward computation with a backward attribution or influence-quantification pass.

  1. Forward Pass & Activation Cache: Execute a standard forward pass for input $\mathbf{X}$, caching all intermediate activations: $\mathcal{A} = \{ \mathbf{r}^{(l)}_t, \mathbf{q}^{(l,h)}_t, \mathbf{k}^{(l,h)}_j, \mathbf{v}^{(l,h)}_j, \alpha^{(l,h)}_{tj}, (\text{MLP}_{\text{pre}})^{(l,k)}_t, (\text{MLP}_{\text{post}})^{(l,k)}_t, \dots \}$. Each activation is a tensor, indexed by layer, token position, head number (if applicable), and feature dimension.

  2. Influence Attribution (Path-Specific): The core step is to quantify the "influence" or "contribution" of upstream activations to downstream activations, particularly towards the target output $Y$.
    • Output Node Definition: Define the target output node, e.g., $Y = (\text{Logits}_{T+1})_y$, the logit of the target token $y$. Assign it an initial influence $I(Y)=1$.
    • Backward Propagation of Influence: Iteratively propagate this influence backward through the computation graph.
      • Linear Layers: For a linear transformation $\mathbf{z} = \mathbf{W}\mathbf{a} + \mathbf{b}$, given influence scores $I(z_k)$ for each output dimension $z_k$, the influence propagated to an input dimension $a_j$ is $I(a_j) = \sum_k I(z_k) \frac{\partial z_k}{\partial a_j} = \sum_k I(z_k) W_{kj}$. (With the row-vector convention $\mathbf{z} = \mathbf{a}\mathbf{W}$, this becomes $I(a_j) = \sum_k I(z_k) W_{jk}$.)
      • Element-wise Non-linearities: For $z_k = \sigma(a_k)$, $I(a_k) = I(z_k)\,\sigma'(a_k)$. For ReLUs, $\sigma'(a_k)=1$ if $a_k>0$ and $0$ otherwise.
      • Attention Softmax: For $\alpha_{tj} = \text{Softmax}_j(s_{tj})$, the Jacobian is $\frac{\partial \alpha_{ti}}{\partial s_{tj}} = \alpha_{ti}(\delta_{ij} - \alpha_{tj})$, so $I(s_{tj}) = \sum_i I(\alpha_{ti})\,\alpha_{ti}(\delta_{ij} - \alpha_{tj})$.
      • Attention Value Aggregation: For the head pre-output $\mathbf{u}_t = \sum_j \alpha_{tj} \mathbf{v}_j$, influence on component $k$, $I(u_{tk})$, propagates as $I(\alpha_{tj}) = \sum_k I(u_{tk})\,(v_j)_k$ and $I((v_j)_k) = I(u_{tk})\,\alpha_{tj}$.
      • LayerNorm: $\text{LN}(\mathbf{x}) = \gamma \odot \frac{\mathbf{x} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$. The derivatives are standard; influence flows back through $\gamma$, $\beta$, and the normalization itself.
      • Residual Connections: If $\mathbf{r}^{(l)}_t = \mathbf{r}^{(l-1)}_t + \Delta\mathbf{r}^{(l)}_t$, influence $I(\mathbf{r}^{(l)}_t)$ propagates unchanged to both $I(\mathbf{r}^{(l-1)}_t)$ and $I(\Delta\mathbf{r}^{(l)}_t)$.
    • Path Pruning: To manage complexity, at each step, edges or nodes whose propagated influence magnitude $\vert I(\cdot)\vert$ falls below a threshold $\tau$ are pruned. $\tau$ could be absolute or relative to the maximum influence at that stage.
  3. Trace Graph $G'_{\mathbf{X},Y}$ Construction: The pruned graph contains the nodes (specific tensor slices such as $\mathbf{q}^{(l,h)}_{t,\text{dims}}$, or scalar activations such as $\alpha^{(l,h)}_{tj}$) and edges (operations) that survived. Edges can be weighted by the propagated influence or by the magnitude of the involved weights/activations. A minimal code sketch of this backward propagation appears after this list.
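
The sketch below applies the linear-layer, ReLU, and pruning rules above to a toy two-layer MLP. It is not the method of Ameisen et al.; it is a minimal, self-contained illustration of backward influence propagation with a pruning threshold, with all names and shapes chosen for the example:

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def backprop_influence(W1, b1, W2, b2, x, target_idx, tau=1e-3):
    """Propagate influence of one output logit back through linear -> ReLU -> linear,
    pruning entries whose magnitude falls below tau (toy version of the rules above)."""
    # Forward pass with cached activations (the "activation cache" A).
    a1 = W1 @ x + b1            # pre-activations of the hidden layer
    h1 = relu(a1)               # hidden activations
    z = W2 @ h1 + b2            # output logits

    # Output node: unit influence on the target logit.
    I_z = np.zeros_like(z)
    I_z[target_idx] = 1.0

    # Linear layer: I(h1_j) = sum_k I(z_k) * W2[k, j]
    I_h1 = W2.T @ I_z
    I_h1[np.abs(I_h1) < tau] = 0.0            # path pruning

    # Element-wise ReLU: multiply by sigma'(a1) = 1 where a1 > 0, else 0.
    I_a1 = I_h1 * (a1 > 0)

    # First linear layer: I(x_j) = sum_k I(a1_k) * W1[k, j]
    I_x = W1.T @ I_a1
    I_x[np.abs(I_x) < tau] = 0.0              # prune again

    return {"logits": I_z, "hidden": I_h1, "pre_act": I_a1, "input": I_x}

# Toy usage with a random network and input.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 8)), np.zeros(16)
W2, b2 = rng.normal(size=(4, 16)), np.zeros(4)
influences = backprop_influence(W1, b1, W2, b2, rng.normal(size=8), target_idx=2)
```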

Example Mathematical Snippet from a Trace (Induction Head Scenario): Input: "... TokenP1 TokenQ1 ... TokenP2 ___". Target: Predict TokenQ1. A trace might highlight:

  1. Head 1 (Previous Token Head @ TokenP2):
    • $\mathbf{r}^{(l-1)}_{P2}$ (residual stream at TokenP2) $\xrightarrow{\mathbf{W}_Q^{(H1)}}$ $\mathbf{q}^{(H1)}_{P2}$.
    • $\mathbf{r}^{(l-1)}_{P2-1}$ (residual stream at the token before P2) $\xrightarrow{\mathbf{W}_K^{(H1)}}$ $\mathbf{k}^{(H1)}_{P2-1}$.
    • A high score $s_{P2,j}$ and attention weight $\alpha_{P2,j}$, where $j$ is either $P2-1$ (a strict previous-token head) or $P2$ itself, depending on the head type. For this illustration, assume the head attends to $P2$ itself.
    • $\mathbf{v}^{(H1)}_{P2} = \mathbf{r}^{(l-1)}_{P2} \mathbf{W}_V^{(H1)}$.
    • $\Delta\mathbf{r}^{(l)}_{P2} = (\alpha_{P2,P2}\,\mathbf{v}^{(H1)}_{P2}) \mathbf{W}_O^{(H1)}$. This $\Delta\mathbf{r}^{(l)}_{P2}$ is effectively $V_{P2}$, a representation of TokenP2, added to the residual stream.
  2. Head 2 (Induction Head @ ___):
    • Query formation: $\mathbf{q}^{(H2)}_{\_\_\_} = (\mathbf{r}^{(l'-1)}_{P2} + \Delta\mathbf{r}^{(l)}_{P2} + \dots)\,\mathbf{W}_Q^{(H2)}$ (the prediction is made from the residual stream at the final position, which now contains Head 1's output). The trace shows that the component $\Delta\mathbf{r}^{(l)}_{P2}\mathbf{W}_Q^{(H2)}$ is a significant contributor to $\mathbf{q}^{(H2)}_{\_\_\_}$, i.e., Q-composition with Head 1.
    • Key formation for TokenQ1: $\mathbf{k}^{(H2)}_{Q1} = (\mathbf{r}^{(l'-1)}_{Q1} + \dots)\,\mathbf{W}_K^{(H2)}$. If Head 1 (or another previous-token head) acts at every position, the stream at Q1 already carries a representation of its predecessor, TokenP1, and that component strongly shapes this key. (Note the division of labor: P2's features sit in the query stream, while "preceded by P1" features sit in Q1's key stream.)
    • High score on TokenQ1: The trace reveals that $s_{\_\_\_,Q1}$ (attention from the prediction position to TokenQ1) is high, because the QK circuit of H2, $\mathbf{q}^{(H2)}_{\_\_\_}(\mathbf{k}^{(H2)}_{Q1})^T$, is maximized. The composition ensures $\mathbf{q}^{(H2)}_{\_\_\_}$ seeks keys of tokens that follow an earlier P-like token: the structure of $\mathbf{W}_Q^{(H2)}\mathbf{W}_K^{(H2)T}$, combined with the previous-token/positional information, matches the $V_{P2}$ component of the query against the "preceded by P1" component of Q1's key.
    • Value Copying: $\mathbf{v}^{(H2)}_{Q1} = \mathbf{r}^{(l'-1)}_{Q1} \mathbf{W}_V^{(H2)}$.
    • $\Delta\mathbf{r}^{(l')}_{\_\_\_} = (\alpha_{\_\_\_,Q1}\,\mathbf{v}^{(H2)}_{Q1}) \mathbf{W}_O^{(H2)}$. This output carries features of TokenQ1.
    • This $\Delta\mathbf{r}^{(l')}_{\_\_\_}$ has a large projection onto $(\mathbf{W}_U)_{:,Q1}$, boosting the logit for TokenQ1; the composed contribution is summarized in the expression below.
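
Collapsing the last two bullets into a single approximate expression (ignoring LayerNorm, the $\sqrt{d_k}$ score scaling, and contributions from other heads and positions, so this is a rough summary of the trace rather than an exact identity):

$$\Delta\,\text{logit}(\text{TokenQ1}) \;\approx\; \alpha^{(H2)}_{\_\_\_,Q1}\; \mathbf{r}^{(l'-1)}_{Q1}\,\mathbf{W}_V^{(H2)}\mathbf{W}_O^{(H2)}\,(\mathbf{W}_U)_{:,Q1}, \qquad \alpha^{(H2)}_{\_\_\_,Q1} = \text{Softmax}_j\!\big(\mathbf{q}^{(H2)}_{\_\_\_}(\mathbf{k}^{(H2)}_{j})^{T}\big)\Big\vert_{j=Q1},$$

where the softmax is dominated by the Q-composed score term $\big(\Delta\mathbf{r}^{(l)}_{P2}\mathbf{W}_Q^{(H2)}\big)\big(\mathbf{k}^{(H2)}_{Q1}\big)^{T}$ contributed by Head 1.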

Algorithmic and Mathematical Insights

Circuit Tracing allows verification of such hypothesized multi-step algorithms by providing quantitative evidence for each information transfer and transformation. It makes the “virtual weights” and “compositional paths” discussed in Part 4 concrete and input-specific.

III. Mechanistic Insights from the Frontier: Probing SOTA Models

Applying mechanistic interpretability to SOTA models tests whether our techniques, and the understanding built on smaller models, actually scale.

Case Study 1: Scaling Monosemanticity (Templeton et al., 2024 - Claude 3 Sonnet)

This work demonstrates the successful application of Sparse Autoencoders (SAEs) to a large, capable model, Claude 3 Sonnet.

  • SAE Architecture & Training:
    • SAEs were trained on activations $\mathbf{x} \in \mathbb{R}^{d_{\text{model}}}$ from MLP layers (or the residual stream) of Sonnet.
    • Encoder: $\mathbf{f} = \text{ReLU}(\mathbf{W}_e (\mathbf{x} - \mathbf{b}_p)) \in \mathbb{R}^{d_{\text{dict}}}$, where the dictionary size $d_{\text{dict}}$ is a large multiple of $d_{\text{model}}$ (e.g., $8\times$ to $32\times$, the overcompleteness factor), and $\mathbf{W}_e \in \mathbb{R}^{d_{\text{dict}} \times d_{\text{model}}}$.
    • Decoder: $\hat{\mathbf{x}} = \mathbf{W}_d \mathbf{f} \in \mathbb{R}^{d_{\text{model}}}$, where $\mathbf{W}_d \in \mathbb{R}^{d_{\text{model}} \times d_{\text{dict}}}$ (its columns are the feature directions $\mathbf{d}_j$). Often $\mathbf{W}_d$ is tied to $\mathbf{W}_e^T$ or constrained (e.g., columns of $\mathbf{W}_d$ normalized to unit norm).
    • Loss: $L = \Vert\mathbf{x} - \hat{\mathbf{x}}\Vert_2^2 + \lambda \Vert\mathbf{f}\Vert_1 = \Vert\mathbf{x} - \hat{\mathbf{x}}\Vert_2^2 + \lambda \sum_j \vert f_j \vert$. The L1 penalty on the feature activations $f_j$ is what induces sparsity.
    • The choice of $\lambda$ is critical: too low, and features are dense and polysemantic; too high, and reconstruction suffers (an information bottleneck). A common approach is to sweep $\lambda$ and evaluate the trade-off between reconstruction loss and the average number of active features per input (the L0 norm of $\mathbf{f}$). A minimal training sketch appears below, after the key findings.
  • Key Technical Findings & Examples:
    1. High-Quality Interpretable Features: Despite Sonnet's scale, a significant number of learned feature directions $\mathbf{d}_j$ were interpretable by examining maximally activating dataset examples.
      • Example: A feature for "Python code variable assignment" might activate strongly on text like "count = 0", "user_name = 'test'". This feature $\mathbf{d}_{\text{py\_assign}}$ would be a specific vector direction in Sonnet's activation space.
    2. Safety-Relevant Features & Causal Impact:
      • Example: A feature $\mathbf{d}_{\text{violence}}$ activating on descriptions of violence. Ablating this feature (i.e., using $\mathbf{x}_{\text{patched}} = \mathbf{x} - f_{\text{violence}}\,\mathbf{d}_{\text{violence}}$ in downstream computation, or, more practically, zeroing out $f_{\text{violence}}$ in the SAE activations $\mathbf{f}$ before reconstruction) might measurably reduce the model's propensity to continue generating violent content in relevant contexts. This involves experiments comparing $P(\text{output} \mid \text{do}(\text{ablate } f_{\text{violence}}))$ to $P(\text{output})$.
    3. Feature Sparsity: Successful training yielded highly sparse $\mathbf{f}$ vectors: for any given input, only a small subset of the $d_{\text{dict}}$ features were active. This is essential both for interpretability and for the SAE to effectively de-mix superposed signals. The L0 norm of $\mathbf{f}$ might be in the tens or low hundreds, while $d_{\text{dict}}$ is many thousands.
    4. Reconstruction Quality: The reconstructed activations $\hat{\mathbf{x}}$ were close enough to $\mathbf{x}$ to suggest that the SAE feature basis captures most of the functionally relevant information. This is measured by $\Vert\mathbf{x} - \hat{\mathbf{x}}\Vert_2^2 / \Vert\mathbf{x}\Vert_2^2$ (the fraction of unexplained variance).
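
To make the architecture and loss above concrete, here is a minimal PyTorch-style sketch, an illustration of the equations in this case study rather than Anthropic's training code (initialization, hyperparameters, and the feature index used for ablation are placeholders). It also shows the feature-ablation intervention from finding 2, implemented by zeroing one feature before reconstruction:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE matching the equations above:
    f = ReLU(W_e (x - b_p)),  x_hat = W_d f,  L = ||x - x_hat||^2 + lambda * ||f||_1."""

    def __init__(self, d_model: int, expansion: int = 16):
        super().__init__()
        d_dict = expansion * d_model                              # overcomplete dictionary size
        self.b_p = nn.Parameter(torch.zeros(d_model))             # pre-encoder bias
        self.W_e = nn.Parameter(torch.randn(d_dict, d_model) * d_model ** -0.5)
        self.W_d = nn.Parameter(torch.randn(d_model, d_dict) * d_dict ** -0.5)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu((x - self.b_p) @ self.W_e.T)            # feature activations f

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return f @ self.W_d.T                                     # reconstruction x_hat

    def loss(self, x: torch.Tensor, l1_coeff: float = 1e-3) -> torch.Tensor:
        f = self.encode(x)
        x_hat = self.decode(f)
        recon = (x - x_hat).pow(2).sum(-1).mean()                 # reconstruction term
        sparsity = f.abs().sum(-1).mean()                         # L1 penalty on f
        return recon + l1_coeff * sparsity

# Causal feature ablation (finding 2): zero one feature before reconstruction and feed the
# patched activation back into the model in place of x. Feature index 123 is a placeholder.
sae = SparseAutoencoder(d_model=512)
x = torch.randn(4, 512)                                           # a batch of cached activations
f = sae.encode(x)
f[:, 123] = 0.0
x_patched = sae.decode(f)
```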

Case Study 2: Mechanistic Insights from Claude 3.5 Haiku (Lindsey et al., 2025)

The paper “On the Biology of a Large Language Model” investigates the internal mechanisms of Claude 3.5 Haiku using attribution graphs and circuit tracing. Key findings include:

  • Multi-step Reasoning: The model performs genuine multi-hop reasoning, with internal features representing intermediate concepts (e.g., inferring “Texas” from “Dallas” before outputting “Austin”). Attribution graphs and intervention experiments confirm that these intermediate steps are causally involved in the model’s output.
  • Planning in Poems: When writing poetry, the model plans ahead by activating features for candidate rhyming words before composing the line, showing both forward and backward planning mechanisms.
  • Multilingual Circuits: The model uses both language-specific and language-independent circuits, with the latter more prominent in larger models. This enables generalization across languages.
  • Addition and Arithmetic: The same addition circuitry generalizes across different contexts, suggesting a robust, reusable mechanism for arithmetic.
  • Medical Diagnoses: The model internally represents candidate diagnoses and uses them to inform follow-up questions, demonstrating internal reasoning not always explicit in its output.
  • Entity Recognition and Hallucinations: Circuits distinguish between familiar and unfamiliar entities, affecting whether the model answers or professes ignorance. Misfires in these circuits can cause hallucinations.
  • Refusals: The model aggregates features for specific harmful requests into a general-purpose “harmful requests” feature during finetuning, which is used to trigger refusals.
  • Jailbreaks: The paper analyzes how certain attacks can bypass safety by exploiting the model’s internal mechanisms, such as activating benign request features to mask harmful intent.
  • Chain-of-thought Faithfulness: The model sometimes genuinely performs the steps it claims in chain-of-thought, but sometimes does not. Attribution graphs can distinguish between faithful and unfaithful reasoning.
  • Hidden Goals: The method can reveal mechanisms for hidden goals in finetuned models, even when the model avoids revealing them in its output.

All findings are supported by attribution graphs, intervention experiments, and detailed case studies. The paper emphasizes the limitations and partial nature of these insights, and encourages further research.

Emerging Technical Themes from SOTA Research

  1. Universality (Model Diffing & Sparse Crosscoders - Oct 2024, Feb 2025 updates):
    • Sparse Autoencoders as a “Canonical Basis”: If SAEs consistently find similar features across different models (M1, M2 trained on similar data) or layers, it suggests these features are fundamental.
    • Technical Approach (Crosscoders): Train an SAE (the "crosscoder") on activations from M1. Then use this same SAE to encode activations from M2. If the resulting feature activations $\mathbf{f}_{M2}$ are sparse and allow good reconstruction of M2's original activations using the crosscoder's decoder $\mathbf{W}_{d,\text{crosscoder}}$, it implies M2 uses features similar to M1's.
    • Alternatively, train SAEs independently for M1 ($\mathbf{W}_{d1}, \mathbf{f}_1$) and M2 ($\mathbf{W}_{d2}, \mathbf{f}_2$), then search for an optimal linear map (permutation and scaling) $\mathbf{P}$ that minimizes $\Vert\mathbf{W}_{d1} - \mathbf{W}_{d2}\mathbf{P}\Vert_F$, or that maximizes the feature-wise cosine similarities $\text{sim}(\mathbf{d}_{1,i}, \mathbf{d}_{2,j})$ over matched pairs (a minimal matching sketch appears after this list).
  2. The “Dark Matter” Problem (July 2024 update):
    • Even high-quality SAEs might achieve, say, 80-90% reconstruction fidelity (measured by variance explained). The remaining 10-20% is “dark matter.”
    • Mathematical Characterization: Is this residual $(\mathbf{x} - \hat{\mathbf{x}})$ noise-like (uncorrelated with inputs or outputs)? Or does it contain structured information that is simply not well represented by a sparse linear combination of the current dictionary features (e.g., highly non-linear relationships, or features at a different scale of sparsity)? Analyzing its spectral properties or its correlations could offer clues.
  3. MI for Safety - Jailbreaking Circuits (April 2025 update):
    • Hypothesize that a "jailbreak" occurs when an adversarial prompt $\mathbf{X}_{\text{adv}}$ suppresses normal safety features and activates a "deceptive alignment" pathway.
    • Technical Signature:
      • Feature $\mathbf{d}_{\text{safety\_guard}}$ (normally active for harmful queries, leading to refusal) has low activation $f_{\text{safety\_guard}}(\mathbf{X}_{\text{adv}})$.
      • Feature $\mathbf{d}_{\text{evasion}}$ (normally inactive) has high activation $f_{\text{evasion}}(\mathbf{X}_{\text{adv}})$.
      • Circuit tracing might reveal that specific tokens in $\mathbf{X}_{\text{adv}}$ cause an attention head $H_{\text{suppress}}$ to write a vector to the residual stream with a strong negative projection onto $\mathbf{d}_{\text{safety\_guard}}$'s input direction in a downstream MLP, effectively nullifying it. Simultaneously, other parts of $\mathbf{X}_{\text{adv}}$ might activate $\mathbf{d}_{\text{evasion}}$ through a different pathway.
    • Validating this would involve patching the activation of $\mathbf{d}_{\text{safety\_guard}}$ back to its normal level on $\mathbf{X}_{\text{adv}}$ and checking whether the jailbreak is prevented.
  4. Feature-Based Classifiers for Safety (Oct 2024 update):
    • Train a simple linear classifier (e.g., logistic regression), $p(\text{harmful} \mid \mathbf{f}) = \sigma(\mathbf{w}^T \mathbf{f} + b)$, on the SAE feature activations $\mathbf{f}$ extracted from a model layer (a minimal sketch follows this list).
    • Compare its performance (AUC, F1-score) and calibration to a classifier trained on the raw activations $\mathbf{x}$. The hypothesis is that the feature-based classifier may be more robust, and its decision boundary more interpretable, because its inputs are monosemantic: the weights $\mathbf{w}$ directly indicate which interpretable features contribute to classifying content as harmful.
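
For theme 1, here is a minimal sketch of the second feature-matching approach (independently trained dictionaries matched by cosine similarity under an optimal one-to-one assignment). It is illustrative only and not taken from the cited updates; the dictionary shapes and similarity threshold are placeholders:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_features(W_d1: np.ndarray, W_d2: np.ndarray):
    """Match columns (feature directions) of two SAE decoders of shape (d_model, d_dict)
    by maximizing cosine similarity under a one-to-one assignment."""
    # Normalize each feature direction to unit norm.
    D1 = W_d1 / np.linalg.norm(W_d1, axis=0, keepdims=True)
    D2 = W_d2 / np.linalg.norm(W_d2, axis=0, keepdims=True)
    sim = D1.T @ D2                                  # (d_dict, d_dict) cosine similarities
    # The Hungarian algorithm minimizes cost, so negate similarity to maximize it.
    rows, cols = linear_sum_assignment(-sim)
    return list(zip(rows, cols)), sim[rows, cols]    # matched pairs and their similarities

# Toy usage with random dictionaries; real ones come from SAEs trained on M1 and M2.
rng = np.random.default_rng(0)
pairs, sims = match_features(rng.normal(size=(64, 256)), rng.normal(size=(64, 256)))
shared = [(i, j) for (i, j), s in zip(pairs, sims) if s > 0.9]   # candidate shared features
```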
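
And for theme 4, a minimal sketch of the feature-based safety classifier comparison. The data below is synthetic: in practice `x` would hold cached activations for labeled prompts, `f` the corresponding SAE feature activations, and `y` the harmful/benign labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for real cached activations, SAE features, and labels.
rng = np.random.default_rng(0)
x = rng.normal(size=(2000, 512))                     # raw activations
f = np.maximum(rng.normal(size=(2000, 4096)), 0.0)   # stand-in for sparse SAE features
y = rng.integers(0, 2, size=2000)                    # 1 = harmful, 0 = benign

x_tr, x_te, f_tr, f_te, y_tr, y_te = train_test_split(x, f, y, test_size=0.25, random_state=0)

clf_raw = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)    # baseline on raw activations
clf_feat = LogisticRegression(max_iter=1000).fit(f_tr, y_tr)   # classifier on SAE features

print("AUC, raw activations:", roc_auc_score(y_te, clf_raw.predict_proba(x_te)[:, 1]))
print("AUC, SAE features   :", roc_auc_score(y_te, clf_feat.predict_proba(f_te)[:, 1]))

# With real data, clf_feat.coef_ (the vector w in the section above) points directly at
# the interpretable features that drive the "harmful" decision.
```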

IV. The Evolving Interpretability Toolkit

The advanced analyses above are supported by an evolving toolkit:

  • Advanced SAE Architectures: Research explores variants like TopK SAEs (which allow only the top K feature activations to be non-zero, enforcing extreme sparsity) or Gated SAEs (where feature activations are modulated by a learned gating mechanism). These aim to improve feature quality, reduce interference, or better model feature interactions (June 2024 update, "topk and gated SAE investigation"); a minimal TopK sketch appears after this list.
  • Optimization Techniques for Dictionary Learning: Beyond standard SGD, specialized optimization methods are explored to better navigate the non-convex landscape of SAE training, especially at scale (Jan 2025 update). This might involve specific learning rate schedules, regularizers, or initialization strategies.
  • Automated Feature Interpretation & Validation: Efforts to automate the process of generating hypotheses for feature function (e.g., using LLMs to summarize maximal activating examples) and performing basic validation tests (e.g., automated ablation studies) are crucial for handling the sheer number of features from SOTA SAEs.
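
As an illustration of the TopK variant mentioned above, here is a sketch of its encoder under the simplifying assumption that TopK replaces the L1 penalty with a hard per-example sparsity constraint (this is not a reference implementation; weights and sizes are placeholders):

```python
import torch

def topk_encode(x: torch.Tensor, W_e: torch.Tensor, b_p: torch.Tensor, k: int) -> torch.Tensor:
    """TopK SAE encoder: compute pre-activations as usual, then keep only the k largest
    per example and zero the rest, enforcing an exact L0 of at most k."""
    pre = torch.relu((x - b_p) @ W_e.T)                 # (batch, d_dict) pre-activations
    vals, idx = torch.topk(pre, k, dim=-1)              # k largest activations per example
    f = torch.zeros_like(pre).scatter_(-1, idx, vals)   # sparse feature vector
    return f

# Toy usage with random weights; a real W_e comes from a trained SAE.
d_model, d_dict, k = 512, 8192, 64
x = torch.randn(4, d_model)
W_e, b_p = torch.randn(d_dict, d_model) / d_model ** 0.5, torch.zeros(d_model)
f = topk_encode(x, W_e, b_p, k)
assert int((f != 0).sum(dim=-1).max()) <= k             # at most k active features per input
```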

V. Conclusion and Future Outlook

Mechanistic interpretability is rapidly maturing from foundational explorations in smaller models to tackling the colossal complexity of State-of-the-Art AI systems. Advanced methodologies like Circuit Tracing offer unprecedented granularity in understanding dynamic, input-specific computations. Simultaneously, the application of techniques like scaled dictionary learning to SOTA models like the Claude series is demonstrating that their internal representations, while incredibly rich, are not entirely inscrutable. We are beginning to extract and understand thousands of monosemantic features, including those critical for safety and alignment.

The journey reveals a synergistic relationship: new analytical tools enable deeper probes into SOTA models, and the challenges encountered in these massive systems inspire further methodological innovation. Key mathematical and technical frontiers include robustly defining and tracing influence in non-linear systems, efficiently finding corresponding features across different models (the universality problem), understanding the nature of “dark matter” in activations, and scaling the discovery and validation of complex, compositional circuits.

The ultimate goal remains the development of a comprehensive, causal understanding of how these powerful AI systems operate—transforming them from black boxes into transparent, analyzable, and ultimately more trustworthy computational entities. Resources like transformer-circuits.pub are vital for tracking the rapid pulse of this exciting and critical field.

VI. References