Seminar: Recent Advances in Mechanistic Interpretability

CHANGED: Thursday 16:15–17:45, Room "Leibniz" at DFKI

Dr. Simon Ostermann, Natalia Skachkova
Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI)
If you would like to participate, please send an email to mechanistic-interpretability-seminar@googlegroups.com by October 17 (23:59). In your email, please:
  • State your name, semester, and study program
  • Write some words on why you want to take part in this course
  • List some of your previous experience:
    • your background in deep learning or machine learning
    • your background in natural language processing in general
Prerequisites: This seminar is primarily targeted at Master students, but is also open to advanced Bachelor students. We expect you to have a curious mind and advanced familiarity with large language models. At the very least, we expect all students to have read (and understood :-)) the BERT paper and the Transformer paper.

Seminar Content

The rise of deep learning in AI has dramatically increased the performance of models across many sub-fields such as natural language processing and computer vision. In the last 5 years, large pretrained language models (LLMs) and their variants (BERT, ChatGPT, etc.) have changed the NLP landscape drastically. These models have grown larger and larger in recent years, reaching increasingly impressive performance peaks and sometimes even surpassing humans.

A central issue with deep learning models with millions or billions of parameters is that they are essentially black boxes: from the model's parameters alone, it is not clear why a model exhibits a certain behavior or makes a certain classification decision. The rapidly growing field of interpretable and explainable AI (XAI) develops methods to peek into this black box and understand the inner workings of such large models.

In this seminar we will investigate a subfield of XAI, namely Mechanistic Interpretability (MI). MI aims to understand and explain the internal workings of complex machine learning models, in particular deep neural networks, by "reverse-engineering" the mechanisms and computations encoded in a network's parameters. This involves analysing how specific components of a model contribute to its decisions and outputs. The aim is to break down the ‘black box’ nature of these models and reveal the underlying computations and internal representations that lead to predictions or behaviours.

We will concentrate on a range of basic methods in MI and then focus on their applications within natural language processing.
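To give a first impression of what such methods look like in practice, here is a minimal sketch of the logit lens (Nostalgebraist, 2020) from the reading list below. It assumes the Hugging Face transformers library and the public GPT-2 checkpoint; the prompt and the details are illustrative only and not part of the seminar material.

# Minimal logit-lens sketch: project each layer's hidden state of the last
# token through the model's final layer norm and unembedding matrix, and see
# which token the model would predict at that depth.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The Eiffel Tower is located in the city of"  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple of (n_layers + 1) tensors of shape (1, seq_len, d_model)
for layer, hidden in enumerate(outputs.hidden_states):
    last_token_state = model.transformer.ln_f(hidden[:, -1, :])
    logits = model.lm_head(last_token_state)
    top_token = tokenizer.decode(logits.argmax(dim=-1))
    print(f"layer {layer:2d}: {top_token!r}")

Inspecting the intermediate predictions in this way shows at which layer the model "settles" on its answer, and is a common starting point before the more elaborate methods covered in the seminar.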


List of relevant Papers and Topics (subject to changes)

Mechanistic Interpretability Methods
  • Vig et al. (2020) – Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias | paper
  • Nostalgebraist (2020) - Interpreting GPT: the logit lens | blog post
  • Geiger et al. (2021) – Causal Abstractions of Neural Networks | paper
  • Goldowsky-Dill et al. (2023) – Localizing Model Behavior with Path Patching | paper
  • Conmy et al. (2023) – Towards Automated Circuit Discovery for Mechanistic Interpretability | paper
Findings of Mechanistic Interpretability
  • Wang et al. (2023) – Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small | paper
  • Hanna et al. (2023) – How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model | paper
  • Prakash et al. (2024) – Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking | paper
  • Merullo et al. (2024) – Circuit Component Reuse Across Tasks in Transformer Language Models | paper
  • Nanda et al. (2023) – Progress measures for grokking via mechanistic interpretability | paper
  • Geva et al. (2023) - Dissecting Recall of Factual Associations in Auto-Regressive Language Models | paper
  • Merullo et al. (2024) - Talking Heads: Understanding Inter-layer Communication in Transformer Language Models | paper
  • Gould et al. (2024) – Successor Heads: Recurring, Interpretable Attention Heads In The Wild | paper
  • McDougall et al. (2024) – Copy Suppression: Comprehensively Understanding an Attention Head | paper
  • Gurnee et al. (2023) – Finding Neurons in a Haystack: Case Studies with Sparse Probing | paper
  • Geva et al. (2022) - Transformer Feed-Forward Layers Are Key-Value Memories | paper
  • Todd et al. (2024) – Function Vectors in Large Language Models | paper
  • Nanda et al. (2022) – Emergent Linear Representations in World Models of Self-Supervised Sequence Models | paper
  • Dai et al. (2022) - Knowledge Neurons in Pretrained Transformers | paper
Model Editing, Knowledge Location and Extraction
  • Cohen et al. (2023) - Crawling The Internal Knowledge-Base of Language Models | paper
  • Meng et al. (2022) - Locating and Editing Factual Associations in GPT | paper
  • Hase et al. (2023) - Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models | paper
  • Chintam et al. (2023) - Identifying and Adapting Transformer-Components Responsible for Gender Bias in an English Language Model | paper
  • Cohen et al. (2023) - Evaluating the Ripple Effects of Knowledge Editing in Language Models | paper
  • Hartvigsen et al. (2022) - Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors | paper
Sparse Autoencoders and Monosemanticity
  • Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (2023) | blog post
  • Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (2024) | blog post
  • Marks et al. (2024) - Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models | paper
  • Kissane et al. (2024) - Interpreting Attention Layer Outputs with Sparse Autoencoders | paper

Some words on grading: This seminar is meant to be as interactive as possible. Final grades will be based on students' presentations and (optional) term papers, but also on participation and discussion in class.

Participants are expected to prepare for each class by reading the relevant papers and, if necessary, doing additional background reading. Based on this preparation, they should be able to discuss the presented papers in depth and follow the relevant context during the discussion.