Software Project: Recent Advances in Mechanistic Interpretability

time/location TBD

Tanja Bäumel, Tatiana Anikina, Dr. Simon Ostermann
Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI)
Prerequisites: This seminar is targeted at Master students and will require students to implement mechanistic approaches on language models at a low level. We thus expect you to have a curious mind and advanced familiarity with large language models. If you would like to participate, you need to have attended the Mechanistic Interpretability seminar or otherwise prove to us that you have the necessary knowledge to implement mechanistic approaches.
This is a software project that will take place during the semester break following the lecture period of the next winter semester, i.e. in February and March 2025. If you want to take part, please send an email to mechanistic-interpretability-seminar@googlegroups.com (deadline TBD). In your email, please:
  • Give your name, semester, and study program
  • Write a few words on why you want to take part in this course
  • List some of your previous experience:
    • your background in deep learning or machine learning
    • your background in natural language processing in general

Project Timeline

  • end of lectures: Feb 07, 2025
  • Intro Week (in person): Feb 17 - Feb 21
  • Weekly check in with groups (online): 4 weeks, Feb 24 - Mar 21
  • Final week (in person): Mar 24 - Mar 28
  • end of semester: Mar 31

Project Content

The rise of deep learning in AI has dramatically increased the performance of models across many sub-fields such as natural language processing and computer vision. In the last five years, large pretrained language models (LLMs) and their variants (BERT, ChatGPT, etc.) have changed the NLP landscape drastically. These models have grown larger and larger in recent years, reaching increasingly impressive performance peaks, sometimes even surpassing humans.

A central issue with deep learning models with millions or billions of parameters is that they are essentially black boxes: from the model's parameters alone, it is not clear why a model exhibits a certain behavior or makes a certain classification decision. The rapidly growing field of interpretable and explainable AI (XAI) develops methods to peek into the black box that LLMs are, trying to understand the inner workings of such large models.

In this seminar we will investigate a subfield of XAI, namely Mechanistic Interpretability (MI). MI aims to understand and explain the internal workings of complex machine learning models, in particular deep neural networks, by "reverse-engineering" the mechanisms and computations encoded in a network's parameters. This involves analyzing how specific mechanisms in the model contribute to its decisions and outputs. The aim is to break down the "black box" nature of these models and reveal the underlying computations and internal representations that lead to predictions or behaviors.

Students in the course will be introduced to a range of toolkits and methods within MI and will pick their own topic (individually or in groups of two). They will investigate a language model of their choice with a technique of their choice for a specific phenomenon, and submit both code for reproducing their experiments and a project report.
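As an illustration of the kind of low-level access such toolkits provide, the following minimal sketch applies the so-called "logit lens" to GPT-2 small using the TransformerLens library: it projects the residual stream after each layer onto the vocabulary to watch the model's prediction take shape layer by layer. The choice of toolkit, model, and prompt here is purely illustrative and not a course requirement; projects are free to use any model and technique.

    # Minimal "logit lens" sketch with TransformerLens (illustrative choice,
    # not a course requirement): project the residual stream after each layer
    # onto the vocabulary and inspect the top prediction at the last position.
    from transformer_lens import HookedTransformer

    model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 small

    prompt = "The Eiffel Tower is located in the city of"
    tokens = model.to_tokens(prompt)

    # Run the model and cache all intermediate activations.
    logits, cache = model.run_with_cache(tokens)

    for layer in range(model.cfg.n_layers):
        resid = cache["resid_post", layer][:, -1:, :]        # [batch, 1, d_model]
        layer_logits = model.unembed(model.ln_final(resid))  # [batch, 1, d_vocab]
        top_id = layer_logits.argmax(dim=-1).item()
        print(f"layer {layer:2d}: top prediction = {model.to_string(top_id)!r}")

With a few lines of this kind one can already trace at which layer the eventual answer becomes the model's top prediction; more involved techniques such as activation patching build on the same caching and hooking interface.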

Projects can (but don't have to) be based on Neel Nanda's list of 200 open problems in mechanistic interpretability.