Seminar XPLN: Exploring Explainability in NLP

Wed 14:15–15:45, Room -1.05

Dr. Simon Ostermann, Tanja Bäumel
Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI)
If you would like to participate, please send an email to xpln-seminar@dfki.de by April 17 (23:59).
In your email, please:
  • Give your name, semester, and study program
  • Write some words on why you want to take part in this course
  • List some of your previous experience:
    • your background in deep learning or machine learning
    • your background in natural language processing in general
Prerequisites: This seminar is primarily targeted at Master's students, but it is also open to advanced Bachelor's students. We expect you to have a curious mind and some familiarity with large language models. At the very least, we expect all students to have read (and understood :-)) the BERT paper and the Transformer paper.

Seminar Content

The rise of deep learning in AI has dramatically increased the performance of models across many sub-fields such as natural language processing and computer vision. In the last five years, large pretrained language models (LLMs) and their variants (BERT, ChatGPT, etc.) have changed the NLP landscape drastically. These models have grown larger and larger, reaching increasingly impressive performance peaks and sometimes even surpassing humans.

A central issue with deep learning models with millions or billions of parameters is that they are essentially black boxes: from the model's parameters alone, it is not clear why a model exhibits a certain behavior or makes a certain classification decision. Understanding the inner workings of such large models is, however, extremely important, especially when AI takes on critical tasks, e.g., in the medical or financial domain. Trustworthiness and fairness are important dimensions that such large models should adhere to, yet they are often not taken into account.

In this seminar, we will shine a spotlight on the rapidly growing field of interpretable and explainable AI (XAI), which develops methods to peek into the black box that LLMs are. We will introduce general methods used in XAI and look into insights gained from applying these methods to LLMs. Depending on the students' preferences, we will cover some of the topics listed below.
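
To give a concrete flavour of the kind of methods we will discuss, here is a minimal, illustrative sketch of one such technique, a probing classifier: a simple linear model trained to predict a linguistic property from frozen BERT representations, so that high probe accuracy suggests the property is (linearly) recoverable from the representations. The toy sentences, the binary "plural subject" label, and the use of the [CLS] vector are illustrative assumptions and not part of the seminar material.

  # Minimal probing sketch (illustrative only): train a linear classifier on
  # frozen BERT sentence representations to test whether a linguistic property
  # is encoded in them. The toy sentences and labels below are placeholders.
  import torch
  from transformers import AutoTokenizer, AutoModel
  from sklearn.linear_model import LogisticRegression

  tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
  model = AutoModel.from_pretrained("bert-base-uncased")
  model.eval()

  sentences = ["The cat sleeps.", "The cats sleep.", "A dog barks.", "Dogs bark."]
  labels = [0, 1, 0, 1]  # toy property: does the sentence have a plural subject?

  with torch.no_grad():
      enc = tokenizer(sentences, padding=True, return_tensors="pt")
      hidden = model(**enc).last_hidden_state   # shape: (batch, seq_len, hidden_dim)
      features = hidden[:, 0, :].numpy()        # one [CLS] vector per sentence

  # The probe is deliberately simple (linear), so that its accuracy reflects
  # what the frozen representation encodes rather than what the probe can learn.
  probe = LogisticRegression(max_iter=1000).fit(features, labels)
  print("probe accuracy on training data:", probe.score(features, labels))

In practice, probing studies of course use held-out data, control tasks, or amnesic interventions (cf. the papers below) to check that the probe really measures what is encoded in the representation.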


List of relevant Papers and Topics (subject to minor changes)

Basic XAI Methods
  • Probing: What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties | paper
  • Amnesic Probing: Amnesic Probing: Behavioral Explanation with Amnesic Counterfactuals | paper
  • LIME: "Why Should I Trust You?": Explaining the Predictions of Any Classifier | paper
  • Attention Interpretation:
    • Is Attention Interpretable? | paper
    • Attention is not Explanation | paper
    • Attention is not not Explanation | paper
  • Axiomatic Attribution for Deep Networks | paper
Linguistic Interpretations of Language Models
  • Prompt Learning: Personalized Prompt Learning for Explainable Recommendation | paper
  • Syntax:
    • Open Sesame: Getting Inside BERT’s Linguistic Knowledge | paper
    • A Structural Probe for Finding Syntax in Word Representations | paper
  • Semantics:
    • What do you learn from context? Probing for sentence structure in contextualized word representations | paper
    • What’s in a Name? Are BERT Named Entity Representations just as Good for any other Name? | paper
  • (World) Knowledge:
    • Do Neural Language Representations Learn Physical Commonsense? | paper
    • Language Models as Knowledge Bases? | paper
    • BERT is Not a Knowledge Base (Yet): Factual Knowledge vs. Name-Based Reasoning in Unsupervised QA | paper
    • Do NLP Models Know Numbers? Probing Numeracy in Embeddings | paper
    • Language Models Represent Space and Time | paper
Mechanistic Interpretability
  • Basics:
    • Zoom In: An Introduction to Circuits | paper
    • A Mathematical Framework for Transformer Circuits | paper
    • Transformer Feed-Forward Layers Are Key-Value Memories | paper
  • Knowledge: Locating, Extracting, Editing & Forgetting:
    • Crawling The Internal Knowledge-Base of Language Models | paper
    • Locating and Editing Factual Associations in GPT | paper
    • Mass-Editing Memory in a Transformer | paper
    • Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models | paper
    • In-Context Unlearning: Language Models as Few Shot Unlearners | paper
  • World Representation:
    • Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task | paper
    • Emergent Linear Representations in World Models of Self-Supervised Sequence Models | paper
Interpreting Individual Neurons or Model Weights
  • Neuron Interpretations:
    • Knowledge Neurons in Pretrained Transformers | paper
    • Investigating the Encoding of Words in BERT’s Neurons Using Feature Textualization | paper
    • On the Pitfalls of Analyzing Individual Neurons in Language Models | paper
    • An Interpretability Illusion for BERT | paper
    • Finding Neurons in a Haystack: Case Studies with Sparse Probing | paper
  • Model Weights:
    • Analyzing Transformers in Embedding Space | paper
    • Prompt Waywardness: The Curious Case of Discretized Interpretation of Continuous Prompts | paper
    • Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space | paper
    • Backward Lens: Projecting Language Model Gradients into the Vocabulary Space | paper
    • The Hidden Space of Transformer Language Adapters | paper
Understanding and Reasoning
  • On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? | paper
  • Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data | paper
  • On the paradox of learning to reason from data | paper
  • Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models | paper
  • A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis | paper
  • Reasoning with Language Model is Planning with World Model | paper
  • Understanding the Reasoning Ability of Language Models From the Perspective of Reasoning Paths Aggregation | paper
Fairness and Bias
  • Measuring Fairness with Biased Rulers: A Comparative Study on Bias Metrics for Pre-trained Language Models | paper
  • Bigger Data or Fairer Data? Augmenting BERT via Active Sampling for Educational Text Classification | paper
  • Mitigating Language-Dependent Ethnic Bias in BERT | paper
  • Identifying and Adapting Transformer-Components Responsible for Gender Bias in an English Language Model | paper
  • Investigating Gender Bias in Language Models Using Causal Mediation Analysis | paper
Can Models Explain Models? On the Faithfulness of Explanations
  • Language models can explain neurons in language models | paper
  • Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models | paper

Some words on grading: This seminar is meant to be as interactive as possible. Final grades will be based on students' presentations and (optional) term papers, as well as on participation and discussion in class.

Participants are expected to prepare for each class by reading the relevant papers and, if necessary, doing additional background reading. Based on this preparation, they should be able to discuss the presented papers in depth and to follow the relevant context during the discussion.