This Special Track invites submissions that investigate the internal workings of neural networks to connect low-level computations with high-level, human-interpretable concepts. Mechanistic Interpretability (MI) aims to provide precise insight into how models process information, offering tools and frameworks for analyzing their behavior at a structural level.

Topics of interest include methods such as Sparse Autoencoders (SAEs), which learn sparse, overcomplete representations of neural activations. SAEs are effective for disentangling overlapping features and for addressing challenges such as superposition, where multiple features share the same neurons. Such techniques enable a clearer understanding of model internals and support tracing causal relationships between inputs, intermediate representations, and outputs.

The track also seeks work that explores how MI complements existing explainable AI (XAI) approaches. While XAI methods often focus on producing interpretable explanations of predictions, MI provides detailed analyses of the mechanisms underlying those predictions; both contribute to AI systems that are transparent, reliable, and aligned with human values. We particularly encourage submissions that advance the use of MI in practical applications, propose novel interpretability methods, or integrate MI with symbolic reasoning to derive actionable insights. Research bridging these areas will contribute to AI systems that can be deployed safely and ethically in critical domains.
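For illustration only, the sketch below shows the kind of sparse autoencoder referred to above: a small PyTorch module trained on a batch of activation vectors. The layer names, dictionary size, and L1 coefficient are assumptions made for the example, not requirements of the track.

```python
# Minimal sparse autoencoder (SAE) sketch, assuming PyTorch.
# It learns an overcomplete dictionary (d_dict >> d_model) over activation
# vectors and applies an L1 penalty so that only a few features fire per input.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # overcomplete encoder
        self.decoder = nn.Linear(d_dict, d_model)  # maps features back to activations

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # non-negative feature activations
        recon = self.decoder(features)
        return recon, features


def sae_loss(recon, acts, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the feature code.
    return (recon - acts).pow(2).mean() + l1_coeff * features.abs().mean()


if __name__ == "__main__":
    d_model, d_dict = 512, 4096  # illustrative sizes, not prescribed values
    sae = SparseAutoencoder(d_model, d_dict)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

    acts = torch.randn(256, d_model)  # stand-in for cached model activations
    for _ in range(100):
        recon, features = sae(acts)
        loss = sae_loss(recon, acts, features)
        opt.zero_grad()
        loss.backward()
        opt.step()
```

After training, each decoder column can be read as a candidate feature direction, and the sparse code indicates which features are active for a given input; this is how SAEs are typically used to pull apart features that superposition packs into shared neurons.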

  • Mechanistic Interpretability for bias mitigation and model alignment.
  • Feature visualization techniques for polysemantic neurons.
  • Identifying causal pathways in transformer attention heads.
  • Mechanistic analysis of circuits in multi-layer perceptrons.
  • Reverse-engineering token embeddings in language models.
  • Probing positional encoding mechanisms in transformers.
  • Using path patching to trace causal model computations (a minimal sketch follows this list).
  • Understanding induction heads in autoregressive transformers.
  • Evaluating neuron interpretability in overparameterized networks.
  • Analysis of layer-wise contribution to network outputs.
  • Mechanistic Interpretability in sparse transformer architectures.
  • Studying polysemanticity reduction via architectural constraints.
  • Quantifying feature superposition in feedforward layers.
  • Circuit-based tracing of logical operations in neural networks.
  • Investigating redundancy in neuron activations for interpretability.
  • Explaining emergent behavior in large-scale transformer models.
  • Decomposing attention patterns in transformer-based language models.
  • Mechanistic understanding of self-attention bottlenecks.
  • Analyzing the role of multi-head attention diversity.
  • Bridging sub-symbolic and symbolic representations.
  • Layer-wise dissection of context aggregation in transformers.
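As a concrete reference point for the path-patching topic above, the following sketch shows activation patching, the simplest member of the path-patching family, on a toy PyTorch model. The two-layer network, the inputs, and the choice of which units to patch are assumptions made purely for illustration.

```python
# Toy activation-patching sketch, assuming PyTorch forward hooks.
# Activations from a "clean" run are cached and spliced into a "corrupted"
# run at one intermediate site; the shift in the output measures how much of
# the behavior is mediated by the patched units.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
clean_x, corrupt_x = torch.randn(1, 8), torch.randn(1, 8)

cache = {}

def save_hook(module, inputs, output):
    # Cache the clean activation at the site of interest (post-ReLU).
    cache["act"] = output.detach()

def patch_hook(module, inputs, output):
    # Overwrite a subset of units with their cached clean values.
    patched = output.clone()
    patched[:, :4] = cache["act"][:, :4]
    return patched  # returning a tensor replaces the module's output

handle = model[1].register_forward_hook(save_hook)
clean_out = model(clean_x)
handle.remove()

corrupt_out = model(corrupt_x)

handle = model[1].register_forward_hook(patch_hook)
patched_out = model(corrupt_x)
handle.remove()

print("clean:  ", clean_out)
print("corrupt:", corrupt_out)
print("patched:", patched_out)  # how far the patch moves the output toward clean
```

Full path patching extends this idea by restricting the patch to activations flowing along a specific path between components, but the caching-and-splicing mechanics are the same.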