
This Special Track aims to advance Explainable AI (XAI) by investigating the internal processes of neural networks to provide deeper insight into their decision-making mechanisms. It will explore how Mechanistic Interpretability (MI) can bridge the gap between low-level neural computations and high-level, human-understandable concepts, enhancing transparency and trust.
A key focus will be on leveraging neuro-symbolic integration to combine neural network functionality with symbolic reasoning, providing clearer, more actionable explanations. Additionally, the track will address how mechanistic insights enable the tracing of causal relationships within models, supporting the development of techniques for model steering and alignment with human values and principles. By ensuring models are interpretable and aligned with ethical guidelines, this research contributes to creating reliable AI systems capable of operating safely in high-stakes environments.
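As a purely illustrative sketch of the kind of intervention-based steering in scope, the example below adds a hypothetical "concept" direction to a hidden activation of a toy PyTorch model via a forward hook. The model, layer choice, steering vector, and strength `alpha` are placeholders for this illustration, not a prescribed method.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a real network; in practice the hook would target a
# transformer block's residual stream.
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Linear(16, 4),
)

# Hypothetical "concept" direction in the hidden space, e.g. obtained by
# contrasting activations on examples that do and do not express the concept.
steering_vector = torch.randn(16)
alpha = 2.0  # steering strength (placeholder value)

def steer(module, inputs, output):
    # Returning a tensor from a forward hook replaces the layer's output,
    # shifting the hidden activation along the chosen direction.
    return output + alpha * steering_vector

handle = model[1].register_forward_hook(steer)  # hook the hidden activation

x = torch.randn(1, 8)
print("steered output: ", model(x))
handle.remove()  # removing the hook restores the unmodified model
print("original output:", model(x))
```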
Keywords: Multimodal representation learning, Mechanistic Interpretability, Scaling interpretability to large models, Disentangling emergent properties, Automated circuit discovery methods
Topics
- Mechanistic Interpretability for bias mitigation and model alignment.
- Metrics and benchmarks for Mechanistic Interpretability evaluation.
- Multimodal representation learning.
- Mechanistic Interpretability vs. concept-based learning.
- Disentangling emergent properties in deep learning architectures.
- Analyzing induction heads and in-context learning.
- Feature visualization techniques for polysemantic neurons.
- Automated circuit discovery methods.
- Challenges in scaling interpretability to large models.
- Investigating representational similarity across models and training runs.
