
The rapid evolution of machine learning has led to great advances in both the sciences and commercial applications. Despite these achievements, the increasing size and complexity of deep neural networks present a challenge to ensuring reliability, fairness, and robustness. This session addresses state-of-the-art techniques for debugging, modifying, and evaluating models with explainable AI (XAI), with the goal of enhancing their performance and trustworthiness. Topics include identifying and mitigating the effect of spurious correlations (the Clever Hans effect), diagnosing failure modes, and improving models through targeted corrections. For each of these points, XAI methods form a critical foundation. In addition, the session covers rigorous evaluation metrics that verify whether model enhancements have the intended effect. Join us to discuss best practices, emerging tools, techniques, and case studies highlighting how thoughtful model enhancement can lead to fair, robust, and responsible AI solutions.
Through this dedicated session, we have the potential to contribute significantly to the development of novel approaches, the enhancement of existing methods, and the exploration of practical applications of XAI-based model enhancement. By bringing together researchers working on these topics and showcasing the most recent research on model enhancement, the special session aligns with the broader objective of ensuring model reliability, fairness, and robustness.
Keywords: Model debugging, model improvement, model correction, spurious correlations, Clever Hans effect, shortcuts, shortcut learning, spurious features, model failure diagnosis, model evaluation, model correction evaluation, adversarial vulnerabilities, concept activation vectors, unlearning, mechanistic interpretability for debugging, model correction robustness evaluation, concept dependence, spurious training data, spurious concepts, feature steering, activation patching, bias correction, training data correction.
List of topics
- Revealing spurious behavior: To handle large models and large datasets, automated approaches that effectively reveal spurious behavior and reduce the need for human inspection are important. Approaches could be based on, e.g., summarizing local explanations (Melody, PCX, SpRAy; a minimal clustering sketch follows this list), structuring latent spaces, or identifying outlier model components (DORA).
- Learning in the presence of spurious correlations: When, why, and how are spurious correlations learned?
- Automatically probing large models (e.g., LLMs): Probing and searching for the presence of spurious features (e.g., via SAE weights)
- Revealing spurious features in applications: Methods for finding spurious features using XAI, e.g., in medical settings, the natural sciences, or industrial applications
- Shortcuts in vision-language models: Analyses of shortcut learning in foundation models using XAI (e.g. automated attribution analysis, counterfactual explanations)
- Finding adversarial model vulnerabilities: Analyses of adversarial vulnerabilities using XAI
- Concept Activation Vectors: Reliable and faithful estimation of directions or regions in latent space that correspond to a (spurious or harmful) concept (TCAV, RCAV, signal-CAV; see the CAV sketch after this list)
- Model correction: Modification of models with shortcut behavior to reduce their reliance on spurious correlations (EditingClassifiers, P-ClArC)
- Unlearning learned spurious correlations e.g. via targeted fine-tuning (A-ClArC, DISC, RR-ClArC, DFR)
- Data cleaning and augmentation: Detection and cleaning of spurious training data (e.g., with generative models) or dataset augmentation as a model-agnostic way to correct models (e.g., DISC)
- Correcting social biases of foundation models (e.g. by last-layer modifications)
- Training-free/post-hoc model correction via pruning or modifications of model components using limited or synthetic data (e.g., by attribution-informed pruning criteria or by steering concept directions via P-ClArC; a simplified projection sketch follows this list)
- Mechanistic Interpretability for model debugging/correction: Understanding and modifying the role of individual model components via feature steering or activation patching (see the patching sketch after this list)
- Machine unlearning: Unlearning specific (groups) of data samples, e.g. for privacy preservation and regulatory compliance (e.g., FMD)
- Evaluating concept dependence: Advancing techniques to evaluate dependence on spurious concepts (e.g. TCAV)
- Model improvement evaluation benchmarks: Benchmarks that take into account full training, fine-tuning, or post-hoc approaches (e.g., Spawrious)
- Meta-evaluations: Evaluations of which evaluation techniques are reliable and which are not, and how evaluation techniques can be improved
- Advanced model improvement evaluation settings: Realistic or difficult benchmarks for evaluating the success of model correction, e.g., robustness to multiple spurious artifacts or performance in sparse data domains
- Model improvement evaluation techniques: Metrics beyond worst-group accuracy (sketched after this list), metrics for settings where labels are only partially available, and/or evaluations that more directly measure model reliance (e.g., using counterfactual explanations, generative models such as Fast-DiME, or influence functions as in FMD)
- XAI-based Model Correction Frameworks: Frameworks that allow revealing, revising, and evaluating models (e.g., Reveal2Revise, explanatory interactive machine learning, DISC)
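
As referenced above, a minimal sketch of the idea behind summarizing local explanations, in the spirit of SpRAy-style analysis: attribution maps are flattened, normalized, and clustered so that a reviewer only inspects a handful of cluster prototypes instead of every explanation. The attribution maps below are random placeholders; in practice they would come from an XAI method such as LRP.

```python
# Sketch: SpRAy-style summarization of local explanations via spectral clustering.
# Assumption: `attributions` holds one attribution map per sample (e.g., from LRP);
# random data is used here purely as a placeholder.
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
attributions = rng.normal(size=(200, 32, 32))      # placeholder attribution maps

# Flatten and normalize each map so clustering compares spatial strategies, not magnitudes.
X = attributions.reshape(len(attributions), -1)
X = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)

# Spectral clustering groups samples that the model "explains" in similar ways;
# an unusually coherent cluster can point to a Clever Hans strategy.
labels = SpectralClustering(n_clusters=5, affinity="nearest_neighbors",
                            n_neighbors=10, random_state=0).fit_predict(X)

# Inspect one prototype (the map closest to its cluster mean) per cluster.
for c in range(5):
    members = np.where(labels == c)[0]
    centroid = X[members].mean(axis=0)
    proto = members[np.argmin(np.linalg.norm(X[members] - centroid, axis=1))]
    print(f"cluster {c}: {len(members)} samples, prototype index {proto}")
```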
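
A minimal sketch of concept activation vector estimation in the spirit of TCAV, using synthetic placeholder data: a linear classifier separates layer activations of concept examples from random examples, its normal vector serves as the CAV, and a TCAV-style score counts how often the directional derivative of a class logit along the CAV is positive. The gradient array is a placeholder that would in practice be obtained by backpropagating the logit to the chosen layer.

```python
# Sketch: estimating a Concept Activation Vector (CAV) and a TCAV-style score.
# Assumption: `concept_acts` / `random_acts` are layer activations for concept vs.
# random inputs, and `grads` holds d(logit)/d(activation) for a set of test inputs.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 128
concept_acts = rng.normal(loc=0.5, size=(100, d))   # activations on concept examples
random_acts  = rng.normal(loc=0.0, size=(100, d))   # activations on random examples
grads        = rng.normal(size=(500, d))            # placeholder logit gradients

# The CAV is the (normalized) normal of a linear classifier separating the two sets.
X = np.vstack([concept_acts, random_acts])
y = np.array([1] * len(concept_acts) + [0] * len(random_acts))
clf = LogisticRegression(max_iter=1000).fit(X, y)
cav = clf.coef_.ravel()
cav /= np.linalg.norm(cav)

# TCAV-style score: fraction of inputs whose logit increases along the concept direction.
directional_derivatives = grads @ cav
tcav_score = float((directional_derivatives > 0).mean())
print(f"TCAV-style score: {tcav_score:.2f}")
```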
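
A simplified sketch of training-free correction by steering a concept direction, loosely inspired by projection-based approaches such as P-ClArC (which additionally uses reference statistics of clean samples): the activation component along a given concept direction is removed at inference time via a forward hook, without any retraining. The model, layer choice, and concept direction are illustrative placeholders.

```python
# Sketch: post-hoc suppression of a spurious concept direction via a forward hook,
# in the spirit of projection-based correction (e.g., P-ClArC). Toy placeholders only.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 3))
concept_dir = torch.randn(32)                 # e.g., a CAV for a spurious artifact
concept_dir = concept_dir / concept_dir.norm()

def project_out_concept(module, inputs, output):
    # Remove the activation component along the concept direction: a <- a - (a . v) v
    coeff = output @ concept_dir
    return output - coeff.unsqueeze(-1) * concept_dir

# Register the correction on the layer whose activations encode the artifact.
hook = model[1].register_forward_hook(project_out_concept)

x = torch.randn(4, 16)
corrected_logits = model(x)                   # forward pass with the concept removed
hook.remove()
print(corrected_logits)
```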
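
A minimal sketch of activation patching for mechanistic debugging, on a toy model: a component's activations are recorded on one input and patched into the forward pass on another, and the resulting shift in the output indicates how strongly that component carries the behavior of interest. In real analyses one would patch individual heads, neurons, or MLP blocks rather than a full layer output as done here.

```python
# Sketch: activation patching to probe the role of a single component.
# Record a component's activations on a "clean" input, patch them into the run on a
# "corrupted" input, and measure how much the output moves back. Toy model and data.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
clean, corrupted = torch.randn(1, 8), torch.randn(1, 8)

stored = {}

def record(module, inputs, output):
    stored["act"] = output.detach()

def patch(module, inputs, output):
    return stored["act"]                       # overwrite with the stored clean activation

layer = model[1]                               # the component under investigation

h = layer.register_forward_hook(record)
clean_out = model(clean)
h.remove()

baseline_out = model(corrupted)                # corrupted run, no intervention

h = layer.register_forward_hook(patch)
patched_out = model(corrupted)                 # corrupted run with clean activations patched in
h.remove()

# If patching this component restores the clean output, it is causally important
# for the behavior that differs between the two inputs.
print("effect of patching:", torch.norm(patched_out - baseline_out).item())
print("gap to clean output:", torch.norm(patched_out - clean_out).item())
```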
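
For context on the evaluation items above, a short sketch of worst-group accuracy, the baseline metric those items aim to extend: accuracy is computed per group (typically a pair of class label and spurious attribute) and the minimum is reported, so shortcut reliance is penalized even when average accuracy looks good. Group labels are assumed to be available here.

```python
# Sketch: worst-group accuracy as a baseline robustness metric.
# Groups are typically (class label, spurious attribute) pairs; the metric reports
# the accuracy of the worst-performing group rather than the average.
import numpy as np

def worst_group_accuracy(y_true, y_pred, groups):
    accs = {}
    for g in np.unique(groups):
        mask = groups == g
        accs[g] = float((y_pred[mask] == y_true[mask]).mean())
    return min(accs.values()), accs

# Toy example: predictions that look fine on average but fail on one group.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])   # e.g., (class, artifact-present) pairs
y_pred = np.array([0, 0, 1, 1, 1, 1, 1, 1])   # group 1 is always misclassified

wga, per_group = worst_group_accuracy(y_true, y_pred, groups)
print(f"average accuracy: {(y_pred == y_true).mean():.2f}, worst-group accuracy: {wga:.2f}")
```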




