Workshop

NeurIPS 2024

In Defence of Post-hoc Explainability

Galileo's telescope had “chromatic aberration”: an optical flaw that produced coloured halos that didn't exist. Strictly speaking, his instrument was broken, showing false information mixed with truth. Today's AI explanation methods (XAI) are our chromatic aberration: they distort what’s really happening inside the neural network, mixing false signals with real ones. But just as Galileo's flawed lens revealed Jupiter's moons, our flawed explanations could reveal nature's hidden patterns. Sometimes the right discovery comes through the wrong instrument. And that's precisely why these methods don't need to be computationally faithful to be scientifically useful.
The Crisis Nobody Talks About

Every major AI lab studying scientific problems faces the same uncomfortable truth. They use explanation methods to understand what their models discovered. They publish visualisations showing which features matter most. But mathematically, provably, these methods don't actually explain what the AI computed.

Recent proofs demonstrate that these explanation tools fail catastrophically. They perform worse than random guessing at identifying which features neural networks actually use. Run the same method twice with slightly different settings and you get completely different answers. Design an adversarial example and you can fool them trivially.

The mathematical proofs are unambiguous: these methods don't do what they claim. And yet researchers keep using them to generate new findings, which suggests that either we’re missing something about their hidden value, or our standards for discovery aren’t what we think they are.

Three Functions, One Paradox

Critics evaluate explanation methods by asking whether they accurately represent neural network computations. When they don't — and they rarely do — the methods are declared broken.

But scientists aren't trying to understand neural networks. They're trying to understand reality. The neural network is just a sophisticated pattern detector that notices things humans miss.

Consider three nested functions.

  • Reality: How nature actually works
  • Model: What the neural network learned (millions of parameters we can't interpret)
  • Explanation: What our methods extract (simplified patterns humans can grasp)

Critics obsess over whether explanations faithfully represent models. But what if explanations can be wrong about models while revealing truths about reality?
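
One way to make that question concrete (the notation below is mine, not the paper's) is to write the three functions out and ask which gap a given critique actually measures:

    % Three nested functions, in illustrative notation (not taken from the paper):
    %   f^*      -- reality: the true input-output relationship in nature
    %   f_theta  -- the model: a learned approximation with parameters theta
    %   e        -- the explanation: a human-readable summary extracted from the model
    \[
      f^{*}:\mathcal{X}\to\mathcal{Y}, \qquad
      f_{\theta}\approx f^{*}, \qquad
      e = E(f_{\theta}).
    \]
    % Faithfulness critiques measure the model-explanation gap d(e, f_theta);
    % scientific usefulness asks about the reality-explanation gap d(e, f^*).
    % Nothing in the definitions forces the two to shrink together:
    \[
      d\bigl(e, f_{\theta}\bigr)\ \text{large}
      \;\not\Rightarrow\;
      d\bigl(e, f^{*}\bigr)\ \text{large}.
    \]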

When Broken Methods Find Real Knowledge

A recent study from the University of Oxford, Google DeepMind and Goodfire AI demonstrates that even when explanation methods misrepresent a model's computations, they can still reveal novel insights about the world. Researchers extracted chess concepts from AlphaZero, an AI that mastered chess through self-play, without human knowledge. Their extraction method used convex optimisation to find concept vectors in AlphaZero's neural representations.

These vectors don't precisely represent AlphaZero's actual computations. The extraction process compresses millions of parameters into sparse vectors through lossy approximations. By the critics' standards, this method fails.
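
To make the shape of such an extraction concrete, here is a minimal sketch in Python, assuming a synthetic activation matrix, made-up concept labels and an arbitrary regularisation strength; it illustrates sparse, convex concept probing in general, not the study's actual pipeline:

    # Sketch of sparse concept-vector extraction from a layer's activations.
    # Everything here is synthetic: `activations` stands in for a network layer's
    # outputs over a set of positions, `has_concept` for curated concept labels.
    # This illustrates the general technique, not the AlphaZero study's method.
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n_positions, n_units = 200, 512
    activations = rng.normal(size=(n_positions, n_units))

    # Pretend the concept really lives in 5 of the 512 units, plus label noise.
    true_direction = np.zeros(n_units)
    true_direction[:5] = 1.0
    has_concept = (activations @ true_direction
                   + rng.normal(scale=0.5, size=n_positions) > 0).astype(float)

    # An L1-penalised linear probe is a convex problem with a sparse solution:
    # most coordinates of the fitted vector are driven to exactly zero, so the
    # "explanation" keeps a handful of units and discards everything else.
    probe = Lasso(alpha=0.1).fit(activations, has_concept)
    concept_vector = probe.coef_

    kept = np.flatnonzero(concept_vector)
    print(f"concept vector keeps {kept.size} of {n_units} units")

    # Projecting positions onto the concept vector scores how strongly each one
    # expresses the concept; high scorers become candidate teaching positions.
    scores = activations @ concept_vector

That sparsity is the lossy approximation in question: almost everything the network computes is thrown away, and that loss is exactly what makes the result small enough for a person to study.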

Yet when these “broken” concept vectors were used to generate training puzzles for human experts, including former world champions Vladimir Kramnik and Hou Yifan, something unexpected happened. One grandmaster improved their performance by up to 42% after studying positions that exemplified these extracted concepts.

One concept involved quiet moves that improve piece positioning while preparing unconventional long-term sacrifices. Grandmaster commentary described these as “clever”, “not natural”, and containing ideas that were “hard to spot” even after seeing the solution. The concepts combined known chess principles in novel ways that violated conventional strategic thinking.

The extraction method failed to faithfully represent AlphaZero's computations. But it succeeded in identifying patterns that expanded human chess knowledge at the highest level.

Bounded Truth

This phenomenon — where broken explanations generate valid insights — isn't a fluke. It reflects how knowledge transfers between incompatible forms of intelligence through what we call “mediated understanding”.

The explanation methods don't directly reveal what neural networks compute. Instead, they mediate between two incompatible systems: AI that processes information in thousands of dimensions, and humans who think in causes and stories. This mediation is necessarily lossy, approximate, often wrong. But it creates a bridge where none should exist.

Consider how this works in practice. When AlphaZero's millions of parameters get compressed into concept vectors, the compression destroys most information. The resulting explanation operates within what we might call “bounded factivity” — it's only true within specific limits and contexts. The explanation isn't universally correct; it's correct enough, often enough, to be useful.
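
Pinned down slightly more formally (again in notation of my own, treating the explanation as a simplified predictor), bounded factivity asks only for agreement with reality on a restricted domain:

    % Bounded factivity, sketched in illustrative notation: the explanation e must
    % agree with reality f^* only on a restricted domain D, up to a tolerance
    % epsilon, with no guarantee at all outside D.
    \[
      \Pr_{x\sim\mathcal{D}}\bigl[\, e(x)\ \text{agrees with}\ f^{*}(x) \,\bigr]
      \;\ge\; 1-\varepsilon .
    \]

For the chess concepts, the domain is roughly the specific kinds of positions the vectors were extracted from; outside it, the concepts claim nothing.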

[Diagram: Bounded Truth]

This mirrors how science has always progressed. Newtonian mechanics is bounded — it works at certain scales, fails at others. The ideal gas law assumes molecules have no volume — obviously false, yet useful within bounds. These aren't compromises; they're features. By accepting bounded truth instead of demanding complete fidelity, we enable discovery.

The AlphaZero study demonstrates this perfectly. The extracted concepts weren't faithful to the AI's computations, but they successfully mediated between machine intelligence and human understanding. The bounds were clear: these patterns apply to specific chess positions under particular conditions. Within those bounds, grandmasters learned something real.

Embracing Broken Tools

We're entering an era where AI will be involved in most scientific discoveries. Models are finding patterns that humans would never notice in everything from protein folding to climate dynamics. We need ways to extract insights from these systems.

The crisis isn't that explanation methods are imperfect. It's that we expected perfection from a task that's fundamentally about translating between incompatible forms of understanding. Every scientific instrument makes tradeoffs. Optical telescopes can't see radio waves. Electron microscopes kill living samples. Explanation methods sacrifice computational fidelity for human insight.

The AlphaZero study proves this isn't just philosophy — it's practical reality. Broken extraction methods helped world champions improve at a game humanity has studied for centuries. If imperfect methods can advance human knowledge in chess, imagine what they might reveal in genomics, materials science, or medicine.

“In Defence of Post-hoc Explainability” argues that computational faithfulness matters less than scientific usefulness. When explanation methods fail mathematically but succeed scientifically, we should care more about the science. Read the full paper for the complete framework.

Built with pragmatic incorrectness at socius: Experimental Intelligence Lab.