NeurIPS 2024 Workshop

Every major AI lab studying scientific problems faces the same uncomfortable truth. They use explanation methods to understand what their models discovered. They publish visualisations showing which features matter most. But mathematically, provably, these methods don't actually explain what the AI computed.
Recent proofs demonstrate that these explanation tools can fail catastrophically. They can perform worse than random guessing at identifying which features a neural network actually relies on. Run the same method twice with slightly different settings and you get completely different answers. Adversarial examples can be constructed that fool them trivially.
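To make the instability concrete, here is a toy sketch. It is not any published attribution method; the model, the feature count and the sampling budget are invented for illustration. It estimates feature importance by randomly masking inputs, and its ranking of the features changes when nothing but the random seed changes.

```python
# Toy demonstration only: a made-up black-box model and a crude
# perturbation-based importance estimator. Nothing here corresponds to a
# specific published explanation method.
import numpy as np

def model(x):
    # Fixed "black box": a nonlinear function of 8 input features.
    return np.tanh(x[0] * x[1]) + 0.5 * np.sin(x[2]) + 0.1 * x[3:].sum()

def perturbation_importance(x, n_samples, seed):
    """Estimate each feature's importance by randomly zeroing subsets of
    features and crediting the output change to every masked feature."""
    rng = np.random.default_rng(seed)
    base = model(x)
    scores = np.zeros(len(x))
    counts = np.zeros(len(x))
    for _ in range(n_samples):
        mask = rng.random(len(x)) < 0.5        # which features to zero out
        if not mask.any():
            continue
        change = abs(base - model(np.where(mask, 0.0, x)))
        scores[mask] += change
        counts[mask] += 1
    return scores / np.maximum(counts, 1)

x = np.array([0.9, -1.2, 0.4, 0.3, -0.7, 0.2, 1.1, -0.5])
for seed in (0, 1):
    ranking = np.argsort(-perturbation_importance(x, n_samples=20, seed=seed))
    print(f"seed {seed}: feature ranking {ranking}")
```

With such a small sampling budget the two seeds will typically disagree about which features matter most, the same run-to-run inconsistency described above.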
The mathematical proofs are unambiguous. These methods don't do what they claim. And yet, researchers keep using them to generate new findings — suggesting that either we’re missing something about their hidden value, or that our standards for discovery aren’t what we think they are.
Critics evaluate explanation methods by asking whether they accurately represent neural network computations. When they don't — and they rarely do — the methods are declared broken.
But scientists aren't trying to understand neural networks. They're trying to understand reality. The neural network is just a sophisticated pattern detector that notices things humans miss.
Consider three nested functions: the process in the world that generates the data, the neural network that approximates that process, and the explanation that approximates the network. Faithfulness critics judge only the inner link, how well the explanation matches the network. Scientists care about the outer one: whether the explanation tells them something true about the world.
A recent study from the University of Oxford, Google DeepMind and Goodfire AI demonstrates that even when explanation methods misrepresent a model's computations, they can still reveal novel insights about the world. Researchers extracted chess concepts from AlphaZero, an AI that mastered chess through self-play with no human game data. Their extraction method used convex optimisation to find concept vectors in AlphaZero's neural representations.
These vectors don't precisely represent AlphaZero's actual computations. The extraction process compresses millions of parameters into sparse vectors through lossy approximations. By the critics' standards, this method fails.
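The study's pipeline is described here only at a high level, so the following is a rough, hypothetical analogue rather than the paper's method: fit an L1-penalised linear probe on a layer's activations, a convex problem whose solution is a sparse vector standing in for a concept. The layer width, the synthetic activations and the concept labels below are all invented for illustration.

```python
# Hypothetical sketch, not the study's pipeline: recover a sparse "concept
# vector" from internal activations by solving a convex problem, here an
# L1-penalised logistic regression. Activations and labels are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 256                                  # width of a hypothetical layer
true_direction = np.zeros(d)
true_direction[:5] = 1.0                 # pretend the concept uses 5 dimensions

# Synthetic "activations" for positions that do / don't exhibit the concept.
activations = rng.normal(size=(2000, d))
labels = (activations @ true_direction + 0.5 * rng.normal(size=2000)) > 0

# The L1 penalty keeps the recovered vector sparse; the fit is convex.
probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
probe.fit(activations, labels)

concept_vector = probe.coef_.ravel()
print("non-zero weights:", np.count_nonzero(concept_vector), "of", d)
```

The sparsity is precisely the lossy step: most of the representation is thrown away so that what remains is small enough for a human to inspect.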
Yet when these “broken” concept vectors were used to generate training puzzles for human experts, including former world champions Vladimir Kramnik and Hou Yifan, something unexpected happened. One grandmaster improved their performance by up to 42% after studying positions that exemplified the extracted concepts.
One concept involved quiet moves that improve piece positioning while preparing unconventional long-term sacrifices. Grandmaster commentary described these as “clever”, “not natural”, and containing ideas that were “hard to spot” even after seeing the solution. The concepts combined known chess principles in novel ways that violated conventional strategic thinking.
The extraction method failed to faithfully represent AlphaZero's computations. But it succeeded in identifying patterns that expanded human chess knowledge at the highest level.
This phenomenon — where broken explanations generate valid insights — isn't a fluke. It reflects how knowledge transfers between incompatible forms of intelligence through what we call “mediated understanding”.
The explanation methods don't directly reveal what neural networks compute. Instead, they mediate between two incompatible systems: AI that processes information in thousands of dimensions, and humans who think in causes and stories. This mediation is necessarily lossy, approximate, often wrong. But it creates a bridge where none should exist.
Consider how this works in practice. When AlphaZero's millions of parameters get compressed into concept vectors, the compression destroys most information. The resulting explanation operates within what we might call “bounded factivity” — it's only true within specific limits and contexts. The explanation isn't universally correct; it's correct enough, often enough, to be useful.
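One hypothetical way to write the idea down, in illustrative notation rather than the paper's definitions: an explanation E of a phenomenon f is factive only on a bounded domain D, up to a tolerance ε, with no guarantee outside it.

```latex
% Illustrative notation only, not the paper's formal definition.
\[
  \text{$E$ is boundedly factive for $f$ on $D$}
  \quad\Longleftrightarrow\quad
  \forall x \in D:\ \lvert E(x) - f(x) \rvert \le \varepsilon,
\]
\[
  \text{with no guarantee claimed for } x \notin D.
\]
```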

This mirrors how science has always progressed. Newtonian mechanics is bounded — it works at certain scales, fails at others. The ideal gas law assumes molecules have no volume — obviously false, yet useful within bounds. These aren't compromises; they're features. By accepting bounded truth instead of demanding complete fidelity, we enable discovery.
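The gas-law example can be made exact. The van der Waals equation restores the molecular volume term b and the intermolecular attraction term a that the ideal gas law discards; set both to zero and the false-but-useful idealisation reappears.

```latex
% Ideal gas law versus the van der Waals correction; setting a = b = 0
% recovers the idealisation that molecules have no volume and no attraction.
\[
  PV = nRT
  \qquad\text{vs.}\qquad
  \left(P + \frac{a n^2}{V^2}\right)\!\left(V - nb\right) = nRT .
\]
```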
The AlphaZero study demonstrates this perfectly. The extracted concepts weren't faithful to the AI's computations, but they successfully mediated between machine intelligence and human understanding. The bounds were clear: these patterns apply to specific chess positions under particular conditions. Within those bounds, grandmasters learned something real.
We're entering an era where AI will be involved in most scientific discoveries. Models are finding patterns in everything from protein folding to climate dynamics that humans would never notice. We need ways to extract insights from these systems.
The crisis isn't that explanation methods are imperfect. It's that we expected perfection from a task that's fundamentally about translating between incompatible forms of understanding. Every scientific instrument makes tradeoffs. Optical telescopes can't see radio waves. Electron microscopes kill their living samples. Explanation methods sacrifice computational fidelity for human insight.
The AlphaZero study proves this isn't just philosophy — it's practical reality. Broken extraction methods helped world champions improve at a game humanity has studied for centuries. If imperfect methods can advance human knowledge in chess, imagine what they might reveal in genomics, materials science, or medicine.
“In Defence of Post-hoc Explainability” argues that computational faithfulness matters less than scientific usefulness. When explanation methods fail mathematically but succeed scientifically, we should care more about the science. Read the full paper for the complete framework.
Built with pragmatic incorrectness at socius: Experimental Intelligence Lab.

