Monday, January 05, 2026

Fisher information

 https://x.com/mushoku_swe/status/2008094896754974913 - connection between statistics and geometry

1. Why “variance of the score” is such a deep statement

Recall the score function:

s_\theta(x) = \nabla_\theta \log p(x \mid \theta)

This tells you: how much would I want to change the parameter if I saw this data point?
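To make this concrete, here is a minimal sketch (my own, not from the linked thread) for the simplest case: X ~ N(θ, σ²) with known σ, where the score has the closed form s_\theta(x) = (x - \theta)/\sigma^2.

    def score(x, theta, sigma=1.0):
        # Score of a 1D Gaussian N(theta, sigma^2) with respect to its mean:
        # d/dtheta log p(x | theta) = (x - theta) / sigma^2
        return (x - theta) / sigma**2

    print(score(3.0, theta=0.0))   # 3.0 -> this observation pulls theta strongly upward
    print(score(0.1, theta=0.0))   # 0.1 -> this observation barely moves theta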

Now, two key facts (under regularity conditions):

\mathbb{E}[s_\theta(X)] = 0
\mathcal{I}(\theta) = \mathbb{E}\left[s_\theta(X)\, s_\theta(X)^\top\right] = \mathrm{Var}\big(s_\theta(X)\big)

This is already remarkable: information is not about the size of the gradient, but about how much it fluctuates.
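A quick Monte Carlo check of both facts, under the same assumed Gaussian-mean model (my own sketch): the empirical mean of the score over samples from p(x | θ) should be near 0, and its variance should be near 1/σ², which is exactly the Fisher information for the mean.

    import numpy as np

    rng = np.random.default_rng(0)
    theta, sigma = 2.0, 0.5

    x = rng.normal(theta, sigma, size=200_000)   # samples from p(x | theta)
    s = (x - theta) / sigma**2                   # score evaluated at the true theta

    print(s.mean())   # ~ 0.0              (E[score] = 0)
    print(s.var())    # ~ 1 / sigma^2 = 4  (Var[score] = Fisher information)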


2. Intuition: why variance = information?

Think about two extremes:

Case A: Flat or uninformative model

If changing θ barely affects the likelihood, then:

  • The score is close to zero

  • Different samples produce almost the same score

  • Low variance ⇒ low Fisher information

You can’t really tell where θ is.

Case B: Sensitive model

If small changes in θ strongly affect the likelihood:

  • Different samples pull θ in noticeably different directions

  • The score varies a lot

  • High variance ⇒ high Fisher information

The data “pushes back” strongly when θ is wrong.

So information is literally:

How violently does the model react to parameter changes across samples?
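The two cases can be seen side by side in a small sketch (my own, same Gaussian-mean model): the only thing that changes between Case A and Case B is σ, and the score variance, i.e. the Fisher information, changes with it.

    import numpy as np

    rng = np.random.default_rng(1)
    theta = 0.0

    for sigma, label in [(10.0, "Case A (flat)"), (0.1, "Case B (sensitive)")]:
        x = rng.normal(theta, sigma, size=100_000)
        s = (x - theta) / sigma**2               # score of each sample
        print(f"{label:20s} Var(score) = {s.var():10.4f}   (1/sigma^2 = {1/sigma**2:10.4f})")

With the wide likelihood the score variance is about 0.01; with the narrow one it is about 100.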


3. Why this becomes geometry (not just statistics)

Here’s the leap that information geometry makes:

Instead of thinking of θ as a point in ℝⁿ, think of each θ as a probability distribution.

Now ask:

How different are p(x \mid \theta) and p(x \mid \theta + d\theta)?

The answer (to second order) is:

\mathrm{KL}(p_\theta \,\|\, p_{\theta + d\theta}) \approx \frac{1}{2}\, d\theta^\top \mathcal{I}(\theta)\, d\theta

That means:

  • Fisher information is the local quadratic form

  • It defines an inner product

  • Which defines a Riemannian metric

So the geometry isn't metaphorical: Fisher information is a literal Riemannian metric on the statistical manifold.
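Here is a numerical check of that second-order expansion (my own sketch, using a Bernoulli(θ) model rather than anything from the post, where \mathcal{I}(\theta) = 1/(\theta(1-\theta)) and the KL divergence has a simple closed form): as dθ shrinks, the exact KL converges to the quadratic form ½ I(θ) dθ².

    import numpy as np

    theta = 0.3
    fisher = 1.0 / (theta * (1 - theta))       # Fisher information of Bernoulli(theta)

    def kl_bernoulli(p, q):
        # KL( Bernoulli(p) || Bernoulli(q) )
        return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

    for dtheta in [0.1, 0.01, 0.001]:
        exact = kl_bernoulli(theta, theta + dtheta)
        quad = 0.5 * fisher * dtheta**2        # (1/2) dtheta * I(theta) * dtheta
        print(f"dtheta={dtheta:6.3f}   KL={exact:.8f}   quadratic form={quad:.8f}")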


4. Score variance = curvature (intuitively)

Another way to say it:

  • The score is a tangent vector on the manifold of distributions

  • Its variance tells you how “spread out” these tangent vectors are

  • High spread ⇒ sharp curvature

  • Low spread ⇒ flat geometry

Flat geometry = parameters are hard to distinguish
Curved geometry = parameters are sharply identifiable

This is why:

  • Natural gradient rescales the gradient by \mathcal{I}^{-1}

  • It takes steps that are steepest-descent in the Fisher metric on distribution space, not in raw Euclidean parameter space (see the sketch just below)
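A tiny illustration of that rescaling (my own sketch, once more for the Gaussian mean, where \mathcal{I}(\theta) = 1/\sigma^2): the raw gradient of the average log-likelihood scales with 1/σ², so its size depends entirely on the parameterization, while the natural gradient I⁻¹∇ is just the distance from the current estimate to the sample mean, whatever σ is.

    import numpy as np

    rng = np.random.default_rng(2)
    true_mean, theta = 1.0, 0.0                  # theta is the current estimate of the mean

    for sigma in [0.1, 1.0, 10.0]:
        x = rng.normal(true_mean, sigma, size=50_000)
        grad = np.mean((x - theta) / sigma**2)   # gradient of the average log-likelihood
        fisher = 1.0 / sigma**2                  # Fisher information for the mean
        nat_grad = grad / fisher                 # natural gradient = I^{-1} * grad
        print(f"sigma={sigma:5.1f}   raw grad={grad:10.4f}   natural grad={nat_grad:7.4f}")

The raw gradient swings over four orders of magnitude as σ changes, but the natural gradient step is about 1.0 every time: it moves the estimate toward the sample mean on the same scale regardless of parameterization.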


5. Statistics isn't just algebra, but geometry

What usually comes after this realization is:

  • Understanding KL divergence as distance-like

  • Seeing MLE as projection

  • Seeing exponential families as flat manifolds

  • Seeing why Euclidean intuition fails for probability spaces

  • Walking through a 1D Gaussian example where Fisher information literally equals curvature (a minimal version is sketched at the end of this post)

  • Explaining why exponential families are special geometrically

  • Connecting this to deep learning and natural gradients

  • Translating this intuition into pure geometry language
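For reference, that minimal 1D Gaussian version (a standard computation, with σ known): the Fisher information equals the negative second derivative of the log-likelihood in θ, i.e. its curvature, which here happens to be constant.

\log p(x \mid \theta) = -\frac{(x-\theta)^2}{2\sigma^2} + \text{const}, \qquad \partial_\theta \log p(x \mid \theta) = \frac{x-\theta}{\sigma^2}, \qquad \partial_\theta^2 \log p(x \mid \theta) = -\frac{1}{\sigma^2}

\mathcal{I}(\theta) = -\mathbb{E}\left[\partial_\theta^2 \log p(X \mid \theta)\right] = \frac{1}{\sigma^2} = \mathrm{Var}\!\left(\frac{X-\theta}{\sigma^2}\right) = \mathrm{Var}\big(s_\theta(X)\big)

So the curvature reading and the variance-of-the-score reading give the same number, 1/σ².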
