Monday, January 05, 2026

Fisher information

 https://x.com/mushoku_swe/status/2008094896754974913 - connection between statistics and geometry

1. Why “variance of the score” is such a deep statement

Recall the score function:

s_\theta(x) = \nabla_\theta \log p(x \mid \theta)

This tells you: how much would I want to change the parameter if I saw this data point?
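To make this concrete, here is a minimal sketch (my own, not from the linked thread) for the simplest case: X ~ N(θ, σ²) with known σ, where the score has the closed form s_\theta(x) = (x - \theta)/\sigma^2.

    def score(x, theta, sigma=1.0):
        # Score of a 1D Gaussian N(theta, sigma^2) with respect to its mean:
        # d/dtheta log p(x | theta) = (x - theta) / sigma^2
        return (x - theta) / sigma**2

    print(score(3.0, theta=0.0))   # 3.0 -> this observation pulls theta strongly upward
    print(score(0.1, theta=0.0))   # 0.1 -> this observation barely moves theta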

Now, two key facts (under regularity conditions):

\mathbb{E}[s_\theta(X)] = 0
\mathcal{I}(\theta) = \mathbb{E}\left[s_\theta(X)\, s_\theta(X)^\top\right] = \mathrm{Var}\big(s_\theta(X)\big)

This is already remarkable: information is not about the size of the gradient, but about how much it fluctuates.
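A quick Monte Carlo check of both facts, under the same assumed Gaussian-mean model (my own sketch): the empirical mean of the score over samples from p(x | θ) should be near 0, and its variance should be near 1/σ², which is exactly the Fisher information for the mean.

    import numpy as np

    rng = np.random.default_rng(0)
    theta, sigma = 2.0, 0.5

    x = rng.normal(theta, sigma, size=200_000)   # samples from p(x | theta)
    s = (x - theta) / sigma**2                   # score evaluated at the true theta

    print(s.mean())   # ~ 0.0              (E[score] = 0)
    print(s.var())    # ~ 1 / sigma^2 = 4  (Var[score] = Fisher information)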


2. Intuition: why variance = information?

Think about two extremes:

Case A: Flat or uninformative model

If changing θ barely affects the likelihood, then:

  • The score is close to zero

  • Different samples produce almost the same score

  • Low variance ⇒ low Fisher information

You can’t really tell where θ is.

Case B: Sensitive model

If small changes in θ strongly affect the likelihood:

  • Different samples pull θ in noticeably different directions

  • The score varies a lot

  • High variance ⇒ high Fisher information

The data “pushes back” strongly when θ is wrong.

So information is literally:

How violently does the model react to parameter changes across samples?
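The two cases can be seen side by side in a small sketch (my own, same Gaussian-mean model): the only thing that changes between Case A and Case B is σ, and the score variance, i.e. the Fisher information, changes with it.

    import numpy as np

    rng = np.random.default_rng(1)
    theta = 0.0

    for sigma, label in [(10.0, "Case A (flat)"), (0.1, "Case B (sensitive)")]:
        x = rng.normal(theta, sigma, size=100_000)
        s = (x - theta) / sigma**2               # score of each sample
        print(f"{label:20s} Var(score) = {s.var():10.4f}   (1/sigma^2 = {1/sigma**2:10.4f})")

With the wide likelihood the score variance is about 0.01; with the narrow one it is about 100.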


3. Why this becomes geometry (not just statistics)

Here’s the leap that information geometry makes:

Instead of thinking of θ as a point in ℝⁿ, think of each θ as a probability distribution.

Now ask:

How different are p(x \mid \theta) and p(x \mid \theta + d\theta)?

The answer (to second order) is:

\mathrm{KL}(p_\theta \,\|\, p_{\theta + d\theta}) \approx \frac{1}{2}\, d\theta^\top \mathcal{I}(\theta)\, d\theta

That means:

  • Fisher information is the local quadratic form

  • It defines an inner product

  • Which defines a Riemannian metric

So the geometry isn't metaphorical: Fisher information is a literal Riemannian metric on the statistical manifold.
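Here is a numerical check of that second-order expansion (my own sketch, using a Bernoulli(θ) model rather than anything from the post, where \mathcal{I}(\theta) = 1/(\theta(1-\theta)) and the KL divergence has a simple closed form): as dθ shrinks, the exact KL converges to the quadratic form ½ I(θ) dθ².

    import numpy as np

    theta = 0.3
    fisher = 1.0 / (theta * (1 - theta))       # Fisher information of Bernoulli(theta)

    def kl_bernoulli(p, q):
        # KL( Bernoulli(p) || Bernoulli(q) )
        return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

    for dtheta in [0.1, 0.01, 0.001]:
        exact = kl_bernoulli(theta, theta + dtheta)
        quad = 0.5 * fisher * dtheta**2        # (1/2) dtheta * I(theta) * dtheta
        print(f"dtheta={dtheta:6.3f}   KL={exact:.8f}   quadratic form={quad:.8f}")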


4. Score variance = curvature (intuitively)

Another way to say it:

  • The score is a tangent vector on the manifold of distributions

  • Its variance tells you how “spread out” these tangent vectors are

  • High spread ⇒ sharp curvature

  • Low spread ⇒ flat geometry

Flat geometry = parameters are hard to distinguish
Curved geometry = parameters are sharply identifiable

This is why:

  • Natural gradient rescales the gradient by \mathcal{I}^{-1}

  • It takes steps that are steepest-descent in the Fisher metric on distribution space, not in raw Euclidean parameter space (see the sketch just below)
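A tiny illustration of that rescaling (my own sketch, once more for the Gaussian mean, where \mathcal{I}(\theta) = 1/\sigma^2): the raw gradient of the average log-likelihood scales with 1/σ², so its size depends entirely on the parameterization, while the natural gradient I⁻¹∇ is just the distance from the current estimate to the sample mean, whatever σ is.

    import numpy as np

    rng = np.random.default_rng(2)
    true_mean, theta = 1.0, 0.0                  # theta is the current estimate of the mean

    for sigma in [0.1, 1.0, 10.0]:
        x = rng.normal(true_mean, sigma, size=50_000)
        grad = np.mean((x - theta) / sigma**2)   # gradient of the average log-likelihood
        fisher = 1.0 / sigma**2                  # Fisher information for the mean
        nat_grad = grad / fisher                 # natural gradient = I^{-1} * grad
        print(f"sigma={sigma:5.1f}   raw grad={grad:10.4f}   natural grad={nat_grad:7.4f}")

The raw gradient swings over four orders of magnitude as σ changes, but the natural gradient step is about 1.0 every time: it moves the estimate toward the sample mean on the same scale regardless of parameterization.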


5. Statistics isn't just algebra, but geometry

What usually comes after this realization is:

  • Understanding KL divergence as distance-like

  • Seeing MLE as projection

  • Seeing exponential families as flat manifolds

  • Seeing why Euclidean intuition fails for probability spaces

  • Walking through a 1D Gaussian example where Fisher information literally equals curvature (a minimal version is sketched at the end of this post)

  • Explaining why exponential families are special geometrically

  • Connecting this to deep learning and natural gradients

  • Translating this intuition into pure geometry language
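For reference, that minimal 1D Gaussian version (a standard computation, with σ known): the Fisher information equals the negative second derivative of the log-likelihood in θ, i.e. its curvature, which here happens to be constant.

\log p(x \mid \theta) = -\frac{(x-\theta)^2}{2\sigma^2} + \text{const}, \qquad \partial_\theta \log p(x \mid \theta) = \frac{x-\theta}{\sigma^2}, \qquad \partial_\theta^2 \log p(x \mid \theta) = -\frac{1}{\sigma^2}

\mathcal{I}(\theta) = -\mathbb{E}\left[\partial_\theta^2 \log p(X \mid \theta)\right] = \frac{1}{\sigma^2} = \mathrm{Var}\!\left(\frac{X-\theta}{\sigma^2}\right) = \mathrm{Var}\big(s_\theta(X)\big)

So the curvature reading and the variance-of-the-score reading give the same number, 1/σ².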
