https://x.com/mushoku_swe/status/2008094896754974913 - connection between statistics and geometry
1. Why “variance of the score” is such a deep statement
Recall the score function:
$$ s_\theta(x) = \nabla_\theta \log p(x \mid \theta) $$
This tells you: how much would I want to change the parameter if I saw this data point?
Now, two key facts (under regularity conditions):
- The score has mean zero: $\mathbb{E}_{x \sim p(\cdot \mid \theta)}[s_\theta(x)] = 0$
- The Fisher information is the variance of the score: $I(\theta) = \mathrm{Var}\big[s_\theta(x)\big] = \mathbb{E}\big[s_\theta(x)\, s_\theta(x)^\top\big]$
This is already remarkable: information is not about the size of the gradient, but about how much it fluctuates.
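Here is a minimal numerical sanity check of both facts, a sketch using a 1D Gaussian with known variance (the example and every name in it are my choice, not from the post): the empirical mean of the score is near zero, and its empirical variance matches the closed-form Fisher information $1/\sigma^2$.

```python
# Sketch: check "mean of score = 0" and "variance of score = Fisher information"
# for x ~ N(mu, sigma^2) with known sigma, so the only parameter is theta = mu.
# The score is s(x) = d/dmu log p(x | mu) = (x - mu) / sigma^2, and the
# closed-form Fisher information is I(mu) = 1 / sigma^2.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.5
x = rng.normal(mu, sigma, size=1_000_000)  # samples drawn at the true mu

score = (x - mu) / sigma**2                # score of each sample

print(np.mean(score))    # fact 1: ~0
print(np.var(score))     # fact 2: ~1/sigma^2 ~ 0.444
print(1 / sigma**2)      # closed-form Fisher information, for comparison
```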
2. Intuition: why variance = information?
Think about two extremes:
Case A: Flat or uninformative model
If changing θ barely affects the likelihood, then:
- The score is close to zero
- Different samples produce almost the same score
- Low variance ⇒ low Fisher information
You can’t really tell where θ is.
Case B: Sensitive model
If small changes in θ strongly affect the likelihood:
- Different samples pull θ in noticeably different directions
- The score varies a lot
- High variance ⇒ high Fisher information
The data “pushes back” strongly when θ is wrong.
So information is literally:
How violently does the model react to parameter changes across samples?
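To make the two cases concrete, here is a small sketch (again in the Gaussian mean-estimation setup, which is an assumption of this example, not something the post specifies): a wide Gaussian barely responds to shifts in μ and has low score variance, while a narrow Gaussian responds sharply and has high score variance.

```python
# Sketch: "flat" vs. "sensitive" model via empirical score variance.
# A wide Gaussian (large sigma) barely changes when mu shifts  -> Case A.
# A narrow Gaussian (small sigma) changes a lot when mu shifts -> Case B.
import numpy as np

rng = np.random.default_rng(1)
mu = 0.0

def empirical_fisher(sigma, n=500_000):
    """Empirical variance of the score (x - mu) / sigma^2 under the model."""
    x = rng.normal(mu, sigma, size=n)
    score = (x - mu) / sigma**2
    return np.var(score)

print(empirical_fisher(sigma=10.0))  # Case A: ~0.01 -> low Fisher information
print(empirical_fisher(sigma=0.1))   # Case B: ~100  -> high Fisher information
```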
3. Why this becomes geometry (not just statistics)
Here’s the leap that information geometry makes:
Instead of thinking of θ as a point in ℝⁿ, think of each θ as a probability distribution.
Now ask:
How different are p(x∣θ) and p(x∣θ+dθ)?
The answer (to second order) is:
$$ \mathrm{KL}\big(p_\theta \,\|\, p_{\theta + d\theta}\big) \approx \tfrac{1}{2}\, d\theta^\top I(\theta)\, d\theta $$
That means:
- Fisher information is the local quadratic form
- It defines an inner product
- Which defines a Riemannian metric
So curvature isn’t metaphorical—it’s literal curvature of the statistical manifold.
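Here is a quick numerical check of that second-order expansion, a sketch using a Bernoulli model (my choice, because its KL is not exactly quadratic, so the approximation is actually visible; for Bernoulli(p) the Fisher information is $1/(p(1-p))$):

```python
# Sketch: exact KL(p_theta || p_{theta+dtheta}) vs. the quadratic approximation
# 0.5 * I(theta) * dtheta^2, for a Bernoulli(p) model with I(p) = 1/(p(1-p)).
import numpy as np

def kl_bernoulli(p, q):
    """KL( Bernoulli(p) || Bernoulli(q) )."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

p, dp = 0.3, 0.01
fisher = 1.0 / (p * (1 - p))

print(kl_bernoulli(p, p + dp))   # exact KL          ~2.35e-4
print(0.5 * fisher * dp**2)      # quadratic approx  ~2.38e-4
```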
4. Score variance = curvature (intuitively)
Another way to say it:
- The score is a tangent vector on the manifold of distributions
- Its variance tells you how “spread out” these tangent vectors are
- High spread ⇒ sharp curvature
- Low spread ⇒ flat geometry
Flat geometry = parameters are hard to distinguish
Curved geometry = parameters are sharply identifiable
This is why:
- The natural gradient rescales the raw gradient by $I^{-1}$
- It moves along geodesics of the statistical manifold, not along straight lines in raw parameter space (a small numerical sketch follows this list)
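Here is a minimal sketch of that rescaling in action, assuming a Bernoulli likelihood (the model, learning rate, step count, and clipping are illustrative choices, not from the post): near the boundary of parameter space the raw gradient is enormous, while the natural gradient, rescaled by $I(p)^{-1} = p(1-p)$, takes sensible steps.

```python
# Sketch: plain gradient ascent vs. natural gradient ascent on the Bernoulli
# log-likelihood, starting near the boundary of parameter space where the
# Fisher metric blows up and raw gradients are huge.
import numpy as np

rng = np.random.default_rng(2)
data = rng.binomial(1, 0.8, size=1000)   # true parameter ~0.8
x_bar = data.mean()

def grad_loglik(p):
    # average of d/dp log p(x | p) = (x_bar - p) / (p * (1 - p))
    return (x_bar - p) / (p * (1 - p))

p_plain, p_natural = 0.01, 0.01          # both start near the boundary
lr, steps = 0.05, 200
for _ in range(steps):
    p_plain += lr * grad_loglik(p_plain)               # raw gradient step
    p_plain = float(np.clip(p_plain, 1e-6, 1 - 1e-6))  # keep it inside (0, 1)

    fisher = 1.0 / (p_natural * (1 - p_natural))
    p_natural += lr * grad_loglik(p_natural) / fisher  # = lr * (x_bar - p_natural)

print(p_plain)    # raw gradient: thrown to the boundary, bounces off the clip limits
print(p_natural)  # natural gradient: converges smoothly to x_bar ~ 0.8
```

The point is not the specific numbers but the rescaling: multiplying by $I^{-1}$ turns the badly scaled raw gradient $(\bar{x} - p)/(p(1-p))$ into the well-behaved update $\bar{x} - p$.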
5. Statistics isn’t just algebra, but geometry
What usually comes after this realization is:
- Understanding KL divergence as distance-like
- Seeing MLE as projection
- Seeing exponential families as flat manifolds
- Seeing why Euclidean intuition fails for probability spaces
Natural directions to go deeper from there:
- Walk through a 1D Gaussian example where Fisher info literally equals curvature (a short sketch of this one follows below)
- Explain why exponential families are special geometrically
- Connect this to deep learning and natural gradients
- Or translate this intuition into pure geometry language
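As a quick taste of that first direction, here is a sketch of the 1D Gaussian computation (standard textbook material, not worked out in the original post): with known σ and parameter μ, the variance of the score and the expected curvature of the log-likelihood are the same number.

```latex
% 1D Gaussian with known sigma, parameter mu:
\[
\log p(x \mid \mu) = -\frac{(x-\mu)^2}{2\sigma^2} - \tfrac{1}{2}\log(2\pi\sigma^2),
\qquad
s_\mu(x) = \partial_\mu \log p(x \mid \mu) = \frac{x-\mu}{\sigma^2}.
\]
% Variance of the score and negative expected second derivative coincide:
\[
\operatorname{Var}\!\big[s_\mu(x)\big] = \frac{\operatorname{Var}[x]}{\sigma^4} = \frac{1}{\sigma^2},
\qquad
-\,\mathbb{E}\!\left[\partial_\mu^2 \log p(x \mid \mu)\right] = \frac{1}{\sigma^2} = I(\mu).
\]
```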