The Exponential Family: Getting Weird Expectations!
I spent quite some time delving into the beauty of variational inference over the past month. I did not realize how simple and convenient it is to derive expectations of various forms (e.g., the logarithm) of random variables under variational distributions until I finally understood, at least partially, how to make use of the properties of the exponential family.
In the following, I would like to share just enough knowledge of the exponential family to help us obtain the weird expectations we may encounter in probabilistic graphical models. For complete coverage, I list at the end some online materials that were very helpful when I started to learn about these things.
Definition
We say a random variable $x$ follows an exponential family distribution if its probability density function can be written in the following form:

\begin{equation}
p(x \mid \eta) = h(x) \exp\left\{ \eta^\top T(x) - A(\eta) \right\}
\label{eq:pdf}
\end{equation}
where:
- $\eta$ is the vector of natural parameters
- $T(x)$ is the (possibly vector-valued) sufficient statistic
- $A(\eta)$ is the cumulant function
- $h(x)$ is the base measure
The presence of $T(x)$ and $A(\eta)$ actually confused me a lot when I first started to read materials on exponential family distributions. It turns out that they can be understood by digging a little deeper into this special form of the density function.
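Before digging deeper, it may help to keep one familiar member of the family in mind; the Bernoulli distribution with mean $\mu$, for example, already has exactly this shape:

\begin{equation}
p(x \mid \mu) = \mu^x (1 - \mu)^{1 - x}
= \exp\left\{ x \log \frac{\mu}{1 - \mu} + \log(1 - \mu) \right\}
\end{equation}

so that $\eta = \log \frac{\mu}{1 - \mu}$, $T(x) = x$, $A(\eta) = -\log(1 - \mu) = \log(1 + e^{\eta})$, and $h(x) = 1$.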
Fascinating properties
We start by focusing on the properties of $A(\eta)$. By the definition of a probability density, the integral of the density in Equation \eqref{eq:pdf} over $x$ equals 1:

\begin{equation}
\int h(x) \exp\left\{ \eta^\top T(x) - A(\eta) \right\} dx = 1
\quad \Longrightarrow \quad
A(\eta) = \log \int h(x) \exp\left\{ \eta^\top T(x) \right\} dx
\label{eq:a_eta}
\end{equation}
From the equation above, we can tell that $A(\eta)$ can be viewed as the logarithm of the normalizer. Its form is determined by $T(x)$ and $h(x)$.
An interesting thing happens now: if we take the derivative of Equation \eqref{eq:a_eta} with respect to $\eta$, we have:

\begin{equation}
\frac{\partial A(\eta)}{\partial \eta}
= \frac{\int T(x)\, h(x) \exp\left\{ \eta^\top T(x) \right\} dx}{\int h(x) \exp\left\{ \eta^\top T(x) \right\} dx}
= \int T(x)\, h(x) \exp\left\{ \eta^\top T(x) - A(\eta) \right\} dx
= \mathbb{E}\left[ T(x) \right]
\end{equation}
How neat! It turns out that the first derivative of the cumulant function $A(\eta)$ is exactly the expectation of the sufficient statistic $T(x)$! If we go further and take the second derivative, we get the variance of $T(x)$:

\begin{equation}
\frac{\partial^2 A(\eta)}{\partial \eta\, \partial \eta^\top}
= \mathbb{E}\left[ T(x)\, T(x)^\top \right] - \mathbb{E}\left[ T(x) \right] \mathbb{E}\left[ T(x) \right]^\top
= \mathrm{Var}\left[ T(x) \right]
\end{equation}
In fact, we can keep differentiating to generate the higher-order cumulants of $T(x)$, and from those its higher-order moments.
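If the algebra feels too slick to trust, a quick numerical check helps. Below is a small sketch (assuming NumPy is available; the particular value of $\eta$ is arbitrary) that uses the Bernoulli distribution, whose cumulant function is $A(\eta) = \log(1 + e^{\eta})$ with $T(x) = x$, and compares finite-difference derivatives of $A$ against the known mean and variance:

```python
import numpy as np

# Bernoulli as an exponential family: T(x) = x, h(x) = 1,
# natural parameter eta = log(mu / (1 - mu)), cumulant A(eta) = log(1 + exp(eta)).
def A(eta):
    return np.log1p(np.exp(eta))

eta = 0.7                          # an arbitrary natural parameter
mu = 1.0 / (1.0 + np.exp(-eta))    # the corresponding mean, mu = sigmoid(eta)

# Finite-difference approximations of the first and second derivatives of A.
eps = 1e-4
first = (A(eta + eps) - A(eta - eps)) / (2 * eps)
second = (A(eta + eps) - 2 * A(eta) + A(eta - eps)) / eps ** 2

print(first, mu)                   # A'(eta)  should match E[T(x)]   = mu
print(second, mu * (1 - mu))       # A''(eta) should match Var[T(x)] = mu * (1 - mu)
```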
Examples
Let’s end by taking a look at some familiar probability distributions that belong to the exponential family.
Dirichlet distribution
Suppose a random variable $\theta$ is drawn from a Dirichlet distribution parameterized by $\alpha$; then we have the following density function:

\begin{equation}
p(\theta \mid \alpha)
= \frac{\Gamma\left( \sum_k \alpha_k \right)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1}
= \exp\left\{ \sum_k (\alpha_k - 1) \log \theta_k - \left( \sum_k \log \Gamma(\alpha_k) - \log \Gamma\left( \sum_k \alpha_k \right) \right) \right\}
\end{equation}

This simple transformation turns the Dirichlet density function into the exponential family form, where:
- $\eta = \alpha - 1$
- $A(\eta) = \sum_k \log \Gamma(\alpha_k) - \log \Gamma(\sum_k \alpha_k)$
- $T(\theta) = \log \theta$ (applied element-wise)
Such information is helpful when we want to compute the expectation of the log of a random variable that follows a Dirichlet distribution (this happens, e.g., in the derivation of latent Dirichlet allocation with variational inference):

\begin{equation}
\mathbb{E}\left[ \log \theta_k \right]
= \frac{\partial A(\eta)}{\partial \eta_k}
= \frac{\partial A(\eta)}{\partial \alpha_k}
= \Psi(\alpha_k) - \Psi\left( \sum_{k'} \alpha_{k'} \right)
\end{equation}

where $\Psi(\cdot)$ is the digamma function, the first derivative of the log of the gamma function.
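To convince ourselves that the identity holds, here is a small Monte Carlo sketch (assuming NumPy and SciPy are available; the Dirichlet parameters are arbitrary) that compares a sampling estimate of $\mathbb{E}[\log \theta_k]$ with the digamma expression:

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(seed=0)
alpha = np.array([2.0, 0.5, 3.0])            # arbitrary Dirichlet parameters

theta = rng.dirichlet(alpha, size=200_000)   # samples theta ~ Dir(alpha), shape (N, K)
mc_estimate = np.log(theta).mean(axis=0)     # Monte Carlo estimate of E[log theta_k]
closed_form = digamma(alpha) - digamma(alpha.sum())

print(mc_estimate)
print(closed_form)   # the two vectors should agree to a couple of decimal places
```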
Gamma distribution
The Gamma distribution has two parameters: $\alpha$, which controls the shape, and $\beta$, which controls the rate (inverse scale). Accordingly, we can read its density in two ways, each with its own pairing of natural parameter and sufficient statistic. Suppose $x \sim Gamma(\alpha, \beta)$; we have:

\begin{equation}
p(x \mid \alpha, \beta)
= \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha - 1} e^{-\beta x}
= \exp\left\{ (\alpha - 1) \log x - \beta x - \left( \log \Gamma(\alpha) - \alpha \log \beta \right) \right\}
\end{equation}

We can either take $\eta = -\beta$ with $T(x) = x$ (absorbing $x^{\alpha - 1}$ into $h(x)$), or take $\eta = \alpha - 1$ with $T(x) = \log x$ (absorbing $e^{-\beta x}$ into $h(x)$).
Although it is obvious that $A(\eta) = \log \Gamma(\alpha) - \alpha \log \beta$ in both cases, the natural parameter $\eta$ is different. The first transformation helps us get the expectation of $x$ itself:

\begin{equation}
\mathbb{E}\left[ x \right]
= \frac{\partial A(\eta)}{\partial (-\beta)}
= -\frac{\partial}{\partial \beta}\left( \log \Gamma(\alpha) - \alpha \log \beta \right)
= \frac{\alpha}{\beta}
\end{equation}
Similarly, the second one helps get the expectation of $\log x$:

\begin{equation}
\mathbb{E}\left[ \log x \right]
= \frac{\partial A(\eta)}{\partial (\alpha - 1)}
= \frac{\partial}{\partial \alpha}\left( \log \Gamma(\alpha) - \alpha \log \beta \right)
= \Psi(\alpha) - \log \beta
\end{equation}
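The same Monte Carlo sanity check works here as well (again just a sketch assuming NumPy and SciPy; note that NumPy parameterizes the Gamma distribution by shape and scale, so the scale is $1/\beta$ in our rate parameterization):

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(seed=0)
alpha, beta = 3.0, 2.0                       # shape alpha, rate beta

# NumPy's Gamma sampler takes shape and *scale*, so scale = 1 / beta.
x = rng.gamma(shape=alpha, scale=1.0 / beta, size=200_000)

print(x.mean(), alpha / beta)                            # E[x]     = alpha / beta
print(np.log(x).mean(), digamma(alpha) - np.log(beta))   # E[log x] = Psi(alpha) - log(beta)
```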
Conclusion
In fact, there is much more to explore and know about the exponential family! Important concepts such as convexity and sufficiency are not discussed here. Finally, I would recommend the following excellent materials for getting to know this cool concept, which unifies a whole set of probability distributions:
- Sections 4.2.4 and 10.4 of Pattern Recognition and Machine Learning
- http://www.cs.columbia.edu/~jebara/4771/tutorials/lecture12.pdf
- https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter8.pdf