# The Exponential Family: Getting Weird Expectations!

I spent quite some time in the past month delving into the beauty of variational inference. I did not realize how simple and convenient it is to derive expectations of various forms (e.g., the logarithm) of random variables under variational distributions until I finally came to understand (at least partially) how to make use of the properties of the exponential family.

In the following, I would like to share **just enough** knowledge of the exponential family to help us obtain the weird expectations that we may encounter in probabilistic graphical models. For more complete coverage, I list some online materials at the end that were very helpful when I started to learn about these things.

### Definition

We say a random variable $x$ follows an exponential family distribution if its probability density function can be written in the following form:

$$
p(x \mid \eta) = h(x) \, \exp\left\{ \eta^\top T(x) - A(\eta) \right\}
\label{eq:pdf}
$$

where:

- $\eta$ is a vector of *natural parameters*
- $T(x)$ is the *sufficient statistic*
- $A(\eta)$ is the *cumulant function*
- $h(x)$ is a base measure that does not depend on $\eta$

The presence of $T(x)$ and $A(\eta)$ confused me a lot when I first started to read materials on the exponential family of distributions. It turns out that they can be understood by digging a little deeper into this special formulation of density functions.

### Fascinating properties

We start by focusing on the properties of $A(\eta)$. By the definition of a probability density, the integral of Equation \eqref{eq:pdf} over $x$ equals 1, which lets us solve for $A(\eta)$:

$$
\int h(x) \, \exp\left\{ \eta^\top T(x) - A(\eta) \right\} dx = 1
\quad \Longrightarrow \quad
A(\eta) = \log \int h(x) \, \exp\left\{ \eta^\top T(x) \right\} dx
\label{eq:a_eta}
$$

From the equation above, we can tell that $A(\eta)$ can be viewed as the logarithm of the normalizer. Its value depends on $T(x)$ and $h(x)$.

An interesting thing happens now: if we take the derivative of Equation \eqref{eq:a_eta} with respect to $\eta$, we have:

$$
\frac{\partial A(\eta)}{\partial \eta}
= \frac{\int h(x) \, T(x) \, \exp\left\{ \eta^\top T(x) \right\} dx}{\int h(x) \, \exp\left\{ \eta^\top T(x) \right\} dx}
= \int T(x) \, p(x \mid \eta) \, dx
= \mathbb{E}[T(x)]
$$

How neat! The first derivative of the cumulant function $A(\eta)$ is actually the expectation of the sufficient statistic $T(x)$. If we go further and take the second derivative, we get the variance of $T(x)$:

$$
\frac{\partial^2 A(\eta)}{\partial \eta \, \partial \eta^\top}
= \mathbb{E}\left[T(x) \, T(x)^\top\right] - \mathbb{E}[T(x)] \, \mathbb{E}[T(x)]^\top
= \mathrm{Var}[T(x)]
$$

In fact, we can keep differentiating to generate higher-order cumulants, and hence higher-order moments, of $T(x)$.
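As a quick sanity check of these two properties, here is a small sketch using the Bernoulli distribution, a standard exponential-family example with $\eta = \log\frac{p}{1-p}$, $T(x) = x$, and $A(\eta) = \log(1 + e^\eta)$; the parameter value $p = 0.3$ is just an arbitrary choice for illustration:

```python
import math

# Bernoulli(p) in exponential-family form (a standard textbook example):
#   eta = log(p / (1 - p)),  T(x) = x,  A(eta) = log(1 + e^eta)
p = 0.3
eta = math.log(p / (1 - p))
A = lambda e: math.log(1.0 + math.exp(e))

# Approximate the derivatives of the cumulant function by finite differences
h = 1e-5
dA = (A(eta + h) - A(eta - h)) / (2 * h)              # ~ E[T(x)] = p
d2A = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2   # ~ Var[T(x)] = p(1-p)

print(dA)   # close to 0.3
print(d2A)  # close to 0.21
```

The first derivative recovers the mean $p$ and the second recovers the variance $p(1-p)$, exactly as the two identities above promise.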

### Examples

Let's end by taking a look at some familiar probability distributions that belong to the exponential family.

#### Dirichlet distribution

Suppose a random variable $\theta$ is drawn from a Dirichlet distribution parameterized by $\alpha$; then we have the following density function:

$$
p(\theta \mid \alpha)
= \frac{\Gamma\left(\sum_k \alpha_k\right)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1}
= \exp\left\{ \sum_k (\alpha_k - 1) \log \theta_k - \left( \sum_k \log \Gamma(\alpha_k) - \log \Gamma\left(\sum_k \alpha_k\right) \right) \right\}
$$

This simple transformation turns the Dirichlet density function into the form of the exponential family, where:

- $\eta = \alpha - 1$
- $A(\eta) = \sum_k \log \Gamma(\alpha_k) - \log \Gamma\left(\sum_k \alpha_k\right)$
- $T(\theta) = \log \theta$ (elementwise)

Such information is helpful when we want to compute the expectation of the log of a random variable that follows a Dirichlet distribution (this happens, e.g., in the derivation of latent Dirichlet allocation with variational inference). Differentiating $A(\eta)$ with respect to $\eta_k = \alpha_k - 1$ gives:

$$
\mathbb{E}[\log \theta_k]
= \frac{\partial A(\eta)}{\partial \eta_k}
= \Psi(\alpha_k) - \Psi\left(\sum_j \alpha_j\right)
$$

where $\Psi(\cdot)$, the first derivative of the log gamma function, is called the digamma function.
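We can sanity-check this Dirichlet expectation with a quick Monte Carlo simulation. This sketch uses only the standard library (digamma is approximated by a finite difference of `math.lgamma`), and the parameter vector `alpha` is an arbitrary choice for illustration:

```python
import math
import random

# Monte Carlo check of E[log theta_k] = digamma(alpha_k) - digamma(sum(alpha))
# for a Dirichlet distribution (hypothetical parameter choice below).
alpha = [2.0, 3.0, 5.0]

# Digamma via a central finite difference of log-gamma (stdlib only)
def digamma(x, h=1e-5):
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

expected = [digamma(a) - digamma(sum(alpha)) for a in alpha]

# Sample from Dirichlet(alpha) by normalizing independent Gamma draws
random.seed(0)
n = 200_000
sums = [0.0, 0.0, 0.0]
for _ in range(n):
    g = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    for k in range(3):
        sums[k] += math.log(g[k] / total)
estimated = [s / n for s in sums]

print(expected)
print(estimated)  # should agree with `expected` to roughly two decimal places
```

The sampling trick here (normalizing independent Gamma draws) is the standard way to simulate a Dirichlet without extra dependencies.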

#### Gamma distribution

The Gamma distribution has two parameters: $\alpha$, which controls the shape, and $\beta$, which controls the rate (inverse scale). Suppose $x \sim \mathrm{Gamma}(\alpha, \beta)$; then we have:

$$
p(x \mid \alpha, \beta)
= \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}
= \exp\left\{ (\alpha - 1) \log x - \beta x - \left( \log \Gamma(\alpha) - \alpha \log \beta \right) \right\}
$$

Since both $x$ and $\log x$ appear in the exponent, we can find two sets of $\eta$ and $T(x)$. Although it is obvious that $A(\eta) = \log \Gamma(\alpha) - \alpha \log \beta$ in both situations, the natural parameter $\eta$ is different. The first choice, $\eta = -\beta$ with $T(x) = x$, helps us get the expectation of $x$ itself:

$$
\mathbb{E}[x]
= \frac{\partial A(\eta)}{\partial \eta}
= \frac{\partial}{\partial (-\beta)} \left( \log \Gamma(\alpha) - \alpha \log \beta \right)
= \frac{\alpha}{\beta}
$$

Similarly, the second choice, $\eta = \alpha - 1$ with $T(x) = \log x$, helps us get the expectation of $\log x$:

$$
\mathbb{E}[\log x]
= \frac{\partial A(\eta)}{\partial \eta}
= \frac{\partial}{\partial (\alpha - 1)} \left( \log \Gamma(\alpha) - \alpha \log \beta \right)
= \Psi(\alpha) - \log \beta
$$
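Both Gamma expectations can likewise be checked by simulation. A small sketch with arbitrary illustrative parameters, again using only the standard library (note that `random.gammavariate` takes a *scale* parameter, i.e. $1/\beta$):

```python
import math
import random

# Monte Carlo check of the two Gamma-distribution expectations:
#   E[x] = alpha / beta   and   E[log x] = digamma(alpha) - log(beta)
# (hypothetical shape alpha and rate beta chosen for illustration)
alpha, beta = 3.0, 2.0

random.seed(0)
n = 200_000
# random.gammavariate takes (shape, scale); scale = 1 / rate
samples = [random.gammavariate(alpha, 1.0 / beta) for _ in range(n)]

mean_x = sum(samples) / n
mean_log_x = sum(math.log(x) for x in samples) / n

# Digamma via a central finite difference of log-gamma
digamma = lambda x, h=1e-5: (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

print(mean_x)      # should be close to alpha / beta = 1.5
print(mean_log_x)  # should be close to digamma(alpha) - log(beta)
```

The two printed averages match the closed forms derived from the two $(\eta, T(x))$ choices above.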

### Conclusion

In fact, there is much more to explore and learn about the exponential family! Important concepts such as convexity and sufficiency are not discussed here. Finally, I recommend the following excellent materials for getting to know this cool way of unifying a set of probability distributions:

- Chapter 4.2.4 and 10.4 of Pattern Recognition and Machine Learning
- http://www.cs.columbia.edu/~jebara/4771/tutorials/lecture12.pdf
- https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter8.pdf