Conjugate distributions to the model



Given a model \(\{S_x \ ; \ \Psi (\underline{x} \mid \theta ) \ ; \ S_\theta = \Theta \}\), the parametric class \(D\) of distributions for \(\theta\) is said to be conjugate to the model if, choosing the prior in \(D\), the posterior also belongs to \(D\) for every value of \(\underline{x}\).

In formal terms: given

\(\underline{x} \sim f(\underline{x}; \theta)\) with \(n\) i.i.d. trials,

we have the induced model:

\[ \{S_x \ ; \ \Psi (\underline{x} \mid \theta ) \ ; \ S_\theta = \Theta \} \]

If \(f(\underline{x}; \theta)\) belongs to the exponential family:

\[ f(\underline{x} \mid \theta) = D(\underline{x}) \, \exp\{b(\theta) \, g(\underline{x}) - c(\theta)\} \]

then the prior can be given a "standard" density function (one belonging to the exponential family, e.g. Normal, Gamma, Beta, ...):

\[ \pi(\theta) \propto \exp\{\eta_1 \, b(\theta) - \eta_2 \, c(\theta)\} \, \big| \partial b(\theta) / \partial \theta \big| \sim_{\text{kernel}} \text{standard density function} \]

and the posterior will then have the same "standard" density function as the prior, only with different parameters:

\[\pi(\theta \mid \underline{x}) = \frac{ f(\underline{x} \mid \theta) \, \pi(\theta)}{\int_{\Theta}{f(\underline{x} \mid \theta) \, \pi(\theta) \, d\theta}} \propto f(\underline{x} \mid \theta) \, \pi(\theta) \sim_{\text{kernel}} \text{standard density function}\]

We have thus found the conjugate distribution to the model.
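To make the definition concrete, here is a minimal numerical sketch (the function and variable names are mine, not from the text): the posterior is proportional to likelihood times prior, and for a conjugate pair the grid-normalized product coincides with a closed-form member of the prior's family. It uses the Bernoulli-Beta pair proved later in this article.

```python
import numpy as np
from scipy import stats

def grid_posterior(likelihood, prior_pdf, grid):
    """Normalize likelihood(theta) * prior_pdf(theta) numerically on a grid."""
    unnorm = likelihood(grid) * prior_pdf(grid)
    return unnorm / (unnorm.sum() * (grid[1] - grid[0]))

x = np.array([1, 0, 1, 1, 0, 1])           # observed Bernoulli trials
alpha, beta = 2.0, 2.0                     # Beta prior hyperparameters
theta = np.linspace(1e-4, 1 - 1e-4, 5000)  # grid over Theta = (0, 1)

post_grid = grid_posterior(
    lambda t: t ** x.sum() * (1 - t) ** (len(x) - x.sum()),  # likelihood
    stats.beta(alpha, beta).pdf,                             # prior in D
    theta,
)
# Closed-form posterior: same Beta family, updated hyperparameters.
post_exact = stats.beta(alpha + x.sum(), beta + len(x) - x.sum()).pdf(theta)
print(np.max(np.abs(post_grid - post_exact)))  # ~0, up to grid error
```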

Some of the most common conjugate distributions:

| Distribution name | Base model \(f(\underline{x} \mid \theta)\) | Conjugate class \(\pi(\theta)\) | Posterior \(\pi(\theta \mid \underline{x})\) |
| --- | --- | --- | --- |
| Uniform | \(Uniform(0, \theta)\) | \(Pareto(\alpha, \beta)\) | \(Pareto(\alpha + n, \ \max\{\beta, x_{(n)}\})\) |
| Bernoulli | \(Be(\theta)\) | \(Beta(\alpha, \beta)\) | \(Beta(\alpha + \sum x_i, \ \beta + n - \sum x_i)\) |
| Poisson | \(Pois(\theta)\) | \(Gamma(\alpha, \beta)\) | \(Gamma(\alpha + \sum x_i, \ \beta + n)\) |
| Exponential | \(Exp(\theta)\) | \(Gamma(\alpha, \beta)\) | \(Gamma(\alpha + n, \ \beta + \sum x_i)\) |
| Exponential | \(Exp(\frac{1}{\theta})\) | \(InverseGamma(\alpha, \beta)\) | \(InverseGamma(\alpha + n, \ \beta + \sum x_i)\) |
| Normal | \(Normal(\mu, \ \sigma^2 \text{ known})\) | \(Normal(\mu_0, \sigma^2_0)\) | \(Normal\big(\frac{\mu_0 \sigma^2 + \sigma^2_0 \sum x_i}{n \sigma^2_0 + \sigma^2}, \ \frac{\sigma^2 \sigma^2_0}{n \sigma^2_0 + \sigma^2}\big)\) |
| Normal | \(Normal(\mu \text{ known}, \ \sigma^2)\) | \(InverseGamma(\alpha, \beta)\) | \(InverseGamma\big(\alpha + \frac{n}{2}, \ \beta + \frac{\sum (x_i - \mu)^2}{2}\big)\) |
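As a quick illustration of how these updates are used in practice, here is a small sketch (the function names are mine) implementing two rows of the table as plain hyperparameter maps:

```python
import numpy as np

def beta_update(alpha, beta, x):
    """Bernoulli likelihood, Beta(alpha, beta) prior -> Beta posterior."""
    x = np.asarray(x)
    return alpha + x.sum(), beta + len(x) - x.sum()

def gamma_update_poisson(alpha, beta, x):
    """Poisson likelihood, Gamma(alpha, beta) prior -> Gamma posterior."""
    x = np.asarray(x)
    return alpha + x.sum(), beta + len(x)

print(beta_update(1, 1, [1, 0, 1, 1]))        # Beta(4, 2)
print(gamma_update_poisson(2, 1, [3, 0, 2]))  # Gamma(7, 4)
```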

Let's see some proofs of these relationships:

Easy example: Bernoulli-Beta Model

\(X \sim Be(\theta)\) with \(n\) independent trials, where: \[ f(x; \theta) = \theta^x \, (1 - \theta)^{1-x} \] So the induced model becomes:

\[\big\{ S_x = \{0,1\}^{n} \ ; \ \theta^{\sum_{i=1}^{n} x_i} \, (1-\theta)^{n-\sum_{i=1}^{n} x_i} \ ; \ \Theta = (0,1) \big\}\]

The maximum likelihood estimate is:

\[MLE =\hat\theta = \frac{\sum_{i=1}^{n}x_i}{n}\]
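A quick simulation check of this estimate (a sketch; the seed, sample size, and true \(\theta\) are arbitrary test values): for Bernoulli trials the MLE is just the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.3
x = rng.binomial(1, theta_true, size=10_000)  # n i.i.d. Bernoulli trials
print(x.mean())                               # close to 0.3
```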

And the conjugate prior for this model is a \(Beta(\alpha,\beta)\)

Proof

First we must rewrite the Bernoulli density function so that we can recognize the different components of the exponential family formula:

\[f(x; \theta) = \theta^x \, (1 - \theta)^{1-x} \\ f(x; \theta) = \exp\{\log(\theta^x \, (1 - \theta)^{1-x})\} \\ f(x; \theta) = \exp\Big\{x \, \log\Big(\frac{\theta}{1 - \theta}\Big) + \log(1 - \theta)\Big\}\]


Review of the exponential family:

\[X \sim \text{Exponential family} \\ f(x; \theta) = D(x) \, \exp\{b(\theta) \, g(x) - c(\theta)\}\]


Then the conjugate prior construction yields, in this case:

\[\pi(\theta) \propto \exp\{\eta_1 \, b(\theta) - \eta_2 \, c(\theta)\} \, \big| \partial b(\theta) / \partial \theta \big| \sim_{\text{kernel}} \text{Beta density function}\]

Indeed:

\[b(\theta) = \log \bigg( \frac{\theta}{1-\theta} \bigg) \quad \text{(i.e. the logit of } \theta\text{)} \\ c(\theta) = - \log(1 - \theta) \\ \bigg| \frac{\partial b(\theta)}{\partial \theta} \bigg| = \frac{1}{\theta \, (1-\theta)}\]

We create the conjugate prior:

\[\pi(\theta) \propto \exp\{\eta_1 \, b(\theta) - \eta_2 \, c(\theta)\} \, \big| \partial b(\theta) / \partial \theta \big| = \\ = \exp\Big\{\eta_1 \log \Big( \frac{\theta}{1-\theta} \Big) + \eta_2 \log(1 - \theta)\Big\} \, \frac{1}{\theta \, (1-\theta)} = \\ = \exp\Big\{\log \Big( \frac{\theta}{1-\theta} \Big)^{\eta_1}\Big\} \, \exp\big\{\log(1 - \theta)^{\eta_2}\big\} \, \frac{1}{\theta \, (1-\theta)} = \\ = \Big( \frac{\theta}{1-\theta} \Big)^{\eta_1} \, (1 - \theta)^{\eta_2} \, \frac{1}{\theta \, (1-\theta)} = \\ = \theta^{\eta_1} \, (1-\theta)^{-\eta_1} \, (1 - \theta)^{\eta_2} \, \theta^{-1} \, (1-\theta)^{-1} = \\ = \theta^{\eta_1 - 1} \, (1-\theta)^{\eta_2 - \eta_1 - 1}\]

This is the kernel of a \(Beta(\eta_1, \ \eta_2 - \eta_1)\)

Now let's calculate the posterior by applying Bayes' theorem, as seen in the introduction:

\[\pi(\theta \mid \underline{x}) = \frac{ f(\underline{x} \mid \theta) \, \pi(\theta)}{\int_{\Theta}{f(\underline{x} \mid \theta) \, \pi(\theta) \, d\theta}} \propto f(\underline{x} \mid \theta) \, \pi(\theta) = \\ = \theta^{\sum_{i=1}^{n} x_i + \alpha - 1} \, (1-\theta)^{n - \sum_{i=1}^{n} x_i + \beta - 1}\]

This is the kernel of a \(Beta\big( \alpha + \sum_{i=1}^{n} x_i \ , \ \beta + n - \sum_{i=1}^{n} x_i \big)\)

Therefore, since the prior has the same functional form as the posterior, the Beta is the conjugate distribution to the Bernoulli model.
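One practical payoff of this conjugacy, sketched below with illustrative numbers: because the posterior is again a Beta, it can serve as the prior for the next batch of data, and sequential updating matches a single batch update.

```python
import numpy as np

def beta_update(alpha, beta, x):
    """Bernoulli likelihood, Beta(alpha, beta) prior -> Beta posterior."""
    x = np.asarray(x)
    return alpha + x.sum(), beta + len(x) - x.sum()

batch1, batch2 = [1, 1, 0], [0, 1, 1, 1]
a, b = 2.0, 2.0                                # Beta(2, 2) prior
a, b = beta_update(a, b, batch1)               # posterior after batch 1 ...
a, b = beta_update(a, b, batch2)               # ... reused as prior for batch 2
print((a, b))                                  # Beta(7, 4)
print(beta_update(2.0, 2.0, batch1 + batch2))  # same result in one batch
```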



A slightly more challenging example: Normal-Inverse Gamma Model

\(X \sim N(\mu = m, \ \theta = \sigma^2)\) (with \(m\) known) with \(n\) independent trials, where: \[ f(x; \theta) = \frac{1}{\sqrt{2 \pi \sigma^2}} \, \exp{\bigg\{-\frac{(x-m)^2}{2 \sigma^2}\bigg\}} \] So the induced model becomes:

\[\biggl\{ S_x = R^{n} \ ; \ \bigg( \frac{1}{\sqrt{2 \pi \sigma^2}}\bigg)^n \exp{\bigg\{-\frac{\sum_{i=1}^{n}(x_i - m)^2}{2 \sigma^2}\bigg\}} \ ; \ \Theta = R_+ \biggl\}\]

The maximum likelihood estimate is:

\[MLE = \hat{\sigma}^2 = \frac{\sum_{i=1}^{n}(x_i - m)^2}{n}\]
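A quick simulation check of this estimate (a sketch; the seed, sample size, and true values are arbitrary): with the mean \(m\) known, the MLE averages the squared deviations around \(m\), dividing by \(n\).

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma2_true = 1.0, 4.0
x = rng.normal(m, np.sqrt(sigma2_true), size=100_000)
print(np.mean((x - m) ** 2))  # close to 4.0
```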

And the conjugate prior for this model is an \(InverseGamma(\alpha,\beta)\)

Proof

First we must rewrite the Normal density function so that we can recognize the different components of the exponential family formula:

\[f(x; \theta) \propto \frac{1}{\sqrt{\sigma^2}} \, \exp{\bigg\{-\frac{(x-m)^2}{2 \sigma^2}\bigg\}} = \\ = \exp{\bigg\{-\frac{(x-m)^2}{2 \sigma^2} - \frac{1}{2} \log(\sigma^2)\bigg\}} \]


Review of the exponential family:

\[X \sim \text{Exponential family} \\ f(x; \theta) = D(x) \, \exp\{b(\theta) \, g(x) - c(\theta)\}\]


Then the conjugate prior construction yields, in this case:

\[\pi(\theta) \propto \exp\{\eta_1 \, b(\theta) - \eta_2 \, c(\theta)\} \, \big| \partial b(\theta) / \partial \theta \big| \sim_{\text{kernel}} \text{Inverse Gamma density function}\]

Indeed:

\[b(\sigma^2) = -\frac{1}{2 \sigma^2}\\ c(\sigma^2) = \frac{1}{2} \log(\sigma^2) \\ \bigg| \frac{\partial b(\sigma^2)}{\partial \sigma^2} \bigg| = \frac{1}{2 \, (\sigma^2)^2}\]

We create the conjugate prior:

\[\pi(\sigma^2) \propto \exp\{\eta_1 \, b(\sigma^2) - \eta_2 \, c(\sigma^2)\} \, \big| \partial b(\sigma^2) / \partial \sigma^2 \big| = \\ = \exp\Big\{-\eta_1 \frac{1}{2 \sigma^2} - \eta_2 \frac{1}{2} \log(\sigma^2)\Big\} \, \frac{1}{2 \, (\sigma^2)^2} \propto \\ \propto (\sigma^2)^{-\frac{1}{2} \eta_2} \, (\sigma^2)^{-2} \exp\Big\{-\frac{\eta_1}{2 \sigma^2}\Big\} = \\ = (\sigma^2)^{-2 - \frac{1}{2} \eta_2} \exp\Big\{-\frac{\eta_1}{2 \sigma^2}\Big\}\]

This is the kernel of an \(InverseGamma\big(1 + \frac{1}{2} \eta_2, \ \frac{\eta_1}{2}\big)\)

Now let's calculate the posterior by applying Bayes' theorem, as seen in the introduction:

\[\pi(\sigma^2 \mid \underline{x}) = \frac{ f(\underline{x} \mid \sigma^2) \, \pi(\sigma^2)}{\int_{\Theta}{f(\underline{x} \mid \sigma^2) \, \pi(\sigma^2) \, d\sigma^2}} \propto f(\underline{x} \mid \sigma^2) \, \pi(\sigma^2) = \\ = \bigg( \frac{1}{\sqrt{2 \pi \sigma^2}}\bigg)^n \exp{\bigg\{-\frac{\sum_{i=1}^{n}(x_i - m)^2}{2 \sigma^2}\bigg\}} \, (\sigma^2)^{-(\alpha + 1)} \, \exp{\bigg\{-\frac{\beta}{\sigma^2}\bigg\}} \propto \\ \propto (\sigma^2)^{-\frac{n}{2}} \exp{\bigg\{-\frac{\sum_{i=1}^{n}(x_i - m)^2}{2 \sigma^2}\bigg\}} \, (\sigma^2)^{-(\alpha + 1)} \, \exp{\bigg\{-\frac{\beta}{\sigma^2}\bigg\}} = \\ = (\sigma^2)^{-(\alpha + \frac{n}{2} + 1)} \, \exp{\bigg\{-\frac{\beta + \frac{1}{2}\sum_{i=1}^{n}(x_i - m)^2}{\sigma^2}\bigg\}} \]

This is the kernel of an \(InverseGamma \bigg(\alpha + \frac{n}{2}, \ \beta + \frac{\sum_{i=1}^{n}(x_i - m)^2}{2} \bigg)\)

Therefore, since the prior has the same functional form as the posterior, the Inverse Gamma is the conjugate distribution to the Normal model with known mean and unknown variance.
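As with the Bernoulli-Beta case, this last result can be checked numerically. The sketch below uses arbitrary test values; note that scipy's `invgamma` takes our \(\beta\) as its `scale` parameter.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
m, alpha, beta = 0.0, 3.0, 2.0
x = rng.normal(m, 1.5, size=50)       # data with known mean m
ss = np.sum((x - m) ** 2)

s2 = np.linspace(0.5, 8.0, 4000)      # grid over Theta = R_+
unnorm = (s2 ** (-len(x) / 2) * np.exp(-ss / (2 * s2))    # likelihood kernel
          * stats.invgamma(alpha, scale=beta).pdf(s2))    # InverseGamma prior
post_grid = unnorm / (unnorm.sum() * (s2[1] - s2[0]))     # normalize numerically

# Closed-form posterior with the updated hyperparameters derived above.
post_exact = stats.invgamma(alpha + len(x) / 2, scale=beta + ss / 2).pdf(s2)
print(np.max(np.abs(post_grid - post_exact)))             # ~0, up to grid error
```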