Gaussian Mixture for Multimodal Posteriors

Approximating a multi-modal posterior by a mixture of normals

Author

Soma S. Dhavala

Gaussian Mixture for Multimodal Posteriors

Sometimes we need to draw samples from a posterior density with more than one mode. In such cases, a single normal approximation may be inadequate. A more flexible approach is to approximate the posterior as a mixture of normal densities. The number of modes, and therefore the number of mixture components, is assumed unknown.

The mixture model in [1] estimates model parameters from data. In the present setting, the posterior density is available in semi-closed form, so the problem becomes a special case of that framework.

Let \(f(\theta)\) be the known arbitrary PDF which we want to approximate. For illustration purposes, we assume a bivariate density (\(\theta = [\theta_1, \theta_2]^T\)) and discretize \(\theta\) on a uniform \(N \times M\) grid. Further let \(\theta_{mn}\) represent the \((\theta_1(n), \theta_2(m))\) sample in the grid. Also assume that the discretized PDF (\(f(\theta_{mn})\)) is appropriately normalized.

\[ f(\theta_{mn}; \Phi) = \sum_{k=1}^{K} \lambda_k f_k(\theta_{mn}; \phi_k) \tag{1} \]

where

\(K\) is the total number of components (unknown)
\(\phi_k = [\mu_k, \Sigma_k]\) represent the mean vector and covariance matrix of the \(k\)th density
\(f_k \sim \mathcal{N}(\mu_k, \Sigma_k)\)
\(\{\lambda_k\}\) represent the mixing weights with constraint \(\sum_{k=1}^{K} \lambda_k = 1\)
\(\Phi = [\phi_1 \ldots \phi_K, \lambda_1 \ldots \lambda_K]\)

Then we obtain the estimates for the mixture model parameters as:

\[ \hat{\lambda}_{k,\text{new}} = \sum_{n=1}^{N} \sum_{m=1}^{M} \gamma_k(\theta_{mn}; \hat{\Phi}) \tag{2} \]

\[ \hat{\mu}_{k,\text{new}} = \frac{1}{\hat{\lambda}_{k,\text{new}}} \sum_{n=1}^{N} \sum_{m=1}^{M} \gamma_k(\theta_{mn}; \hat{\Phi}) \theta_{mn} \tag{3} \]

\[ \hat{\Sigma}_{k,\text{new}} = \frac{1}{\hat{\lambda}_{k,\text{new}}} \sum_{n=1}^{N} \sum_{m=1}^{M} \gamma_k(\theta_{mn}; \hat{\Phi}) (\theta_{mn} - \hat{\mu}_k)^T (\theta_{mn} - \hat{\mu}_k) \tag{4} \]

where

\[ \gamma_k(\theta_{mn}; \hat{\Phi}) = \frac{\hat{\lambda}_k f_k(\theta_{mn}; \hat{\theta}_k) f(\theta_{mn}; \Phi)}{f(\theta_{mn}; \hat{\Phi})} \tag{5} \]

The equations above require the number of components \(K\), which is unknown. To address this, we deliberately overestimate \(K\), estimate the mixture, and then repeatedly merge nearby components. Each merge replaces two normal components with a new normal component chosen to minimize the distance to the merged pair, subject to a threshold. In effect, we overfit the posterior first and then prune the mixture until no further merging is warranted. Different distance measures lead to different merging strategies; see [2, 3] for discussion.

Now it becomes a trivial task to generate samples from the posterior. Instead of drawing the samples from \(p(\theta)\), we draw from the mixture. The algorithm is outlined below:

\(u \sim U[0, 1]\)
choose \(k\)th component if \(\sum_{i=0}^{k-1} \lambda_i \leq u < \sum_{i=0}^{k} \lambda_i\)
generate a multivariate random number from \(f_k(\mu_k, \Sigma_k)\)

Essentially, we have \(K\) number of states and each state has a certain pdf. Based on this relationship, we can draw random numbers quite efficiently.

References

[1] Soma S. Dhavala, “EM algorithm for Gaussian mixture modeling,” report, ECpE Dept., Iowa State University.

[2] Mark Coates, Time-frequency modeling, Ph. D, Univ. of Cambridge, Signal Processing and Communications Laboratory, Dept. of Engineering, August 1998.

[3] M. Figueiredo, J. Leitao and A. K. Jain, “On fitting mixture models,” in Energy Minimization Methods in Computer Vision and Pattern Recognition, vol. 1654, pp. 54–69. Springer-Verlag, 1999.