EM Algorithm for Mixture Modeling

Deriving Expectation-Maximization for Gaussian mixture models

Author

Soma S. Dhavala

EM Algorithm for Mixture Modeling

This note derives the Expectation-Maximization (EM) algorithm for Gaussian mixture models (GMMs). A broad class of random spaces can be approximated by a weighted linear combination of normal densities. Such a density can be written as:

\[ f(\phi; y) = \sum_{k=1}^{K} \lambda_k f_k(\theta_k; y) \tag{1} \]

where

\(K\) is the total number of components
\(y\) is the data vector
\(\phi = [\lambda_1 \ldots \lambda_K, \theta_1 \ldots \theta_K]\) and \(\theta_k = [\mu_k, \Sigma_k]\) represent the mean vector and covariance matrix of the \(k\)th density
\(f_k \sim N(\mu_k, \Sigma_k)\)
\(\{\lambda_k\}\) represent the mixing weights with constraint \(\sum_{k=1}^{K} \lambda_k = 1\)

From this point onward, vector and matrix notation is simplified for readability. To derive the EM algorithm, we first identify the missing information. This choice matters because different definitions of missing information can lead to different convergence behavior.

In mixture modeling, the missing information is the component label. If the label of each observation were known, meaning we knew which density produced each observation, estimating the mixture parameters would be straightforward: group observations by component and estimate each component separately. We therefore choose the labels as the missing information.

Once we identify the missing information, we need to define the complete data likelihood and marginal unobserved data density conditional on the observed data. A very good explanation of the EM algorithm is given in [1, pp. 66-87].

Let \(u(n) \in \{1 \ldots K\}\) be the missing label corresponding to the \(n\)th observation \(y(n)\). The marginal density (in fact pmf) of this unobserved data is then given as:

\[ p(u(n) = k) = \lambda_k \tag{2} \]

The density of the observed data given the unobserved data is:

\[ p(y(n) \mid u(n)) = f_{u(n)}(y(n); \theta_{u(n)}) \tag{3} \]

Then we obtain the complete data density as:

\[ p(y(n), u(n); \phi) = f_{u(n)}(y(n); \theta_{u(n)}) \lambda_{u(n)} \tag{4} \]

The complete data density of \(N\) independent observations is then:

\[ p(y, u; \phi) = \prod_{n=1}^{N} p(y(n), u(n); \phi) \tag{5} \]

Marginal unobserved data density given the observed data is:

\[ p(u \mid y; \phi) = p(y, u; \phi) / p(y; \phi) \tag{6} \]

where \(p(y; \phi)\) is the incomplete or observed data density (likelihood function of the observed data).

The EM algorithm proceeds as follows:

E-Step: Given an initial estimate of the parameter \(\hat{\phi}\), we calculate the expectation of the complete log-likelihood. The expectation is taken over the marginal density of the unobserved data given the observed data. We use \(E_{u \mid y}\) to emphasize the distribution over which the expectation is taken.
M-step: We maximize this function w.r.t \(\phi\) to obtain an improved (hopefully) estimate of \(\phi\)

E-Step

Since we have independent observations

\[ E_{u \mid y}\left[\ln p(u, y; \phi)\right] = \sum_{n=1}^{N} E_{u(n) \mid y(n)}\left[\ln p(u(n), y(n); \theta_{u(n)})\right] \tag{7} \]

Upon expanding the expectation operator and letting \(u(n) = k\),

\[ E_{k \mid y(n)}\left[\ln p(k, y(n); \theta_k)\right] = \sum_{k=1}^{K} \left[\ln p(k, y(n); \theta_k)\right] \frac{p(y(n), k; \hat{\theta}_k)}{p(y(n); \hat{\phi})} \tag{8} \]

The dependence of \(u(n)\) on \(k\) becomes obvious if we recollect that \(u(n) \in \{1 \ldots K\}\). Also, the denominator is only a function of the data and hence can be ignored while performing the M-step. Hence, we restate the above equation as:

\[ E_{k \mid y(n)}\left[\ln p(k, y(n); \theta_k)\right] \propto \sum_{k=1}^{K} \left[\ln p(k, y(n); \theta_k)\right] p(y(n), k; \hat{\theta}_k) \tag{9} \]

Then,

\[ E_{u \mid y}\left[\ln p(u, y; \phi)\right] \propto \sum_{n=1}^{N} \sum_{k=1}^{K} \hat{\lambda}_k f_k(y(n); \hat{\theta}_k) \left[\ln \lambda_k - \frac{1}{2} \exp\left[(y(n) - \mu_k)^T \Sigma_k^{-1} (y(n) - \mu_k)\right] - \ln |\Sigma_k|\right] \tag{10} \]

The above function has to be maximized subject to the constraint \(\sum_{k=1}^{K} \lambda_k = 1\) and define it as \(Q(\phi; \hat{\phi})\).

M-Step

\[ \hat{\phi}_{new} = \arg\max_{\phi} Q(\phi; \hat{\phi}) \tag{11} \]

Let \(\varphi\) be the Lagrange multiplier. The function to be maximized is then:

\[ U(\phi; \hat{\phi}) = Q(\phi; \hat{\phi}) - \varphi \left(\sum_{k=1}^{K} \lambda_k - 1\right) \tag{12} \]

To obtain \(\hat{\lambda}_{k,new}\), we set the derivative of \(U(\phi; \hat{\phi})\) w.r.t \(\lambda_k\) to zero and solve for the roots and similarly for \(\mu_k\) and \(\Sigma_k\). After some straightforward algebraic manipulations, we obtain the expressions for the estimates as [2, Chapter 5]:

\[ \hat{\lambda}_{k,new} = \frac{1}{N} \sum_{n=1}^{N} \frac{\hat{\lambda}_k f_k(y(n); \hat{\theta}_k)}{f(y(n); \hat{\phi})} \tag{13} \]

\[ \hat{\mu}_{k,new} = \frac{1}{N \hat{\lambda}_{k,new}} \sum_{n=1}^{N} \frac{\hat{\lambda}_k f_k(y(n); \hat{\theta}_k)}{f(y(n); \hat{\phi})} y(n) \tag{14} \]

\[ \hat{\Sigma}_{k,new} = \frac{1}{N \hat{\lambda}_{k,new}} \sum_{n=1}^{N} \frac{\hat{\lambda}_k f_k(y(n); \hat{\theta}_k)}{f(y(n); \hat{\phi})} \left[(y(n) - \hat{\mu}_k)^T (y(n) - \hat{\mu}_k)\right] \tag{15} \]

References

[1] Aleksander Dogandžić, “Detection and estimation theory,” Spring, 2005, class notes #3, ECpE Dept., Iowa State University, url: home.eng.iastate.edu/~ald/EE527.html.

[2] Soma S. Dhavala, Time-frequency representations: Analysis, Synthesis and Implementation, M.S. thesis, Indian Institute of Technology, Madras, March 2000.