# Machine learning | COMP3670

EM for PCA has the following advantages over eigenvector methods:

• EM can be faster. In particular, assuming $N, D \gg L$, the dominant cost of EM is the projection operation in the E step, so the overall time is $O(T L N D)$, where $T$ is the number of iterations. [Row97] showed experimentally that the number of iterations is usually very small (the mean was 3.6), regardless of $N$ or $D$. (This result depends on the ratio of eigenvalues of the empirical covariance matrix.) This is much faster than the $O\left(\min \left(N D^2, D N^2\right)\right)$ time required by straightforward eigenvector methods, although more sophisticated eigenvector methods, such as the Lanczos algorithm, have running times comparable to EM.
• EM can be implemented in an online fashion, i.e., we can update our estimate of $\mathbf{W}$ as the data streams in.
• EM can handle missing data in a simple way (see e.g., [IR10; DJ15]).
• EM can be extended to handle mixtures of PPCA/FA models (see Section 20.2.6).
• EM can be modified to variational EM or to variational Bayes EM to fit more complex models (see e.g., Section 20.2.7).
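
The cost argument in the first bullet can be made concrete with a minimal NumPy sketch of EM for PCA in the zero-noise limit. The updates used here (E step $\mathbf{Z}=(\mathbf{W}^{\top}\mathbf{W})^{-1}\mathbf{W}^{\top}\mathbf{X}$, M step $\mathbf{W}=\mathbf{X}\mathbf{Z}^{\top}(\mathbf{Z}\mathbf{Z}^{\top})^{-1}$) are the standard ones and are not spelled out in this section. The function name `em_pca` and its arguments `n_iters` and `seed` are illustrative assumptions.

```python
import numpy as np

def em_pca(X, L, n_iters=20, seed=0):
    """EM for PCA in the zero-noise limit (illustrative sketch).

    X: (N, D) data matrix, one observation per row.
    Returns a (D, L) matrix with orthonormal columns spanning the
    estimated principal subspace (not the individual eigenvectors).
    """
    rng = np.random.default_rng(seed)
    Xc = (X - X.mean(axis=0)).T            # centered data, shape (D, N)
    D = Xc.shape[0]
    W = rng.standard_normal((D, L))        # random initial loading matrix
    for _ in range(n_iters):
        # E step: project the data onto the current subspace.
        # Z = (W^T W)^{-1} W^T X costs O(L N D) per iteration.
        Z = np.linalg.solve(W.T @ W, W.T @ Xc)       # (L, N)
        # M step: W = X Z^T (Z Z^T)^{-1}, also O(L N D).
        W = np.linalg.solve(Z @ Z.T, Z @ Xc.T).T     # (D, L)
    # Orthonormalize the basis so the result is easy to compare with SVD.
    Q, _ = np.linalg.qr(W)
    return Q
```

Each iteration only touches the data through two $L \times N$ by $N \times D$ style products, which is where the $O(T L N D)$ total comes from. The returned basis matches the top-$L$ eigenvectors only up to a rotation within the subspace.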

## Mixtures of factor analysers

The factor analysis model (Section 20.2) assumes the observed data can be modeled as arising from a linear mapping from a low-dimensional set of Gaussian factors. One way to relax this assumption is to assume the model is only locally linear, so the overall model becomes a (weighted) combination of FA models; this is called a mixture of factor analysers. The overall model for the data is a mixture of linear manifolds, which can be used to approximate an overall curved manifold.

More precisely, let $c_n \in\{1, \ldots, K\}$ be a latent indicator specifying which subspace (cluster) we should use to generate the data. If $c_n=k$, we sample $\boldsymbol{z}_n$ from a Gaussian prior, pass it through the matrix $\mathbf{W}_k$, and add noise, where $\mathbf{W}_k$ maps from the $L$-dimensional subspace to the $D$-dimensional visible space. The model is as follows:
$$
\begin{aligned}
p\left(\boldsymbol{x}_n \mid \boldsymbol{z}_n, c_n=k, \boldsymbol{\theta}\right) &=\mathcal{N}\left(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k+\mathbf{W}_k \boldsymbol{z}_n, \boldsymbol{\Psi}_k\right) \\
p\left(\boldsymbol{z}_n \mid \boldsymbol{\theta}\right) &=\mathcal{N}\left(\boldsymbol{z}_n \mid \mathbf{0}, \mathbf{I}\right) \\
p\left(c_n \mid \boldsymbol{\theta}\right) &=\operatorname{Cat}\left(c_n \mid \boldsymbol{\pi}\right)
\end{aligned}
$$
This is called a mixture of factor analysers (MFA) [GH96]. The corresponding distribution in the visible space is given by
$$p(\boldsymbol{x} \mid \boldsymbol{\theta})=\sum_k p(c=k) \int d \boldsymbol{z}\, p(\boldsymbol{z} \mid c=k)\, p(\boldsymbol{x} \mid \boldsymbol{z}, c=k)=\sum_k \pi_k \int d \boldsymbol{z}\, \mathcal{N}\left(\boldsymbol{z} \mid \mathbf{0}, \mathbf{I}\right) \mathcal{N}\left(\boldsymbol{x} \mid \boldsymbol{\mu}_k+\mathbf{W}_k \boldsymbol{z}, \boldsymbol{\Psi}_k\right)=\sum_k \pi_k\, \mathcal{N}\left(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \mathbf{W}_k \mathbf{W}_k^{\top}+\boldsymbol{\Psi}_k\right)$$
In the special case that $\boldsymbol{\Psi}_k=\sigma^2 \mathbf{I}$, we get a mixture of PPCA models (although it is difficult to ensure orthogonality of the $\mathbf{W}_k$ in this case). See Figure 20.12 for an example of the method applied to some 2d data.
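
The generative process defined above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not a reference implementation: each $\boldsymbol{\Psi}_k$ is taken to be diagonal and stored as a vector of variances, the function names `sample_mfa` and `mfa_log_density` and the array layouts are hypothetical, and the density uses the standard linear-Gaussian result that integrating out $\boldsymbol{z}$ gives component covariance $\mathbf{W}_k \mathbf{W}_k^{\top}+\boldsymbol{\Psi}_k$.

```python
import numpy as np

def sample_mfa(pi, mus, Ws, Psis, N, seed=0):
    """Draw N samples from a mixture of factor analysers.

    pi: (K,) mixing weights; mus: (K, D) means; Ws: (K, D, L) loadings;
    Psis: (K, D) diagonal noise variances (assumed diagonal here).
    """
    rng = np.random.default_rng(seed)
    K, D, L = Ws.shape
    c = rng.choice(K, size=N, p=pi)                         # c_n ~ Cat(pi)
    z = rng.standard_normal((N, L))                         # z_n ~ N(0, I)
    eps = rng.standard_normal((N, D)) * np.sqrt(Psis[c])    # noise ~ N(0, Psi_k)
    x = mus[c] + np.einsum("ndl,nl->nd", Ws[c], z) + eps    # mu_k + W_k z_n + noise
    return x, c

def mfa_log_density(X, pi, mus, Ws, Psis):
    """log p(x | theta) = log sum_k pi_k N(x | mu_k, W_k W_k^T + Psi_k)."""
    N, D = X.shape
    K = len(pi)
    log_terms = np.empty((N, K))
    for k in range(K):
        C = Ws[k] @ Ws[k].T + np.diag(Psis[k])              # marginal covariance
        diff = X - mus[k]
        _, logdet = np.linalg.slogdet(C)
        maha = np.sum(diff * np.linalg.solve(C, diff.T).T, axis=1)
        log_terms[:, k] = np.log(pi[k]) - 0.5 * (D * np.log(2 * np.pi) + logdet + maha)
    # Log-sum-exp over components for numerical stability.
    m = log_terms.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_terms - m).sum(axis=1, keepdims=True))).ravel()
```

Forming the full $D \times D$ covariance is only sensible for small $D$; for large $D$ one would typically exploit the low-rank-plus-diagonal structure instead.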

We can think of this as a low-rank version of a mixture of Gaussians. In particular, this model needs $O(K L D)$ parameters instead of the $O\left(K D^2\right)$ parameters needed for a mixture of full covariance Gaussians. This can reduce overfitting.
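
To make the parameter comparison concrete, here is a small counting sketch. It assumes diagonal $\boldsymbol{\Psi}_k$ and ignores the rotational redundancy in each $\mathbf{W}_k$, so the MFA figure is a slight overcount; the function names and the example sizes are illustrative, not taken from the text.

```python
def mfa_param_count(K, D, L):
    # Per component: W_k (D*L), diagonal Psi_k (D), mean mu_k (D).
    # Plus K-1 free mixing weights.
    return K * (D * L + 2 * D) + (K - 1)

def full_gmm_param_count(K, D):
    # Per component: symmetric covariance (D(D+1)/2), mean (D).
    # Plus K-1 free mixing weights.
    return K * (D * (D + 1) // 2 + D) + (K - 1)

# Hypothetical sizes: K=10 components, D=784 dimensions, L=10 factors.
print(mfa_param_count(10, 784, 10))     # 94089
print(full_gmm_param_count(10, 784))    # 3085049
```

For these sizes the full-covariance mixture needs roughly 30 times as many parameters, which is the gap between $O(K L D)$ and $O(K D^2)$ when $L \ll D$.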


