Even though the ADF's sequential approach is better than independently approximating each factor, it depends on the ordering of the factors. If the first factors lead to a bad approximation, the ADF produces a poor final estimate of the posterior. We could mitigate this issue by revising the initial approximations later on, effectively cycling through all factors, at the expense of losing the online character of the method.
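
The order dependence can be seen on a small example. The sketch below is an illustration under assumed settings, not a reference implementation: a one-dimensional latent $z$ with prior $\mathcal{N}(0, 100)$, one bimodal factor and one Gaussian factor (both hypothetical), and moment matching carried out by quadrature on a uniform grid. The names `adf`, `moment_match`, `f_bimodal`, and `f_gauss` are chosen here for illustration only.

```python
import numpy as np

# Uniform grid used for 1-D quadrature (assumed wide enough for this toy problem).
z = np.linspace(-60.0, 60.0, 24001)

def gauss(x, mean, var):
    """Gaussian density N(x | mean, var)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def moment_match(unnorm):
    """Mean and variance of an unnormalized density tabulated on the grid z."""
    w = unnorm / unnorm.sum()
    mean = (w * z).sum()
    var = (w * (z - mean) ** 2).sum()
    return mean, var

def adf(factors, prior_mean=0.0, prior_var=100.0):
    """One ADF pass: absorb each factor in turn, then project onto a Gaussian."""
    mean, var = prior_mean, prior_var
    for f in factors:
        mean, var = moment_match(gauss(z, mean, var) * f(z))
    return mean, var

def f_bimodal(t):
    # Hypothetical factor with two well-separated modes.
    return 0.5 * gauss(t, -3.0, 1.0) + 0.5 * gauss(t, 3.0, 1.0)

def f_gauss(t):
    # Hypothetical Gaussian factor centered on one of the modes.
    return gauss(t, 3.0, 1.0)

# Same factors, two different orderings.
mean_ab, var_ab = adf([f_bimodal, f_gauss])
mean_ba, var_ba = adf([f_gauss, f_bimodal])

# Moments of the exact posterior, for comparison.
exact_mean, exact_var = moment_match(gauss(z, 0.0, 100.0) * f_bimodal(z) * f_gauss(z))

print("bimodal first :", mean_ab, var_ab)
print("Gaussian first:", mean_ba, var_ba)
print("exact moments :", exact_mean, exact_var)
```

In this toy setting, absorbing the Gaussian factor first keeps the first step exact, so the final projection is applied to the exact posterior. Absorbing the bimodal factor first projects it against the broad prior alone, and the final estimate ends up with a shifted mean and a larger variance.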

As with other factorized approximations, the variance of the approximating distribution is affected by both the independence assumption needed for the factorization of the distribution and the mass-spreading property of the forward KL. However, unlike methods that minimize the reverse KL, which tend to underestimate the marginal variance, the ADF overestimates it, giving larger uncertainty estimates than the true posterior would. One should take this variance overestimation into account when choosing among the different variational methods to solve a given problem.

The EP reinterprets the ADF as approximating each new true factor $f_i$ with $\tilde{f}_i$ such that
$$q^{(i)}(\mathbf{z}) \propto q^{(i-1)}(\mathbf{z}) \tilde{f}_i .$$
The approximate factor $\widetilde{f}_i$ can be easily obtained at the end of the $i$th ADF iteration by
$$\widetilde{f}_i(\mathbf{z}) \propto \frac{q^{(i)}(\mathbf{z})}{q^{(i-1)}(\mathbf{z})} .$$

This shift in view means that $q$ can be seen as a product of the approximate factors $\widetilde{f}_i$, such that
$$q(\mathbf{z}) \propto \frac{q^{(N)}(\mathbf{z})}{q^{(N-1)}(\mathbf{z})} \cdots \frac{q^{(1)}(\mathbf{z})}{q^{(0)}(\mathbf{z})}=\prod_{i=1}^N \widetilde{f}_i(\mathbf{z}),$$
where $q^{(0)}(\mathbf{z})=p_0(\mathbf{z})$ is the prior distribution.
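
For a Gaussian family, the ratio $q^{(i)} / q^{(i-1)}$ takes a simple form in natural parameters (precision and precision times mean): dividing two Gaussians subtracts their natural parameters. The following minimal sketch assumes one-dimensional Gaussian iterates; the numerical values of the iterates are made up for illustration, and `to_natural` is a hypothetical helper.

```python
import numpy as np

def to_natural(mean, var):
    """Natural parameters (precision, precision * mean) of a 1-D Gaussian."""
    return np.array([1.0 / var, mean / var])

# Hypothetical ADF iterates q^(0), q^(1), q^(2), given as (mean, variance).
iterates = [(0.0, 100.0), (0.0, 9.8), (2.7, 0.9)]
nat = np.array([to_natural(m, v) for m, v in iterates])

# Site i is the ratio q^(i) / q^(i-1): a difference of natural parameters.
# A site precision may come out negative, so a site need not be a proper density.
sites = np.diff(nat, axis=0)

# Multiplying the prior by all sites (adding natural parameters) recovers q^(N).
recon = nat[0] + sites.sum(axis=0)
print(sites)
print(recon, nat[-1])
```

Storing the sites in this form is what later allows a single factor to be divided out of $q$ and refined.
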
In ADF, the initial factors have little context: few or no other factors have been seen, so they are prone to poor approximation. Later factors, on the other hand, have more context and can be approximated better. The EP handles this issue by observing the entire context when approximating $f_i$ with $\widetilde{f}_i$. Since it keeps track of each $f_i$ and the corresponding $\widetilde{f}_i$ at every iteration, it is possible to compute
$$q_{\text{new}}(\mathbf{z})=\underset{q \in Q}{\operatorname{argmin}}\, D_{KL}\left(\frac{1}{K_i} f_i(\mathbf{z}) \frac{q(\mathbf{z})}{\widetilde{f}_i(\mathbf{z})} \,\Big\|\, q(\mathbf{z})\right),$$
where $K_i$ is the normalizing constant. Note that now, at any given iteration $j$, $q$ is no longer the product of the factors $\widetilde{f}_k$ with $1 \leq k \leq j$, but of all $N$ factors. That is why we have dropped the superscript in $q$.
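
The update can be sketched in code for a one-dimensional Gaussian family, where minimizing the forward KL reduces to moment matching. This is a minimal illustration under assumptions, not a definitive implementation: the prior $\mathcal{N}(0, 100)$, the two factors, the grid quadrature, the fixed number of sweeps, and the rule of skipping an update when the cavity precision is not positive are all choices made here. The names `ep`, `moments`, `f_bimodal`, and `f_gauss` are hypothetical.

```python
import numpy as np

# Uniform grid used for 1-D quadrature (assumed wide enough for this toy problem).
z = np.linspace(-60.0, 60.0, 24001)

def gauss(x, mean, var):
    """Gaussian density N(x | mean, var)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def moments(unnorm):
    """Mean and variance of an unnormalized density tabulated on the grid z."""
    w = unnorm / unnorm.sum()
    mean = (w * z).sum()
    return mean, (w * (z - mean) ** 2).sum()

def ep(factors, prior_mean=0.0, prior_var=100.0, sweeps=5):
    """EP with Gaussian sites stored as natural parameters (precision, precision * mean)."""
    prior = np.array([1.0 / prior_var, prior_mean / prior_var])
    sites = np.zeros((len(factors), 2))      # every approximate factor starts as the constant 1
    q = prior + sites.sum(axis=0)            # q is the prior times all approximate factors
    for _ in range(sweeps):
        for i, f in enumerate(factors):
            cavity = q - sites[i]            # natural parameters of q / f~_i
            if cavity[0] <= 0.0:             # improper cavity: skip this update (assumed rule)
                continue
            c_var = 1.0 / cavity[0]
            c_mean = cavity[1] * c_var
            # Tilted distribution f_i * q / f~_i, projected onto a Gaussian by moment matching.
            mean, var = moments(gauss(z, c_mean, c_var) * f(z))
            q = np.array([1.0 / var, mean / var])
            sites[i] = q - cavity            # refreshed approximate factor f~_i
    return q[1] / q[0], 1.0 / q[0]           # mean and variance of q

def f_bimodal(t):
    # Hypothetical factor with two well-separated modes.
    return 0.5 * gauss(t, -3.0, 1.0) + 0.5 * gauss(t, 3.0, 1.0)

def f_gauss(t):
    # Hypothetical Gaussian factor centered on one of the modes.
    return gauss(t, 3.0, 1.0)

ep_ab = ep([f_bimodal, f_gauss])
ep_ba = ep([f_gauss, f_bimodal])
exact = moments(gauss(z, 0.0, 100.0) * f_bimodal(z) * f_gauss(z))

print("EP, bimodal first :", ep_ab)
print("EP, Gaussian first:", ep_ba)
print("exact moments     :", exact)
```

In this toy problem both orderings settle on the same Gaussian, whose moments agree with those of the exact posterior, because each factor is refined in the context of the other. In general, EP is not guaranteed to converge, and its fixed point need not match the exact posterior moments.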

