# Optimization for Machine Learning (STAT991)

## MLP and its derivative

The basic MLP $a \mapsto h_{W, u}(a)$ takes as input a feature vector $a \in \mathbb{R}^p$, computes an intermediate hidden representation $b=W a \in \mathbb{R}^q$ using $q$ "neurons" stored as the rows $w_k \in \mathbb{R}^p$ of the weight matrix $W \in \mathbb{R}^{q \times p}$, passes these through a non-linearity $\rho: \mathbb{R} \rightarrow \mathbb{R}$ applied entrywise, i.e. $\rho(b)=\left(\rho\left(b_k\right)\right)_{k=1}^q$, and then outputs a scalar value as a linear combination with output weights $u \in \mathbb{R}^q$, i.e.
$$h_{W, u}(a)=\langle\rho(W a), u\rangle=\sum_{k=1}^q u_k \rho\left((W a)_k\right)=\sum_{k=1}^q u_k \rho\left(\left\langle a, w_k\right\rangle\right).$$
This function $h_{W, u}(\cdot)$ is thus a weighted sum of $q$ “ridge functions” $\rho\left(\left\langle\cdot, w_k\right\rangle\right)$. These functions are constant in the direction orthogonal to the neuron $w_k$ and have a profile defined by $\rho$.
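The forward pass above can be sketched in a few lines. This is an illustrative implementation (the names `mlp_forward` and `rho` are not from the notes; ReLU is used as the non-linearity):

```python
import numpy as np

def rho(r):
    # ReLU non-linearity rho(r) = max(r, 0), applied entrywise
    return np.maximum(r, 0.0)

def mlp_forward(a, W, u):
    # a: features in R^p, W: neurons (rows) in R^{q x p}, u: output weights in R^q
    b = W @ a            # hidden pre-activations b = W a in R^q
    return rho(b) @ u    # scalar output <rho(W a), u>

# Example with q = 3 neurons and p = 2 features
p, q = 2, 3
rng = np.random.default_rng(0)
W = rng.standard_normal((q, p))
u = rng.standard_normal(q)
a = rng.standard_normal(p)
val = mlp_forward(a, W, u)
# Same value as the explicit sum over ridge functions sum_k u_k rho(<a, w_k>)
val_sum = sum(u[k] * rho(W[k] @ a) for k in range(q))
```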
The most popular non-linearities are sigmoid functions such as
$$\rho(r)=\frac{e^r}{1+e^r} \quad \text { and } \quad \rho(r)=\frac{1}{\pi} \operatorname{atan}(r)+\frac{1}{2}$$
and the rectified linear unit (ReLU) function $\rho(r)=\max (r, 0)$.
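The three non-linearities just listed can be written down directly (a small sketch; function names are illustrative):

```python
import numpy as np

def sigmoid(r):
    # rho(r) = e^r / (1 + e^r), values in (0, 1)
    return np.exp(r) / (1.0 + np.exp(r))

def atan_sigmoid(r):
    # rho(r) = atan(r)/pi + 1/2, also valued in (0, 1)
    return np.arctan(r) / np.pi + 0.5

def relu(r):
    # rho(r) = max(r, 0)
    return np.maximum(r, 0.0)
```

Both sigmoid variants pass through $1/2$ at $r=0$ and saturate at $0$ and $1$, while ReLU is piecewise linear and unbounded above.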
One often adds a bias term in these models, and considers functions of the form $\rho\left(\left\langle\cdot, w_k\right\rangle+z_k\right)$, but this bias term can be absorbed into the weights as usual by writing $\left\langle a, w_k\right\rangle+z_k=\left\langle(a, 1),\left(w_k, z_k\right)\right\rangle$, so we ignore it in the following section. This simply amounts to replacing $a \in \mathbb{R}^p$ by $(a, 1) \in \mathbb{R}^{p+1}$ and adding a dimension $p \mapsto p+1$, as a pre-processing of the features.
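The bias-absorption trick amounts to one line of feature pre-processing, checked numerically below (a sketch; the helper name `augment` is an assumption):

```python
import numpy as np

def augment(a):
    # replace a in R^p by (a, 1) in R^{p+1}
    return np.concatenate([a, [1.0]])

rng = np.random.default_rng(1)
a = rng.standard_normal(4)
w = rng.standard_normal(4)
z = 0.7
lhs = a @ w + z                                  # <a, w> + z
rhs = augment(a) @ np.concatenate([w, [z]])      # <(a,1), (w,z)>
```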

## MLP and Gradient Computation

Given pairs of features and data values $\left(a_i, y_i\right)_{i=1}^n$, and as usual storing the features in the rows of $A \in \mathbb{R}^{n \times p}$, we consider the following least squares regression problem (similar computations can be done for classification losses)
$$\min_{x=(W, u)} f(W, u) \stackrel{\text { def }}{=} \frac{1}{2} \sum_{i=1}^n\left(h_{W, u}\left(a_i\right)-y_i\right)^2=\frac{1}{2}\left\|\rho\left(A W^{\top}\right) u-y\right\|^2 .$$
Note that here, the parameters being optimized are $(W, u) \in \mathbb{R}^{q \times p} \times \mathbb{R}^q$.
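The matrix form of the objective agrees with the per-sample sum, which we can verify numerically (a sketch; ReLU is used for $\rho$, and the name `loss` is illustrative):

```python
import numpy as np

def rho(r):
    return np.maximum(r, 0.0)

def loss(W, u, A, y):
    # f(W, u) = 1/2 || rho(A W^T) u - y ||^2
    return 0.5 * np.sum((rho(A @ W.T) @ u - y) ** 2)

rng = np.random.default_rng(2)
n, p, q = 5, 3, 4
A = rng.standard_normal((n, p))
y = rng.standard_normal(n)
W = rng.standard_normal((q, p))
u = rng.standard_normal(q)
f_matrix = loss(W, u, A, y)
# per-sample form: 1/2 sum_i (h_{W,u}(a_i) - y_i)^2
f_sum = 0.5 * sum((rho(W @ A[i]) @ u - y[i]) ** 2 for i in range(n))
```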
Optimizing with respect to $u$. This function $f$ is convex with respect to $u$, since it is a quadratic function. Its gradient with respect to $u$ can be computed as in (8) and thus
$$\nabla_u f(W, u)=\rho\left(A W^{\top}\right)^{\top}\left(\rho\left(A W^{\top}\right) u-y\right)$$
and one can compute the solution in closed form (assuming $\operatorname{ker}\left(\rho\left(A W^{\top}\right)\right)=\{0\}$) as
$$u^{\star}=\left[\rho\left(A W^{\top}\right)^{\top} \rho\left(A W^{\top}\right)\right]^{-1} \rho\left(A W^{\top}\right)^{\top} y=\left[\rho\left(W A^{\top}\right) \rho\left(A W^{\top}\right)\right]^{-1} \rho\left(W A^{\top}\right) y$$
When $W=\mathrm{Id}_p$ and $\rho(s)=s$ one recovers the least square formula $(9)$.
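The closed-form $u^\star$ is just the least squares solution with design matrix $\Phi = \rho(A W^\top)$, which can be checked against a generic solver (a sketch; the sigmoid is used for $\rho$ so that $\Phi$ generically has trivial kernel):

```python
import numpy as np

def rho(r):
    return np.exp(r) / (1.0 + np.exp(r))   # sigmoid

rng = np.random.default_rng(3)
n, p, q = 20, 3, 4
A = rng.standard_normal((n, p))
y = rng.standard_normal(n)
W = rng.standard_normal((q, p))
Phi = rho(A @ W.T)                          # n x q design matrix rho(A W^T)
# u* = [Phi^T Phi]^{-1} Phi^T y via the normal equations
u_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
# reference: generic least squares solver
u_lstsq = np.linalg.lstsq(Phi, y, rcond=None)[0]
```

Note that `np.linalg.solve` on the normal equations squares the conditioning of $\Phi$; `lstsq` (QR/SVD-based) is preferable in practice.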
Optimizing with respect to $W$. The function $f$ is non-convex with respect to $W$ because the function $\rho$ is itself non-linear. Training an MLP is thus a delicate process, and one can only hope to obtain a local minimum of $f$. It is also important to correctly initialize the neurons $\left(w_k\right)_k$ (for instance as unit-norm random vectors, although bias terms might need some adjustment), while $u$ can usually be initialized at 0.
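Although the notes stop here, the gradient with respect to $W$ follows from the chain rule: writing $B = A W^\top$, $H = \rho(B)$ and $r = H u - y$, one gets $\nabla_W f = \left((r u^\top) \odot \rho'(B)\right)^\top A$, where $\odot$ is the entrywise product. The sketch below (names illustrative, sigmoid non-linearity assumed) checks this formula against a finite difference:

```python
import numpy as np

def rho(r):
    return np.exp(r) / (1.0 + np.exp(r))   # sigmoid

def rho_prime(r):
    s = rho(r)
    return s * (1.0 - s)                   # derivative of the sigmoid

def f(W, u, A, y):
    return 0.5 * np.sum((rho(A @ W.T) @ u - y) ** 2)

def grad_W(W, u, A, y):
    B = A @ W.T                            # n x q pre-activations
    r = rho(B) @ u - y                     # residual, in R^n
    G = np.outer(r, u) * rho_prime(B)      # (r u^T) * rho'(B), n x q
    return G.T @ A                         # q x p gradient

rng = np.random.default_rng(4)
n, p, q = 6, 3, 2
A = rng.standard_normal((n, p))
y = rng.standard_normal(n)
W = rng.standard_normal((q, p))
u = rng.standard_normal(q)
G = grad_W(W, u, A, y)
# central finite-difference check of the (0, 0) entry
eps = 1e-6
E = np.zeros_like(W)
E[0, 0] = eps
fd = (f(W + E, u, A, y) - f(W - E, u, A, y)) / (2 * eps)
```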
