# Machine learning | COMP3670

## Residual connections

One solution to the vanishing gradient problem for DNNs is to use a residual network or ResNet [He+16a]. This is a feedforward model in which each layer has the form of a residual block, defined by
$$\mathcal{F}_l^{\prime}(\boldsymbol{x})=\mathcal{F}_l(\boldsymbol{x})+\boldsymbol{x}$$
where $\mathcal{F}_l$ is a standard shallow nonlinear mapping (e.g., linear-activation-linear). The inner $\mathcal{F}_l$ function computes the residual term or delta that needs to be added to the input $\boldsymbol{x}$ to generate the desired output; it is often easier to learn to generate a small perturbation to the input than to directly predict the output. (Residual connections are usually used in conjunction with CNNs, as discussed in Section 14.3.4, but can also be used in MLPs.)
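
As a minimal sketch of the residual block above, the following NumPy code takes $\mathcal{F}_l$ to be linear-ReLU-linear. The choice of ReLU, the function name `residual_block`, and the parameter names `W1`, `b1`, `W2`, `b2` are assumptions made for illustration. With the output weights set to zero the residual term vanishes, so the block reduces to the identity map.

```python
import numpy as np

def relu(a):
    """Elementwise ReLU activation (an assumed choice of nonlinearity)."""
    return np.maximum(a, 0.0)

def residual_block(x, W1, b1, W2, b2):
    """Compute F'(x) = F(x) + x, where F is linear -> ReLU -> linear."""
    hidden = relu(W1 @ x + b1)   # first linear map plus activation
    delta = W2 @ hidden + b2     # residual term F(x)
    return delta + x             # skip connection adds the input back

rng = np.random.default_rng(0)
d, n_hidden = 4, 8
x = rng.normal(size=d)
W1 = rng.normal(scale=0.1, size=(n_hidden, d))
b1 = np.zeros(n_hidden)
W2 = np.zeros((d, n_hidden))     # zero output weights, so F(x) = 0
b2 = np.zeros(d)

y = residual_block(x, W1, b1, W2, b2)
print(np.allclose(y, x))
```

Because the input and output are added, the residual term must have the same dimension as $\boldsymbol{x}$; here both are `d`.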

A model with residual connections has the same number of parameters as a model without residual connections, but it is easier to train. The reason is that gradients can flow directly from the output to earlier layers, as sketched in Figure 13.15b. To see this, note that the activations at the output layer can be derived in terms of any previous layer $l$ using
$$\boldsymbol{z}_L=\boldsymbol{z}_l+\sum_{i=l}^{L-1} \mathcal{F}_i\left(\boldsymbol{z}_i ; \boldsymbol{\theta}_i\right) .$$
We can therefore compute the gradient of the loss wrt the parameters of the $l$'th layer as follows:
$$\begin{aligned}
\frac{\partial \mathcal{L}}{\partial \boldsymbol{\theta}_l} &=\frac{\partial \boldsymbol{z}_l}{\partial \boldsymbol{\theta}_l} \frac{\partial \mathcal{L}}{\partial \boldsymbol{z}_l} \\
&=\frac{\partial \boldsymbol{z}_l}{\partial \boldsymbol{\theta}_l} \frac{\partial \mathcal{L}}{\partial \boldsymbol{z}_L} \frac{\partial \boldsymbol{z}_L}{\partial \boldsymbol{z}_l} \\
&=\frac{\partial \boldsymbol{z}_l}{\partial \boldsymbol{\theta}_l} \frac{\partial \mathcal{L}}{\partial \boldsymbol{z}_L}\left(1+\sum_{i=l}^{L-1} \frac{\partial \mathcal{F}_i\left(\boldsymbol{z}_i ; \boldsymbol{\theta}_i\right)}{\partial \boldsymbol{z}_l}\right) \\
&=\frac{\partial \boldsymbol{z}_l}{\partial \boldsymbol{\theta}_l} \frac{\partial \mathcal{L}}{\partial \boldsymbol{z}_L}+\text{other terms}
\end{aligned}$$
Thus we see that the gradient at layer $l$ depends directly on the gradient at layer $L$ in a way that is independent of the depth of the network.
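
The effect of the identity term can be illustrated numerically. The sketch below makes a simplifying assumption that is not in the text: each block is purely linear, $\mathcal{F}_i(\boldsymbol{z}) = \mathbf{W}_i \boldsymbol{z}$, with small random weights. Under that assumption the end-to-end Jacobian $\partial \boldsymbol{z}_L / \partial \boldsymbol{z}_l$ is a product of the $\mathbf{W}_i$ for a plain stack and a product of $(\mathbf{I} + \mathbf{W}_i)$ for a residual stack. The function name and the depth, width, and scale values are hypothetical.

```python
import numpy as np

def end_to_end_jacobians(depth, width, scale, seed=0):
    """Return (plain, residual) Jacobians dz_L/dz_l for stacks of linear blocks.

    Plain layer:    z_{i+1} = W_i z_i        -> Jacobian is a product of W_i
    Residual layer: z_{i+1} = z_i + W_i z_i  -> Jacobian is a product of (I + W_i)
    """
    rng = np.random.default_rng(seed)
    eye = np.eye(width)
    plain, residual = eye.copy(), eye.copy()
    for _ in range(depth):
        W = rng.normal(0.0, scale, size=(width, width))
        plain = W @ plain               # no skip connection
        residual = (eye + W) @ residual  # skip connection keeps the identity term
    return plain, residual

plain, residual = end_to_end_jacobians(depth=50, width=16, scale=0.05)
plain_norm = np.linalg.norm(plain)        # Frobenius norm
residual_norm = np.linalg.norm(residual)
print(f"plain: {plain_norm:.3e}, residual: {residual_norm:.3e}")
```

With these small weights the plain Jacobian shrinks towards zero as depth grows, whereas the residual Jacobian stays on the order of the identity, which is the direct gradient path described above.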

## Heuristic initialization schemes

[GB10] show that sampling parameters from a zero-mean normal with fixed variance can result in exploding activations or gradients. To see why, consider a linear unit with no activation function given by $o_i=\sum_{j=1}^{n_{\text {in }}} w_{i j} x_j$; suppose $w_{i j} \sim \mathcal{N}\left(0, \sigma^2\right)$, and $\mathbb{E}\left[x_j\right]=0$ and $\mathbb{V}\left[x_j\right]=\gamma^2$, where we assume $x_j$ are independent of $w_{i j}$. The mean and variance of the output are given by
$$\begin{aligned}
\mathbb{E}\left[o_i\right] &=\sum_{j=1}^{n_{\text {in }}} \mathbb{E}\left[w_{i j} x_j\right]=\sum_{j=1}^{n_{\text {in }}} \mathbb{E}\left[w_{i j}\right] \mathbb{E}\left[x_j\right]=0 \\
\mathbb{V}\left[o_i\right] &=\mathbb{E}\left[o_i^2\right]-\left(\mathbb{E}\left[o_i\right]\right)^2=\sum_{j=1}^{n_{\text {in }}} \mathbb{E}\left[w_{i j}^2 x_j^2\right]-0=\sum_{j=1}^{n_{\text {in }}} \mathbb{E}\left[w_{i j}^2\right] \mathbb{E}\left[x_j^2\right]=n_{\text {in }} \sigma^2 \gamma^2
\end{aligned}$$
To keep the output variance from blowing up, we need to ensure $n_{\mathrm{in}} \sigma^2=1$ (or some other constant), where $n_{\text {in }}$ is the fan-in of a unit (number of incoming connections).
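
The variance formula $n_{\text{in}} \sigma^2 \gamma^2$ can be checked with a small Monte Carlo simulation. This is only a sketch: the inputs are drawn from a Gaussian for convenience (the derivation needs only their mean and variance), and the function name, sample count, and the values of `n_in` and `gamma` are illustrative assumptions.

```python
import numpy as np

def output_variance(n_in, sigma, gamma, n_samples=20000, seed=0):
    """Estimate V[o] for o = sum_j w_j x_j with independent zero-mean w and x."""
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, sigma, size=(n_samples, n_in))  # weights, variance sigma^2
    x = rng.normal(0.0, gamma, size=(n_samples, n_in))  # inputs, variance gamma^2
    o = np.sum(w * x, axis=1)                           # one linear unit per sample
    return o.var()

n_in, gamma = 100, 1.5
fixed = output_variance(n_in, sigma=1.0, gamma=gamma)                  # sigma^2 = 1
scaled = output_variance(n_in, sigma=1.0 / np.sqrt(n_in), gamma=gamma)  # n_in * sigma^2 = 1

print(f"fixed sigma:  {fixed:.1f} (predicted {n_in * gamma**2:.1f})")
print(f"scaled sigma: {scaled:.2f} (predicted {gamma**2:.2f})")
```

With a fixed $\sigma^2 = 1$ the output variance grows linearly with the fan-in, while choosing $\sigma^2 = 1/n_{\text{in}}$ keeps it close to the input variance $\gamma^2$.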

Now consider the backwards pass. By analogous reasoning, we see that the variance of the gradients can blow up unless $n_{\text {out }} \sigma^2=1$, where $n_{\text {out }}$ is the fan-out of a unit (number of outgoing connections).
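
The analogous reasoning can be spelled out as follows. This is a sketch under assumptions not stated in the text: $\delta_i = \partial \mathcal{L} / \partial o_i$ denotes the gradient arriving at output $i$, taken to have zero mean and variance $\tau^2$ and to be independent of the weights. The symbols $\delta_i$ and $\tau$ are introduced here purely for illustration.

$$\begin{aligned}
\frac{\partial \mathcal{L}}{\partial x_j} &=\sum_{i=1}^{n_{\text {out }}} w_{i j} \delta_i \\
\mathbb{V}\left[\frac{\partial \mathcal{L}}{\partial x_j}\right] &=\sum_{i=1}^{n_{\text {out }}} \mathbb{E}\left[w_{i j}^2\right] \mathbb{E}\left[\delta_i^2\right]=n_{\text {out }} \sigma^2 \tau^2
\end{aligned}$$

The sum now runs over the $n_{\text{out}}$ units that receive $x_j$, so the fan-out plays the role that the fan-in played in the forward pass.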

To satisfy both requirements at once, we set $\frac{1}{2}\left(n_{\text {in }}+n_{\text {out }}\right) \sigma^2=1$, or equivalently
$$\sigma^2=\frac{2}{n_{\text {in }}+n_{\text {out }}}$$
This is known as Xavier initialization or Glorot initialization, named after the first author of [GB10].

A special case arises if we use $\sigma^2=1 / n_{\text {in }}$; this is known as LeCun initialization, named after Yann LeCun, who proposed it in the 1990s. This is equivalent to Glorot initialization when $n_{\text {in }}=n_{\text {out }}$. If we use $\sigma^2=2 / n_{\text {in }}$, the method is called He initialization, named after Kaiming He, who proposed it in [He+15].
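
The three variance rules can be collected in one small helper. The function name `init_variance` and the scheme labels are hypothetical, chosen only to mirror the names used above.

```python
def init_variance(scheme, n_in, n_out):
    """Return the weight variance sigma^2 for the named heuristic scheme."""
    if scheme == "glorot":
        return 2.0 / (n_in + n_out)   # balances fan-in and fan-out
    if scheme == "lecun":
        return 1.0 / n_in             # fan-in only
    if scheme == "he":
        return 2.0 / n_in             # twice the LeCun variance
    raise ValueError(f"unknown scheme: {scheme!r}")

for scheme in ("glorot", "lecun", "he"):
    print(scheme, init_variance(scheme, n_in=64, n_out=64))
```

When `n_in == n_out` the Glorot and LeCun values coincide, as stated above, and the He variance is always twice the LeCun variance.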

Note that it is not necessary to use a Gaussian distribution. Indeed, the above derivation just worked in terms of the first two moments (mean and variance), and made no assumptions about Gaussianity. For example, suppose we sample weights from a uniform distribution, $w_{i j} \sim \operatorname{Unif}(-a, a)$. The mean is 0, and the variance is $\sigma^2=a^2 / 3$. Setting $a^2 / 3$ equal to the Glorot variance $2 /\left(n_{\mathrm{in}}+n_{\mathrm{out}}\right)$, we should set $a=\sqrt{\frac{6}{n_{\mathrm{in}}+n_{\mathrm{out}}}}$.
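
A minimal NumPy sketch of the uniform variant is shown below. The function name `glorot_uniform`, the layer sizes, and the `(n_out, n_in)` weight layout are assumptions for illustration. The empirical variance of the sampled weights should be close to $2/(n_{\text{in}}+n_{\text{out}})$.

```python
import numpy as np

def glorot_uniform(n_in, n_out, rng):
    """Sample an (n_out, n_in) weight matrix from Unif(-a, a), a = sqrt(6 / (n_in + n_out))."""
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_out, n_in))

rng = np.random.default_rng(0)
n_in, n_out = 300, 200
W = glorot_uniform(n_in, n_out, rng)

target = 2.0 / (n_in + n_out)   # Glorot variance
print(f"empirical variance: {W.var():.5f}, target: {target:.5f}")
```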

Although the above derivation assumes a linear output unit, the technique works well empirically even for nonlinear units. The best choice of initialization method depends on which activation function you use. For linear, tanh, logistic, and softmax, Glorot is recommended. For ReLU and variants, He is recommended. For SELU, LeCun is recommended. See e.g., [Ger19] for more heuristics, and e.g., [HDR19] for some theory.

