# Machine Learning (CS7641)

## Regularization effects of (stochastic) gradient descent

Some optimization methods (in particular, second-order batch methods) are able to find “needles in haystacks”, i.e., narrow but deep “holes” in the loss landscape corresponding to parameter settings with very low loss. These are known as sharp minima, see Figure 13.19(right). From the point of view of minimizing the empirical loss, the optimizer has done a good job. However, such solutions generally correspond to a model that has overfit the data. It is better to find points that correspond to flat minima, as shown in Figure 13.19(left); such solutions are more robust and generalize better. To see why, note that flat minima correspond to regions in parameter space where there is a lot of posterior uncertainty, and hence samples from this region are less able to precisely memorize irrelevant details about the training set [AS17]. SGD often finds such flat minima by virtue of the addition of noise, which prevents it from “entering” narrow regions of the loss landscape (see e.g., [SL18]). This is called implicit regularization. It is also possible to explicitly encourage SGD to find such flat minima, using entropy SGD [Cha+17], sharpness aware minimization [For+21], stochastic weight averaging (SWA) [Izm+18], and other related techniques.
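To make the weight-averaging idea concrete, here is a minimal toy sketch in the spirit of SWA: run noisy SGD on a simple quadratic loss, then average the tail of the iterates. The quadratic loss, noise scale, and averaging window below are all hypothetical illustrative choices, not the actual SWA recipe (which averages weights saved periodically under a cyclical or constant learning rate).

```python
import random

random.seed(0)
eps = 0.1          # learning rate
w = 5.0            # initial parameter
history = []
for t in range(500):
    g = w + random.gauss(0.0, 2.0)   # noisy gradient of L(w) = w^2 / 2
    w -= eps * g                      # SGD step
    history.append(w)

w_sgd = history[-1]                   # final (noisy) iterate
w_swa = sum(history[-100:]) / 100     # averaged tail of the trajectory
```

Because the minibatch noise is roughly symmetric around the bottom of the basin, the averaged iterate `w_swa` typically sits much closer to the true minimum (here $w^* = 0$) than a single noisy iterate such as `w_sgd`, which is the intuition behind averaging finding central points of flat basins.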

Of course, the loss landscape depends not just on the parameter values, but also on the data. Since we usually cannot afford to do full-batch gradient descent, we will get a set of loss curves, one per minibatch. If each one of these curves corresponds to a wide basin, as shown in Figure 13.20a, we are at a point in parameter space that is robust to perturbations, and will likely generalize well. However, if the overall wide basin is the result of averaging over many different narrow basins, as shown in Figure 13.20b, the resulting estimate will likely generalize less well.

This can be formalized using the analysis in [Smi+21; BD21]. Specifically, they consider a continuous-time gradient flow which approximates the behavior of (S)GD. In [BD21], they consider full-batch GD, and show that the flow has the form $\dot{\boldsymbol{w}}=-\nabla_{\boldsymbol{w}} \tilde{\mathcal{L}}_{GD}(\boldsymbol{w})$, where $$\tilde{\mathcal{L}}_{GD}(\boldsymbol{w})=\mathcal{L}(\boldsymbol{w})+\frac{\epsilon}{4}\left\|\nabla \mathcal{L}(\boldsymbol{w})\right\|^2$$
where $\mathcal{L}(\boldsymbol{w})$ is the original loss, $\epsilon$ is the learning rate, and the second term is an implicit regularization term that penalizes solutions with large gradients (high curvature).

In [Smi+21], they extend this analysis to the SGD case. They show that the flow has the form $\dot{\boldsymbol{w}}=-\nabla_{\boldsymbol{w}} \tilde{\mathcal{L}}_{SGD}(\boldsymbol{w})$, where $$\tilde{\mathcal{L}}_{SGD}(\boldsymbol{w})=\mathcal{L}(\boldsymbol{w})+\frac{\epsilon}{4 m} \sum_{k=1}^m\left\|\nabla \mathcal{L}_k(\boldsymbol{w})\right\|^2$$ where $m$ is the number of minibatches, and $\mathcal{L}_k(\boldsymbol{w})$ is the loss on the $k$'th such minibatch. Comparing this to the full-batch GD loss, we see $$\tilde{\mathcal{L}}_{SGD}(\boldsymbol{w})=\tilde{\mathcal{L}}_{GD}(\boldsymbol{w})+\frac{\epsilon}{4 m} \sum_{k=1}^m\left\|\nabla \mathcal{L}_k(\boldsymbol{w})-\nabla \mathcal{L}(\boldsymbol{w})\right\|^2$$ so the extra term introduced by SGD penalizes the variance of the minibatch gradients, discouraging the narrow-basin averages of Figure 13.20b.
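The relation between the implicit GD and SGD losses is just the variance identity $\frac{1}{m}\sum_k \|g_k\|^2 = \|\bar{g}\|^2 + \frac{1}{m}\sum_k \|g_k - \bar{g}\|^2$ applied to the minibatch gradients (note the per-minibatch penalty carries a $1/m$, since $\mathcal{L}$ is the average of the $\mathcal{L}_k$). A quick numerical check, using made-up scalar minibatch losses $\mathcal{L}_k(w)=\tfrac{1}{2}(w-c_k)^2$:

```python
cs = [0.0, 1.0, 4.0]       # hypothetical per-minibatch targets
m = len(cs)
eps, w = 0.1, 2.0

L = sum(0.5 * (w - c) ** 2 for c in cs) / m           # full-batch loss
gks = [w - c for c in cs]                             # minibatch gradients
gbar = sum(gks) / m                                   # full-batch gradient

L_gd  = L + eps / 4 * gbar ** 2                       # implicit GD loss
L_sgd = L + eps / (4 * m) * sum(g ** 2 for g in gks)  # implicit SGD loss
rhs   = L_gd + eps / (4 * m) * sum((g - gbar) ** 2 for g in gks)

assert abs(L_sgd - rhs) < 1e-12   # the decomposition holds exactly
```

The cross term vanishes because the minibatch gradients average exactly to the full-batch gradient, so the decomposition is exact, not an approximation.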

## Radial basis function networks

Consider a one-layer neural net where the hidden layer is given by the feature vector
$$\boldsymbol{\phi}(\boldsymbol{x})=\left[\mathcal{K}\left(\boldsymbol{x}, \boldsymbol{\mu}_1\right), \ldots, \mathcal{K}\left(\boldsymbol{x}, \boldsymbol{\mu}_K\right)\right]$$ where $\boldsymbol{\mu}_k \in \mathcal{X}$ are a set of $K$ centroids or exemplars, and $\mathcal{K}(\boldsymbol{x}, \boldsymbol{\mu}) \geq 0$ is a kernel function. We describe kernel functions in detail in Section 17.1. Here we just give an example, namely the Gaussian kernel $$\mathcal{K}_{\text{gauss}}(\boldsymbol{x}, \boldsymbol{c}) \triangleq \exp \left(-\frac{1}{2 \sigma^2}\left\|\boldsymbol{c}-\boldsymbol{x}\right\|_2^2\right)$$
The parameter $\sigma$ is known as the bandwidth of the kernel. Note that this kernel is shift invariant, meaning it is only a function of the distance $r=\|\boldsymbol{x}-\boldsymbol{c}\|_2$, so we can equivalently write this as
$$\mathcal{K}_{\text {gauss }}(r) \triangleq \exp \left(-\frac{1}{2 \sigma^2} r^2\right)$$
This is therefore called a radial basis function kernel or RBF kernel.
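A minimal sketch of this kernel and its shift invariance (the function name and test points are ours, chosen for illustration):

```python
import math

def k_gauss(x, c, sigma=1.0):
    """Gaussian (RBF) kernel between points x and c, given as lists of floats."""
    r2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return math.exp(-r2 / (2 * sigma ** 2))

# Shift invariance: translating both arguments leaves the kernel unchanged,
# since it depends only on the distance r = ||x - c||_2.
a = k_gauss([1.0, 2.0], [0.0, 0.0])
b = k_gauss([6.0, 7.0], [5.0, 5.0])   # both points shifted by (5, 5)
```

Increasing the bandwidth $\sigma$ makes the kernel decay more slowly with distance, giving wider "bumps" around each centroid.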
A one-layer neural net in which we use Equation (13.101) as the hidden layer, with RBF kernels, is called an RBF network [BL88]. This has the form
$$p(y \mid \boldsymbol{x}, \boldsymbol{\theta})=p\left(y \mid \boldsymbol{w}^{\top} \boldsymbol{\phi}(\boldsymbol{x})\right)$$
where $\boldsymbol{\theta}=(\boldsymbol{\mu}, \boldsymbol{w})$. If the centroids $\boldsymbol{\mu}$ are fixed, we can solve for the optimal weights $\boldsymbol{w}$ using (regularized) least squares, as discussed in Chapter 11. If the centroids are unknown, we can estimate them by using an unsupervised clustering method, such as $K$-means (Section 21.3). Alternatively, we can associate one centroid per data point in the training set, to get $\boldsymbol{\mu}_n=\boldsymbol{x}_n$, where now $K=N$. This is an example of a non-parametric model, since the number of parameters grows (in this case linearly) with the amount of data, and is not independent of $N$. If $K=N$, the model can perfectly interpolate the data, and hence may overfit. However, by ensuring that the output weight vector $\boldsymbol{w}$ is sparse, the model will only use a finite subset of the input examples; this is called a sparse kernel machine, and will be discussed in more detail in Section 17.4.1 and Section 17.3. Another way to avoid overfitting is to adopt a Bayesian approach, by integrating out the weights $\boldsymbol{w}$; this gives rise to a model called a Gaussian process, which will be discussed in more detail in Section 17.2.
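The fixed-centroid case can be sketched as ridge regression on RBF features. The grid of centroids, bandwidth, penalty, and toy data below are arbitrary illustrative choices (in practice the centroids might come from $K$-means, as noted above):

```python
import numpy as np

def rbf_features(X, mus, sigma):
    """Design matrix with phi[n, k] = exp(-||x_n - mu_k||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 40)[:, None]             # toy 1-D inputs
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)

mus = np.linspace(-3, 3, 10)[:, None]           # K = 10 fixed centroids on a grid
Phi = rbf_features(X, mus, sigma=0.7)           # N x K feature matrix
lam = 1e-3                                      # ridge penalty
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(10), Phi.T @ y)
yhat = Phi @ w                                  # fitted outputs
```

With the centroids held fixed, the model is linear in $\boldsymbol{w}$, so the fit reduces to a single regularized least-squares solve; setting `mus = X` instead would give the non-parametric $K=N$ interpolation case discussed above.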

