## Over-Fitting and Model Selection

The big problem with using the in-sample error is related to over-optimism, but at once trickier to grasp and more important. This is the problem of over-fitting. To illustrate it, let's start with Figure 3.2. This has twenty $X$ values drawn from a Gaussian distribution, and $Y=7 X^2-0.5 X+\epsilon$, $\epsilon \sim \mathscr{N}(0,1)$. That is, the true regression curve is a parabola, with additive and independent Gaussian noise. Let's try fitting this, but pretend that we didn't know the curve was a parabola. We'll try fitting polynomials of different degrees in $x$: degree 0 (a flat line), degree 1 (a linear regression), degree 2 (quadratic regression), up through degree 9. Figure 3.3 shows the data with the polynomial curves, and Figure 3.4 shows the in-sample mean squared error as a function of the degree of the polynomial.

Notice that the in-sample error goes down as the degree of the polynomial increases; it has to. Every polynomial of degree $p$ can also be written as a polynomial of degree $p+1$ (with a zero coefficient for $x^{p+1}$ ), so going to a higher-degree model can only reduce the in-sample error. Quite generally, in fact, as one uses more and more complex and flexible models, the in-sample error will get smaller and smaller. ${ }^5$
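As a quick numerical check of this claim, here is a minimal Python sketch (my own illustration, not from the text; the seed and the use of `numpy.polyfit` are arbitrary choices) that simulates the setup described above and records the in-sample MSE for degrees 0 through 9:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate the setup from the text: twenty Gaussian X values and
# Y = 7 X^2 - 0.5 X + noise, with noise ~ N(0, 1).
n = 20
x = rng.standard_normal(n)
y = 7 * x**2 - 0.5 * x + rng.standard_normal(n)

# Fit polynomials of degree 0 through 9 and record the in-sample MSE.
in_sample_mse = []
for degree in range(10):
    coeffs = np.polyfit(x, y, deg=degree)
    fitted = np.polyval(coeffs, x)
    in_sample_mse.append(np.mean((y - fitted) ** 2))

# Because the models are nested (every degree-p polynomial is a
# degree-(p+1) polynomial with a zero leading coefficient), the
# in-sample MSE can only shrink or stay flat as the degree grows.
```

Running this, `in_sample_mse` is (up to numerical round-off) non-increasing in the degree, exactly as the nesting argument predicts.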
Things are quite different if we turn to the generalization error. In principle, I could calculate that for any of the models, since I know the true distribution, but it would involve calculating things like $\mathbb{E}\left[X^{18}\right]$, which won't be very illuminating. Instead, I will just draw a lot more data from the same source, twenty thousand data points in fact, and use the error of the old models on the new data as their generalization error.${ }^6$ The results are in Figure 3.5.
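The same empirical strategy is easy to sketch in Python (again my own illustration; the seed is arbitrary): fit each polynomial on a small sample, then score it on a large fresh draw from the same source.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, rng):
    # Same source as in the text: X Gaussian, Y = 7 X^2 - 0.5 X + N(0,1) noise.
    x = rng.standard_normal(n)
    y = 7 * x**2 - 0.5 * x + rng.standard_normal(n)
    return x, y

x_train, y_train = simulate(20, rng)      # the original small sample
x_new, y_new = simulate(20000, rng)       # a large fresh draw, standing in for
                                          # the generalization error

gen_error = []
for degree in range(10):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    preds = np.polyval(coeffs, x_new)
    gen_error.append(np.mean((y_new - preds) ** 2))

best_degree = int(np.argmin(gen_error))   # typically the true degree, 2
```

On the new data the quadratic does far better than both the under-fit low-degree models and the over-fit high-degree ones, whose errors blow up when the fresh $x$ values stray outside the range of the training sample.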

What is happening here is that the higher-degree polynomials – beyond degree 2 – are not just a little optimistic about how well they fit, they are wildly over-optimistic. The models which seemed to do notably better than a quadratic actually do much, much worse. If we picked a polynomial regression model based on in-sample fit, we'd choose the highest-degree polynomial available, and suffer for it.

## Leave-one-out Cross-Validation

Suppose we did $k$-fold cross-validation, but with $k=n$. Our testing sets would then consist of single points, and each point would be used in testing once. This is called leave-one-out cross-validation. It actually came before $k$-fold cross-validation, and has two advantages. First, it doesn’t require any random number generation, or keeping track of which data point is in which subset. Second, and more importantly, because we are only testing on one data point, it’s often possible to find what the prediction on the left-out point would be by doing calculations on a model fit to the whole data. (See below.) This means that we only have to fit each model once, rather than $k$ times, which can be a big savings of computing time.
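A minimal brute-force sketch of leave-one-out CV in Python (the toy data and the choice of a simple linear regression as the model are my own, purely for illustration): each of the $n$ fits holds out exactly one point and predicts it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data; the estimator under evaluation is a simple linear regression.
x = rng.standard_normal(30)
y = 2.0 * x + rng.standard_normal(30)
n = len(x)

# Leave-one-out CV: n "folds", each test set a single point.
squared_errors = []
for i in range(n):
    mask = np.arange(n) != i           # train on everything except point i
    coeffs = np.polyfit(x[mask], y[mask], deg=1)
    pred_i = np.polyval(coeffs, x[i])  # predict the one held-out point
    squared_errors.append((y[i] - pred_i) ** 2)

loocv_mse = np.mean(squared_errors)
```

This brute-force version refits the model $n$ times; the short-cut for linear smoothers discussed later in the section avoids that by working from a single full-data fit.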

The drawback to leave-one-out $\mathrm{CV}$ is subtle but often decisive. Since each training set has $n-1$ points, any two training sets must share $n-2$ points. The models fit to those training sets tend to be strongly correlated with each other. Even though we are averaging $n$ out-of-sample forecasts, those are correlated forecasts, so we are not really averaging away all that much noise. With $k$-fold $\mathrm{CV}$, on the other hand, the fraction of data shared between any two training sets is just $\frac{k-2}{k-1}$, not $\frac{n-2}{n-1}$, so even though the number of terms being averaged is smaller, they are less correlated.
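The shared-fraction claim is easy to verify concretely (a small sketch under my own choice of $n=100$ and $k=5$): two $k$-fold training sets overlap in everything except the two held-out folds.

```python
import numpy as np

n, k = 100, 5
indices = np.arange(n)
folds = np.array_split(indices, k)                 # k disjoint test folds

# Training set for fold j = everything not in fold j.
train_sets = [set(indices) - set(f) for f in folds]

# Two training sets share n - 2n/k points out of n - n/k each,
# i.e. a fraction (k - 2) / (k - 1) of their points.
shared = len(train_sets[0] & train_sets[1])
frac = shared / len(train_sets[0])
```

With $n=100$ and $k=5$, each training set has 80 points and any two share 60 of them, so the fraction is $60/80 = 3/4 = (k-2)/(k-1)$; with leave-one-out ($k=n$) the same formula gives $98/99$, which is why its forecasts are so strongly correlated.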

There are situations where this issue doesn’t really matter, or where it’s overwhelmed by leave-one-out’s advantages in speed and simplicity, so there is certainly still a place for it, but one subordinate to $k$-fold $\mathrm{CV}$. ${ }^9$

**A Short-cut for Linear Smoothers** Suppose the model $m$ is a linear smoother (§1.5). For each of the data points $i$, then, the predicted value is a linear combination of the observed values of $y$, $m\left(x_i\right)=\sum_j \hat{w}\left(x_i, x_j\right) y_j$ (Eq. 1.48). As in §1.5.3, define the "influence", "smoothing" or "hat" matrix $\hat{\mathbf{w}}$ by $\hat{w}_{ij}=\hat{w}\left(x_i, x_j\right)$. What happens when we hold back data point $i$, and then make a prediction at $x_i$? Well, the observed response at $i$ can't contribute to the prediction, but otherwise the linear smoother should work as before, so
$$m^{(-i)}\left(x_i\right)=\frac{(\hat{\mathbf{w}} \mathbf{y})_i-\hat{w}_{ii} y_i}{1-\hat{w}_{ii}}$$
The numerator just removes the contribution to $m\left(x_i\right)$ that came from $y_i$, and the denominator just re-normalizes the weights in the smoother.
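The short-cut can be checked numerically against an honest refit. Ordinary least squares is a linear smoother whose influence matrix is the familiar hat matrix $H = X(X^{\top}X)^{-1}X^{\top}$; the sketch below (my own example data) compares the formula's prediction with what we get by actually dropping point $i$ and refitting:

```python
import numpy as np

rng = np.random.default_rng(2)

# Linear regression is a linear smoother: the fitted values are H y,
# with hat (influence) matrix H = X (X'X)^{-1} X', so w_ij = H[i, j].
n = 25
x = rng.standard_normal(n)
y = 1.5 * x + rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])      # design matrix with intercept

H = X @ np.linalg.solve(X.T @ X, X.T)     # hat / influence matrix
fitted = H @ y                            # (w y)_i for every i

# Short-cut: leave-one-out prediction from the full-data fit alone.
i = 3
shortcut = (fitted[i] - H[i, i] * y[i]) / (1 - H[i, i])

# Honest version: refit the regression without point i.
mask = np.arange(n) != i
Xi, yi = X[mask], y[mask]
beta = np.linalg.solve(Xi.T @ Xi, Xi.T @ yi)
refit_pred = X[i] @ beta
```

The two agree to machine precision, so the full leave-one-out score can be computed from one fit instead of $n$.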
