Regression Analysis | STAT2220



Regression Analysis | How to think about the estimate and its standard error

Hmmm, the estimated slope is shown in the output as $1.6199$, and the standard error is shown in the output as $0.1326$. So the actual slope is most likely in the range $1.6199 \pm 2(0.1326)$, or roughly $1.62 \pm 0.27$. AHA! The true slope is most likely a positive number! So the $X$ variable has a positive relation to $Y$!
We used $2.0$ rather than $1.96$ as the multiplier of the standard error because the result is only approximate anyway, so we might as well simplify things with one more approximation: $2.0$ instead of $1.96$. It makes life easier, and it works well in practice, so we generally recommend that you follow the reasoning in the mental conversation above.
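The arithmetic in that mental conversation can be checked in a few lines. This is only a sketch: the two numbers are the ones quoted above, and the variable names are made up for illustration.

```python
# Approximate 95% interval: estimate +/- 2 standard errors.
estimate = 1.6199   # slope estimate quoted in the text
std_error = 0.1326  # its standard error, also quoted in the text

half_width = 2.0 * std_error
lower, upper = estimate - half_width, estimate + half_width

print(f"half-width: {half_width:.4f}")
print(f"interval:   ({lower:.4f}, {upper:.4f})")
print("entirely positive:", lower > 0)
```

The interval runs from about 1.35 to about 1.89, so every plausible value of the slope is positive.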

But there are precise, mathematically exact results that you can use in the case where the data are produced by the classical model. The theory is mathematically deep, but you probably have seen it before, to one degree or another. It involves “Student’s $T$ distribution,” which is ubiquitous in statistics. In a nutshell, the issue revolves around how to deal with the estimate $\hat{\sigma}$ of $\sigma$ in the standard error formula. After all, as shown above, the first interval formula involving $1.96$ and $\sigma$ is exact; the only reason for calling the second interval formula “approximate” is because of the substitution of $\hat{\sigma}$ for $\sigma$. The effect of using $\hat{\sigma}$ rather than $\sigma$ can be precisely, exactly, quantified. A mathematical theorem states that if the classical regression model produces the real data, then the additional variability incurred when you use $\hat{\sigma}$ rather than $\sigma$ is precisely accounted for by using the $T$ (Student’s T) distribution rather than the $Z$ (standard normal) distribution.

Specifically, the critical value $1.96$ is from the $Z$ (standard normal) distribution, the number that puts $95\%$ probability between $-1.96$ and $1.96$. It is, therefore, the $0.975$ quantile of the standard normal distribution. In R it is qnorm(.975), which returns the more precise value $1.959964$.
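For readers without R at hand, the same quantile can be obtained from Python's standard library. The sketch below uses `statistics.NormalDist`, whose `inv_cdf` method plays the role of qnorm, and also confirms that the central probability is 95%.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

crit = z.inv_cdf(0.975)                     # 0.975 quantile, like R's qnorm(.975)
central_prob = z.cdf(crit) - z.cdf(-crit)   # probability between -crit and +crit

print(f"critical value:      {crit:.6f}")
print(f"P(-crit < Z < crit): {central_prob:.4f}")
```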

To account for the error in using the estimate $\hat{\sigma}$ of $\sigma$ in the standard error formula, you need to use the $T$ distribution rather than the $Z$ distribution. The $T$ distribution involves a “degrees of freedom” parameter, which in essence measures the accuracy of $\hat{\sigma}$ as an estimator of $\sigma$. This degrees of freedom quantity is mathematically identical to the divisor used to make the estimated variance an unbiased estimate:
$$dfe = n - (\#\text{ of } \beta\text{'s})$$
The “e” in “dfe” refers to “error”: recall that $\sigma$, the conditional standard deviation of $Y \mid X=x$, is also the standard deviation of the error term $\varepsilon$. You can think of $dfe$ as the “effective sample size” that is used to estimate the error standard deviation.
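The claim that $dfe$ is the divisor that makes the variance estimate unbiased can be checked by simulation. The following is a minimal sketch, not code from the source: the sample size, the parameter values, and the fixed design for $x$ are all hypothetical choices. It averages the sum of squared residuals divided by $dfe$ and divided by $n$ over many simulated data sets.

```python
import random

def sse_simple_ols(x, y):
    """Sum of squared residuals from the least-squares line of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

random.seed(1)
n, beta0, beta1, sigma = 10, 1.0, 1.5, 2.0   # hypothetical settings
dfe = n - 2                                   # two betas: intercept and slope
x = [float(i) for i in range(1, n + 1)]       # fixed design, also hypothetical

reps = 20000
mean_div_dfe = mean_div_n = 0.0
for _ in range(reps):
    # Classical model: normal errors with constant standard deviation sigma.
    y = [beta0 + beta1 * xi + random.gauss(0.0, sigma) for xi in x]
    sse = sse_simple_ols(x, y)
    mean_div_dfe += sse / dfe / reps
    mean_div_n += sse / n / reps

print(f"true variance:      {sigma ** 2:.3f}")
print(f"average of SSE/dfe: {mean_div_dfe:.3f}")
print(f"average of SSE/n:   {mean_div_n:.3f}")
```

Dividing by $dfe$ should average out close to the true variance, while dividing by $n$ should come out too small by the factor $dfe/n$.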

There is also a “model degrees of freedom” that we will discuss later, using the symbol $dfm$. The model degrees of freedom means something completely different: it refers to the flexibility (freedom) of the regression model, essentially the number of free parameters ($\beta$'s) in the model, excluding the intercept.

To get exact intervals for regression coefficients, you use the quantiles of the $T_{dfe}$ distribution, rather than the quantiles of the $Z$ distribution. The mathematics is precise but will not be proved here: It states that, if the data are produced by the classical regression model, then you have the following result.
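The displayed result did not survive in this copy. For reference, the standard textbook form of an exact interval of this kind is sketched here; the notation $t_{dfe,\,0.975}$ (the $0.975$ quantile of the $T_{dfe}$ distribution) and $\widehat{\mathrm{s.e.}}$ (the standard error computed with $\hat{\sigma}$) is an assumption, not necessarily the notation of the original display.

$$
\hat{\beta}_j \pm t_{dfe,\,0.975}\,\widehat{\mathrm{s.e.}}(\hat{\beta}_j)
$$

Under the classical model, this interval contains the true $\beta_j$ with probability exactly $95\%$. As $dfe$ grows, $t_{dfe,\,0.975}$ decreases toward $1.96$, so the $T$ and $Z$ intervals nearly coincide in large samples.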

Regression Analysis | Understanding “Exactness” and “Non-exactness” via Simulation

What does “exact” mean in these discussions? It means that the true confidence level is exactly $95\%$ when you use a $95\%$ confidence interval. Non-exactness means that the true confidence level is not equal to $95\%$; it may be higher or lower than $95\%$. Further, “true confidence level” refers to the true probability that the parameter lies within the prescribed confidence limits.

Here is a simple simulation to illustrate “exactness.” The data are simulated according to the classical model, the $95\%$ interval for $\beta_1$ is calculated, and we check whether the true $\beta_1$ lies within the interval. Then we repeat that process 100,000 times, finding the proportion of the 100,000 intervals that contain the true $\beta_1$. This proportion should be close to $95\%$ and will be exactly $95\%$ with infinitely many (rather than 100,000) simulations.

On the other hand, when data are simulated from a model where the assumptions are violated, the proportion will be different from $95\%$, even with infinitely many simulations. The simulation code that follows simulates data from the classical model, and also from the model with non-normal conditional distributions used to obtain Figure 1.11.
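The original listing is not reproduced in this copy. As a stand-in, here is a minimal Python sketch of the classical-model half of the experiment only; the non-normal model behind Figure 1.11 is not specified here, so it is left out. Only $\beta_1 = 1.5$ comes from the text. The sample size $n = 10$, the values of $\beta_0$ and $\sigma$, and the normal distribution for $x$ are assumptions, and the $T$ critical value is hard-coded for $dfe = 8$ (the value R returns for qt(.975, 8)). The sketch also tracks the interval that uses $1.96$ with $\hat{\sigma}$, to show that it is the $T$ interval, not the $Z$ interval, that is exact.

```python
import math
import random

def slope_and_se(x, y):
    """Least-squares slope and its estimated standard error."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(sse / (n - 2))  # divisor is dfe = n - 2
    return b1, sigma_hat / math.sqrt(sxx)

random.seed(12345)
n, beta0, beta1, sigma = 10, 2.0, 1.5, 1.0  # beta1 from the text; the rest assumed
T_CRIT = 2.306004   # 0.975 quantile of T with dfe = 8
Z_CRIT = 1.959964   # 0.975 quantile of Z
reps = 20000        # the text uses 100,000; fewer here to keep the run short

hits_t = hits_z = 0
for _ in range(reps):
    x = [random.gauss(0.0, 1.0) for _ in range(n)]
    y = [beta0 + beta1 * xi + random.gauss(0.0, sigma) for xi in x]
    b1, se = slope_and_se(x, y)
    hits_t += abs(b1 - beta1) <= T_CRIT * se  # true slope inside T interval?
    hits_z += abs(b1 - beta1) <= Z_CRIT * se  # true slope inside Z interval?

cover_t, cover_z = hits_t / reps, hits_z / reps
print(f"T interval coverage: {cover_t:.4f}")
print(f"Z interval coverage: {cover_z:.4f}")
```

With a sample this small, the $T$ interval should cover close to $95\%$ of the time, while the $Z$ interval with $\hat{\sigma}$ should fall noticeably short. The percentages quoted in the surrounding text come from the original run, not from this sketch.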

Thus, in the case where the classical model is true, $94.907 \%$ of the 100,000 samples gave a confidence interval that contained the true $\beta_1=1.5$. According to the mathematical theory, this percentage will be exactly $95 \%$ with infinitely many simulated data sets.

On the other hand, in the simulation where the conditional distributions are non-normal as illustrated in Figure 1.11, $96.058\%$ of the 100,000 samples gave a confidence interval that contained the true $\beta_1=1.5$. The mathematical theory does not state that this percentage will be exactly $95\%$ with infinitely many simulated data sets. In fact, the true percentage with infinitely many data sets will be more than $95\%$ in this case.

The non-exactness of the confidence interval is not a huge problem for the given simulation study, because the actual confidence level is close to $95 \%$ in the non-normal case. This study provides an example of our common refrain: You can best understand why and whether violations of assumptions are problematic via simulation.

Violations of assumptions other than normality can cause bigger problems. Figure 3.2 shows a case where the estimates are biased, and in such cases the intervals will systematically miss the target on the low side, leading to coverage rates close to $0\%$ in extreme cases. Similarly, heteroscedasticity (non-constant variance) can cause the standard errors to be too small, also leading to coverage rates much lower than $95\%$, which you can verify by using simulation.
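One way to run that verification is sketched below in Python. Everything about the setup is a hypothetical choice made for illustration: $n = 100$, normally distributed $x$, and an error standard deviation equal to $|x|$, so that the noise is largest at the extremes of $x$, where observations have the most influence on the slope. The interval is the simple "estimate $\pm$ 2 standard errors" rule.

```python
import math
import random

def slope_and_se(x, y):
    """Least-squares slope and its usual (constant-variance) standard error."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return b1, math.sqrt(sse / (n - 2) / sxx)

def coverage(error_sd, reps=4000, n=100, beta0=2.0, beta1=1.5, seed=2024):
    """Share of 'estimate +/- 2 s.e.' intervals that contain beta1."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # error_sd(xi) gives the error standard deviation at that x value.
        y = [beta0 + beta1 * xi + rng.gauss(0.0, error_sd(xi)) for xi in x]
        b1, se = slope_and_se(x, y)
        hits += abs(b1 - beta1) <= 2.0 * se
    return hits / reps

cover_const = coverage(lambda xi: 1.0)        # constant variance
cover_hetero = coverage(lambda xi: abs(xi))   # error sd grows with |x|

print(f"constant variance: {cover_const:.3f}")
print(f"heteroscedastic:   {cover_hetero:.3f}")
```

Under constant variance the coverage should sit near $95\%$. Under this particular form of heteroscedasticity it should drop well below that, because the usual standard error understates the true variability of the slope estimate.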

As it turns out, violation of the normality assumption is not usually a major concern for the validity of confidence intervals for the $\beta$ parameters: Even with non-normal conditional distributions $p(y \mid x)$, the Central Limit Theorem dictates that the distribution of the parameter estimates will be approximately normal. Other inferences are not so robust to non-normality: The prediction interval discussed in Section 3.8 below will behave quite poorly with non-normal processes. Inferences for variance parameters are similarly nonrobust. Further, even when OLS-based inferences are robust in the sense of having confidence levels near $95\%$ under non-normality, the OLS estimates themselves can be quite inaccurate relative to ML estimates under non-normality.

