# Machine learning (CS446)

## Variable selection consistency

It is common to use $\ell_1$ regularization to estimate the set of relevant variables, a process known as variable selection. A method that can recover the true set of relevant variables (i.e., the support of $\boldsymbol{w}^*$ ) in the $N \rightarrow \infty$ limit is called model selection consistent. (This is a theoretical notion that assumes the data comes from the model.)

Let us give an example. We first generate a sparse signal $\boldsymbol{w}^*$ of size $D=4096$, consisting of 160 randomly placed $\pm 1$ spikes. Next we generate a random design matrix $\mathbf{X}$ of size $N \times D$, where $N=1024$. Finally we generate a noisy observation $\boldsymbol{y}=\mathbf{X} \boldsymbol{w}^*+\boldsymbol{\epsilon}$, where $\epsilon_n \sim \mathcal{N}\left(0,0.01^2\right)$. We then estimate $\boldsymbol{w}$ from $\boldsymbol{y}$ and $\mathbf{X}$. The original $\boldsymbol{w}^*$ is shown in the first row of Figure 11.13. The second row is the $\ell_1$ estimate $\hat{\boldsymbol{w}}_{L1}$ using $\lambda=0.1 \lambda_{\max}$. We see that this has "spikes" in the right places, so it has correctly identified the relevant variables. However, although $\hat{\boldsymbol{w}}_{L1}$ has the correct support, its nonzero coefficients are too small, due to shrinkage. In the third row, we show the results of using the debiasing technique discussed in Section 11.4.3. This shows that we can recover the original weight vector. By contrast, the final row shows the OLS estimate, which is dense. Furthermore, it is visually clear that there is no single threshold value we can apply to $\hat{\boldsymbol{w}}_{\text{mle}}$ to recover the correct sparse weight vector.
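The experiment above can be sketched with scikit-learn. This is a minimal reproduction, not the original code: the Gaussian design, the random seed, and all variable names are assumptions, and sklearn's `alpha` plays the role of $\lambda$ under its $\frac{1}{2N}$ scaling of the squared loss. The debiasing step simply refits OLS on the selected support.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)

# Sparse signal: D = 4096 coefficients, 160 randomly placed +/-1 spikes.
D, N, k = 4096, 1024, 160
w_true = np.zeros(D)
idx = rng.choice(D, size=k, replace=False)
w_true[idx] = rng.choice([-1.0, 1.0], size=k)

# Random Gaussian design (columns roughly unit norm) and noisy observation.
X = rng.standard_normal((N, D)) / np.sqrt(N)
y = X @ w_true + rng.normal(0.0, 0.01, size=N)

# lambda_max is the smallest penalty that zeroes every coefficient
# (for sklearn's objective (1/(2N))||y - Xw||^2 + alpha*||w||_1).
lam_max = np.max(np.abs(X.T @ y)) / N
lasso = Lasso(alpha=0.1 * lam_max, fit_intercept=False).fit(X, y)
support = np.flatnonzero(lasso.coef_)

# Debiasing: refit ordinary least squares on the selected support only,
# which removes the shrinkage bias on the retained coefficients.
w_debiased = np.zeros(D)
if support.size > 0:
    ols = LinearRegression(fit_intercept=False).fit(X[:, support], y)
    w_debiased[support] = ols.coef_
```

With these dimensions the lasso estimate typically places spikes at (most of) the true locations, and the debiased refit restores their $\pm 1$ magnitudes, mirroring the rows of Figure 11.13.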

To use lasso to perform variable selection, we have to pick $\lambda$. It is common to use cross validation to pick the optimal value on the regularization path. However, it is important to note that cross validation picks the value of $\lambda$ that results in good predictive accuracy, which is not usually the same value as the one most likely to recover the "true" model. To see why, recall that $\ell_1$ regularization performs selection and shrinkage; that is, the chosen coefficients are brought closer to 0. To prevent relevant coefficients from being shrunk in this way, cross validation will tend to pick a value of $\lambda$ that is not too large. Of course, this results in a less sparse model that contains irrelevant variables (false positives). Indeed, it was proved in [MB06] that the prediction-optimal value of $\lambda$ does not result in model selection consistency. However, various extensions to the basic method have been devised that are model selection consistent (see e.g., [BG11; HTW15]).
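The tension between prediction and support recovery can be seen with `LassoCV`, which picks $\lambda$ by cross-validated predictive error. The toy problem below (5 relevant features out of 50; all sizes and names are made up for illustration) shows the typical pattern: the CV-optimal $\lambda$ is small enough that all true variables survive, but usually some irrelevant ones slip in as false positives.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)

# Toy problem: only the first k of D features are relevant.
N, D, k = 200, 50, 5
w_true = np.zeros(D)
w_true[:k] = 1.0
X = rng.standard_normal((N, D))
y = X @ w_true + rng.normal(0.0, 0.5, size=N)

# Cross validation selects the alpha with the best predictive accuracy...
cv = LassoCV(cv=5, fit_intercept=False).fit(X, y)
support = np.flatnonzero(cv.coef_)

# ...which is typically small enough to keep all relevant variables
# but also admit some irrelevant ones (false positives).
print("selected:", support.size, "relevant:", k)  # selected is often > k
```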

## Penalizing the two-norm

To encourage group sparsity, we partition the parameter vector into $G$ groups, $\boldsymbol{w}=\left[\boldsymbol{w}_1, \ldots, \boldsymbol{w}_G\right]$. Then we minimize the following objective
$$\operatorname{PNLL}(\boldsymbol{w})=\operatorname{NLL}(\boldsymbol{w})+\lambda \sum_{g=1}^G\left\|\boldsymbol{w}_g\right\|_2$$
where $\left\|\boldsymbol{w}_g\right\|_2=\sqrt{\sum_{d \in g} w_d^2}$ is the 2-norm of the group weight vector. If the NLL is least squares, this method is called group lasso [YL06; Kyu+10].

Note that if we had used the sum of the squared 2-norms in Equation (11.97), then the model would become equivalent to ridge regression, since
$$\sum_{g=1}^G\left\|\boldsymbol{w}_g\right\|_2^2=\sum_g \sum_{d \in g} w_d^2=\|\boldsymbol{w}\|_2^2$$
By using the square root, we are penalizing the radius of a ball containing the group’s weight vector: the only way for the radius to be small is if all elements are small.
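This "shrink the radius" behavior is easiest to see in the proximal operator of the penalty $t\left\|\boldsymbol{w}_g\right\|_2$, which proximal-gradient solvers for group lasso apply to each group. A minimal sketch (the function name is ours, not from the text):

```python
import numpy as np

def group_soft_threshold(w_g, t):
    """Proximal operator of t * ||w_g||_2: shrink the group's radius by t,
    zeroing the entire group if its 2-norm is below t."""
    norm = np.linalg.norm(w_g)
    if norm <= t:
        return np.zeros_like(w_g)
    return (1.0 - t / norm) * w_g

# A group with small radius is zeroed as a unit...
print(group_soft_threshold(np.array([0.1, -0.2]), 0.5))  # -> [0. 0.]
# ...while a large group merely shrinks; every element is kept.
print(group_soft_threshold(np.array([3.0, 4.0]), 0.5))   # norm 5 -> [2.7 3.6]
```

Contrast this with the squared penalty, whose proximal operator rescales every group by the same factor and never produces exact zeros, which is why it reduces to ridge.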

Another way to see why the square root version enforces sparsity at the group level is to consider the gradient of the objective. Suppose there is only one group of two variables, so the penalty has the form $\sqrt{w_1^2+w_2^2}$. The derivative wrt $w_1$ is
$$\frac{\partial}{\partial w_1}\left(w_1^2+w_2^2\right)^{\frac{1}{2}}=\frac{w_1}{\sqrt{w_1^2+w_2^2}}$$
If $w_2$ is close to zero, then the derivative approaches 1 , and $w_1$ is driven to zero as well, with force proportional to $\lambda$. If, however, $w_2$ is large, the derivative approaches 0 , and $w_1$ is free to stay large as well. So all the coefficients in the group will have similar size.
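This gradient argument can be checked numerically (a throwaway sketch; the helper name is ours):

```python
import numpy as np

def grad_w1(w1, w2):
    """Partial derivative of sqrt(w1^2 + w2^2) with respect to w1."""
    return w1 / np.hypot(w1, w2)

# w2 = 0: the gradient magnitude is 1 even for tiny w1,
# so the penalty keeps pushing w1 toward zero with constant force.
print(grad_w1(0.01, 0.0))    # -> 1.0
# w2 large: the gradient is nearly 0, so w1 is free to stay large.
print(grad_w1(0.01, 10.0))   # ~ 0.001
```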
