# Machine learning (CS7641)

## Hard vs soft thresholding

The lasso objective has the form $\mathcal{L}(\boldsymbol{w})=\mathrm{NLL}(\boldsymbol{w})+\lambda\|\boldsymbol{w}\|_1$. One can show (Exercise 11.3) that the gradient for the smooth NLL part is given by

$$
\begin{aligned}
\frac{\partial}{\partial w_d} \mathrm{NLL}(\boldsymbol{w}) &= a_d w_d-c_d \\
a_d &= \sum_{n=1}^N x_{n d}^2 \\
c_d &= \sum_{n=1}^N x_{n d}\left(y_n-\boldsymbol{w}_{-d}^{\top} \boldsymbol{x}_{n,-d}\right)
\end{aligned}
$$

where $\boldsymbol{w}_{-d}$ is $\boldsymbol{w}$ without component $d$, and similarly $\boldsymbol{x}_{n,-d}$ is feature vector $\boldsymbol{x}_n$ without component $d$. We see that $c_d$ is proportional to the correlation between the $d$'th column of features, $\boldsymbol{x}_{:, d}$, and the residual error obtained by predicting using all the other features, $\boldsymbol{r}_{-d}=\boldsymbol{y}-\mathbf{X}_{:,-d} \boldsymbol{w}_{-d}$. Hence the magnitude of $c_d$ is an indication of how relevant feature $d$ is for predicting $\boldsymbol{y}$, relative to the other features and the current parameters. Setting the gradient to 0 gives the optimal update for $w_d$, keeping all other weights fixed:

$$w_d=c_d / a_d=\frac{\boldsymbol{x}_{:, d}^{\top} \boldsymbol{r}_{-d}}{\left\|\boldsymbol{x}_{:, d}\right\|_2^2}$$

The corresponding new prediction for $\boldsymbol{r}_{-d}$ becomes $\hat{\boldsymbol{r}}_{-d}=w_d \boldsymbol{x}_{:, d}$, which is the orthogonal projection of the residual onto the column vector $\boldsymbol{x}_{:, d}$, consistent with Equation (11.15).
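To make the quantities $a_d$ and $c_d$ concrete, here is a minimal sketch in plain Python (lists only, no external libraries). The helper name `coordinate_stats` and the tiny dataset are made up for illustration; the code assumes the NLL is $\frac{1}{2}\|\boldsymbol{y}-\mathbf{X}\boldsymbol{w}\|_2^2$, which is the scaling implied by the gradient $a_d w_d - c_d$.

```python
def coordinate_stats(X, y, w, d):
    """Return (a_d, c_d) for feature d given the current weights w.

    X is a list of N rows (each a list of D floats); y is a list of N floats.
    """
    a_d = 0.0
    c_d = 0.0
    for x_n, y_n in zip(X, y):
        # Prediction for this example using every feature except d.
        pred_without_d = sum(w[j] * x_n[j] for j in range(len(w)) if j != d)
        a_d += x_n[d] ** 2
        c_d += x_n[d] * (y_n - pred_without_d)
    return a_d, c_d


# Hypothetical data: y = 2 * x_0 exactly, and column 1 is orthogonal to column 0.
X = [[1.0, 1.0], [1.0, -1.0], [2.0, 0.0]]
y = [2.0, 2.0, 4.0]
w = [0.0, 0.0]

a_0, c_0 = coordinate_stats(X, y, w, d=0)
w[0] = c_0 / a_0  # unregularized coordinate update w_d = c_d / a_d

a_1, c_1 = coordinate_stats(X, y, w, d=1)
w[1] = c_1 / a_1  # residual is already zero, so feature 1 gets weight 0
```

In this toy case the first update recovers the generating weight for feature 0, after which the residual carries no signal for feature 1.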

Now we add in the $\ell_1$ term. Unfortunately, the $\|\boldsymbol{w}\|_1$ term is not differentiable whenever $w_d=0$. Fortunately, we can still compute a subgradient at this point. Using Equation (8.14) we find that

$$
\begin{aligned}
\partial_{w_d} \mathcal{L}(\boldsymbol{w}) &= \left(a_d w_d-c_d\right)+\lambda\, \partial_{w_d}\|\boldsymbol{w}\|_1 \\
&= \begin{cases}
\left\{a_d w_d-c_d-\lambda\right\} & \text{if } w_d<0 \\
\left[-c_d-\lambda,\,-c_d+\lambda\right] & \text{if } w_d=0 \\
\left\{a_d w_d-c_d+\lambda\right\} & \text{if } w_d>0
\end{cases}
\end{aligned}
$$
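The excerpt stops at the subgradient, so the following is a sketch of the standard next step rather than a quotation of the source. Requiring that 0 belong to the subdifferential gives three cases: $w_d=(c_d+\lambda)/a_d$ if $c_d<-\lambda$, $w_d=0$ if $c_d\in[-\lambda,\lambda]$, and $w_d=(c_d-\lambda)/a_d$ if $c_d>\lambda$. This is the soft thresholding rule that the section title refers to. The function names below are illustrative.

```python
def soft_threshold(x, delta):
    """Shrink x toward zero by delta; return 0.0 when |x| <= delta."""
    if x > delta:
        return x - delta
    if x < -delta:
        return x + delta
    return 0.0


def lasso_coordinate_update(a_d, c_d, lam):
    """Minimize the lasso objective over w_d alone, other weights held fixed.

    Cases, read off from the subgradient:
      c_d in [-lam, lam]: 0 is in the subdifferential at w_d = 0, so w_d = 0.
      c_d >  lam:         a_d * w_d - c_d + lam = 0 with w_d > 0.
      c_d < -lam:         a_d * w_d - c_d - lam = 0 with w_d < 0.
    """
    return soft_threshold(c_d, lam) / a_d


# Illustrative values: a strongly relevant feature is shrunk, a weak one is zeroed.
w_strong = lasso_coordinate_update(a_d=6.0, c_d=12.0, lam=3.0)
w_weak = lasso_coordinate_update(a_d=6.0, c_d=2.0, lam=3.0)
```

Compared with the unregularized update $c_d/a_d$, the rule subtracts $\lambda$ from the magnitude of $c_d$ before dividing, and sets the weight exactly to zero when $|c_d|\le\lambda$.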

## Regularization path

If $\lambda=0$, we get the OLS solution, which will be dense. As we increase $\lambda$, the solution vector $\hat{\boldsymbol{w}}(\lambda)$ will tend to get sparser. If $\lambda$ is bigger than some critical value, we get $\hat{\boldsymbol{w}}=\mathbf{0}$. This critical value is obtained when the gradient of the NLL cancels out with the gradient of the penalty:

$$\lambda_{\max }=\max_d\left|\nabla_{w_d} \mathrm{NLL}(\mathbf{0})\right|=\max_d \left|c_d(\boldsymbol{w}=\mathbf{0})\right|=\max_d\left|\boldsymbol{y}^{\top} \boldsymbol{x}_{:, d}\right|=\left\|\mathbf{X}^{\top} \boldsymbol{y}\right\|_{\infty}$$

Alternatively, we can work with the bound $B$ on the $\ell_1$ norm. When $B=0$, we get $\hat{\boldsymbol{w}}=\mathbf{0}$. As we increase $B$, the solution becomes denser. The largest value of $B$ for which any component is zero is given by $B_{\max }=\left\|\hat{\boldsymbol{w}}_{\mathrm{mle}}\right\|_1$.
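As a small illustration of the critical value $\lambda_{\max}$, the sketch below computes $\|\mathbf{X}^{\top}\boldsymbol{y}\|_\infty$ in plain Python. The helper names and the dataset are hypothetical, and the NLL is again assumed to be $\frac{1}{2}\|\boldsymbol{y}-\mathbf{X}\boldsymbol{w}\|_2^2$ (a library that rescales the loss, for example by $1/N$, would rescale $\lambda_{\max}$ accordingly).

```python
def xt_y(X, y):
    """Return the vector X^T y as a list, one entry per feature."""
    D = len(X[0])
    return [sum(x_n[d] * y_n for x_n, y_n in zip(X, y)) for d in range(D)]


def lambda_max(X, y):
    """Smallest lambda for which w = 0 is optimal: max_d |x_{:,d}^T y|."""
    return max(abs(v) for v in xt_y(X, y))


# Hypothetical data: 3 examples, 2 features.
X = [[1.0, 1.0], [1.0, -1.0], [2.0, 0.0]]
y = [2.0, 2.0, 4.0]

c_at_zero = xt_y(X, y)        # c_d evaluated at w = 0
lam_max = lambda_max(X, y)

# At lambda = lam_max every |c_d| is within the threshold, so no weight turns on.
stays_zero = all(abs(c) <= lam_max for c in c_at_zero)
```

In practice this value is a convenient upper end for a grid of $\lambda$ values, since anything larger gives the same all-zero solution.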

As we increase $\lambda$, the solution vector $\hat{\boldsymbol{w}}$ gets sparser, although not necessarily monotonically. We can plot the values $\hat{w}_d$ vs $\lambda$ (or vs the bound $B$) for each feature $d$; this is known as the regularization path. This is illustrated in Figure 11.10(b), where we apply lasso to the prostate cancer regression dataset from [HTF09]. (We treat features gleason and svi as numeric, not categorical.) On the left, when $B=0$, all the coefficients are zero. As we increase $B$, the coefficients gradually "turn on". The analogous result for ridge regression is shown in Figure 11.10(a). For ridge, we see all coefficients are non-zero (assuming $\lambda>0$), so the solution is not sparse.
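A regularization path can be traced numerically by solving the lasso on a decreasing grid of $\lambda$ values. The sketch below does this with cyclic coordinate descent and warm starts in plain Python. It is an illustration under stated assumptions, not the procedure used for Figure 11.10: the dataset is made up (not the prostate data), the NLL is taken to be $\frac{1}{2}\|\boldsymbol{y}-\mathbf{X}\boldsymbol{w}\|_2^2$, and the function names, grid size and sweep count are arbitrary choices.

```python
def soft_threshold(x, delta):
    """Shrink x toward zero by delta; return 0.0 when |x| <= delta."""
    if x > delta:
        return x - delta
    if x < -delta:
        return x + delta
    return 0.0


def lasso_cd(X, y, lam, w_init, n_sweeps=200):
    """Cyclic coordinate descent for 0.5 * ||y - Xw||^2 + lam * ||w||_1."""
    w = list(w_init)
    D = len(w)
    for _ in range(n_sweeps):
        for d in range(D):
            a_d = sum(x_n[d] ** 2 for x_n in X)
            c_d = sum(
                x_n[d] * (y_n - sum(w[j] * x_n[j] for j in range(D) if j != d))
                for x_n, y_n in zip(X, y)
            )
            w[d] = soft_threshold(c_d, lam) / a_d
    return w


# Hypothetical data generated as y = 2 * x_0 + 1 * x_1 with no noise.
X = [[1.0, 0.5], [1.0, -1.0], [2.0, 0.0], [0.0, 1.0]]
y = [2.5, 1.0, 4.0, 1.0]
D = len(X[0])

lam_max = max(abs(sum(x_n[d] * y_n for x_n, y_n in zip(X, y))) for d in range(D))
grid = [lam_max * k / 10 for k in range(10, -1, -1)]  # from lam_max down to 0

path = []
w = [0.0] * D
for lam in grid:
    w = lasso_cd(X, y, lam, w)  # warm start from the previous solution
    path.append((lam, list(w)))

for lam, w_lam in path:
    print(f"lambda={lam:6.2f}  w={[round(v, 4) for v in w_lam]}")
```

The first entry of `path` is the all-zero solution at $\lambda_{\max}$ and the last is the least squares fit at $\lambda=0$; plotting each coordinate of `w_lam` against `lam` gives a coarse version of the path plot.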

Remarkably, it can be shown that the lasso solution path is a piecewise linear function of $\lambda$ [Efr+04; GL15]. That is, there is a set of critical values of $\lambda$ where the active set of non-zero coefficients changes. For values of $\lambda$ between these critical values, each non-zero coefficient increases or decreases in a linear fashion. This is illustrated in Figure 11.10(b). Furthermore, one can solve for these critical values analytically [Efr+04]. In Table 11.1, we display the actual coefficient values at each of these critical steps along the regularization path (the last line is the least squares solution).
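The piecewise linear structure is easiest to see in the single-feature case. As a worked example (assuming $D=1$ and $\mathrm{NLL}(w)=\frac{1}{2}\|\boldsymbol{y}-w\boldsymbol{x}\|_2^2$, so that $a=\|\boldsymbol{x}\|_2^2$ and $c=\boldsymbol{x}^{\top}\boldsymbol{y}$ do not depend on $w$), the soft thresholding solution is

$$
\hat{w}(\lambda)=\frac{\operatorname{sign}(c)\,\max (|c|-\lambda, 0)}{a}=
\begin{cases}
\dfrac{c-\operatorname{sign}(c)\,\lambda}{a} & \text{if } 0 \leq \lambda<|c| \\[2mm]
0 & \text{if } \lambda \geq|c|
\end{cases}
$$

So $\hat{w}(\lambda)$ is linear in $\lambda$ with slope $-\operatorname{sign}(c)/a$ until it reaches zero, and the only critical value is $\lambda=|c|=\lambda_{\max}$. With more features, the same behavior holds between consecutive critical values, with the slopes changing whenever the active set changes.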

By changing $\lambda$ from $\lambda_{\max }$ to 0, we can go from a solution in which all the weights are zero to a solution in which all weights are non-zero. Unfortunately, not all subset sizes are achievable using lasso. In particular, one can show that, if $D>N$, the optimal solution can have at most $N$ variables in it, before reaching the complete set corresponding to the OLS solution of minimal $\ell_1$ norm. In Section 11.4.8, we will see that by using an $\ell_2$ regularizer as well as an $\ell_1$ regularizer (a method known as the elastic net), we can achieve sparse solutions that contain more variables than training cases. This lets us explore model sizes between $N$ and $D$.
