## Theoretical justification

Data augmentation often significantly improves performance (predictive accuracy, robustness, etc.). At first this might seem like we are getting something for nothing, since we have not provided additional data. However, the data augmentation mechanism can be viewed as a way to algorithmically inject prior knowledge.
To see this, recall that in standard ERM training, we minimize the empirical risk
$$R(f)=\int \ell(f(\boldsymbol{x}), \boldsymbol{y}) p^*(\boldsymbol{x}, \boldsymbol{y}) d \boldsymbol{x} d \boldsymbol{y}$$ where we approximate $p^*(\boldsymbol{x}, \boldsymbol{y})$ by the empirical distribution
$$p_{\mathcal{D}}(\boldsymbol{x}, \boldsymbol{y})=\frac{1}{N} \sum_{n=1}^N \delta\left(\boldsymbol{x}-\boldsymbol{x}_n\right) \delta\left(\boldsymbol{y}-\boldsymbol{y}_n\right)$$ We can think of data augmentation as replacing the empirical distribution with the following algorithmically smoothed distribution $$p_{\mathcal{D}}(\boldsymbol{x}, \boldsymbol{y} \mid A)=\frac{1}{N} \sum_{n=1}^N p\left(\boldsymbol{x} \mid \boldsymbol{x}_n, A\right) \delta\left(\boldsymbol{y}-\boldsymbol{y}_n\right)$$
where $A$ is the data augmentation algorithm, which generates a sample $\boldsymbol{x}$ from a training point $\boldsymbol{x}_n$, such that the label (“semantics”) is not changed. (A very simple example would be a Gaussian kernel, $p\left(\boldsymbol{x} \mid \boldsymbol{x}_n, A\right)=\mathcal{N}\left(\boldsymbol{x} \mid \boldsymbol{x}_n, \sigma^2 \mathbf{I}\right)$.) This has been called vicinal risk minimization [Cha+01], since we are minimizing the risk in the vicinity of each training point $\boldsymbol{x}_n$. For more details on this perspective, see [Zha+17b; CDL19; Dao+19].
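As an illustration, the Gaussian-kernel case above can be sketched in a few lines of NumPy: we draw samples from the smoothed distribution $p_{\mathcal{D}}(\boldsymbol{x}, \boldsymbol{y} \mid A)$ and use them for a Monte Carlo estimate of the vicinal risk of a fixed classifier. The toy data, the linear classifier, and $\sigma=0.1$ are illustrative assumptions, not part of the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: N points in R^2 with binary labels.
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

def sample_vicinal(X, y, sigma=0.1, rng=rng):
    """Draw one sample (x, y) from p_D(x, y | A): pick a training point
    uniformly at random, perturb it with the Gaussian kernel
    p(x | x_n, A) = N(x | x_n, sigma^2 I), and keep the label fixed."""
    n = rng.integers(len(X))
    x_aug = X[n] + sigma * rng.normal(size=X.shape[1])
    return x_aug, y[n]

# Monte Carlo estimate of the vicinal risk of a fixed linear classifier
# f(x) = 1[w^T x > 0] under 0-1 loss.
w = np.array([1.0, 1.0])
samples = [sample_vicinal(X, y) for _ in range(5000)]
risk = np.mean([(x @ w > 0) != t for x, t in samples])
```

Because the labels here are generated by the same linear rule the classifier uses, only augmented points that cross the decision boundary contribute to the risk, so the estimate is small but nonzero.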

## Fine-tuning

Suppose, for now, that we already have a pretrained classifier, $p\left(y \mid \boldsymbol{x}, \boldsymbol{\theta}_p\right)$, such as a CNN, that works well for inputs $\boldsymbol{x} \in \mathcal{X}_p$ (e.g., natural images) and outputs $y \in \mathcal{Y}_p$ (e.g., ImageNet labels), where the data comes from a distribution $p(\boldsymbol{x}, y)$ similar to the one used in training. Now we want to create a new model $q\left(y \mid \boldsymbol{x}, \boldsymbol{\theta}_q\right)$ that works well for inputs $\boldsymbol{x} \in \mathcal{X}_q$ (e.g., bird images) and outputs $y \in \mathcal{Y}_q$ (e.g., fine-grained bird labels), where the data comes from a distribution $q(\boldsymbol{x}, y)$ which may be different from $p$.

We will assume that the set of possible inputs is the same, so $\mathcal{X}_q \approx \mathcal{X}_p$ (e.g., both are RGB images), or that we can easily transform inputs from domain $p$ to domain $q$ (e.g., we can convert an RGB image to grayscale by dropping the chrominance channels and just keeping luminance). (If this is not the case, then we may need to use a method called domain adaptation, that modifies models to map between modalities, as discussed in Section 19.2.5.)

However, the output domains are usually different, i.e., $\mathcal{Y}_q \neq \mathcal{Y}_p$. For example, $\mathcal{Y}_p$ might be ImageNet labels and $\mathcal{Y}_q$ might be medical labels (e.g., types of diabetic retinopathy [Arc+19]). In this case, we need to “translate” the output of the pre-trained model to the new domain. This is easy to do with neural networks: we simply “chop off” the final layer of the original model, and add a new “head” to model the new class labels, as illustrated in Figure 19.2. For example, suppose $p\left(y \mid \boldsymbol{x}, \boldsymbol{\theta}_p\right)=\mathcal{S}\left(y \mid \mathbf{W}_2 \boldsymbol{h}\left(\boldsymbol{x} ; \boldsymbol{\theta}_1\right)+\boldsymbol{b}_2\right)$, where $\boldsymbol{\theta}_p=\left(\mathbf{W}_2, \boldsymbol{b}_2, \boldsymbol{\theta}_1\right)$. Then we can construct $q\left(y \mid \boldsymbol{x}, \boldsymbol{\theta}_q\right)=\mathcal{S}\left(y \mid \mathbf{W}_3 \boldsymbol{h}\left(\boldsymbol{x} ; \boldsymbol{\theta}_1\right)+\boldsymbol{b}_3\right)$, where $\boldsymbol{\theta}_q=\left(\mathbf{W}_3, \boldsymbol{b}_3, \boldsymbol{\theta}_1\right)$ and $\boldsymbol{h}\left(\boldsymbol{x} ; \boldsymbol{\theta}_1\right)$ is the shared nonlinear feature extractor.
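The “chop off the head” construction can be sketched with plain NumPy arrays standing in for the network; the dimensions, the tanh feature map, and the class counts below are all made-up illustrations, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

# Stand-in for the shared nonlinear feature extractor h(x; theta_1).
# In a real CNN this would be all layers up to the penultimate one.
D, H = 8, 16                                 # input and feature dims (arbitrary)
theta1 = rng.normal(size=(D, H)) * 0.1
def h(X):
    return np.tanh(X @ theta1)

# Pretrained head for the source label set Y_p (say 1000 ImageNet-style
# classes) -- this is the part that gets discarded in "model surgery".
W2, b2 = rng.normal(size=(H, 1000)) * 0.1, np.zeros(1000)

# Attach a fresh head (W3, b3) sized for the target label set Y_q.
C_q = 10
W3, b3 = rng.normal(size=(H, C_q)) * 0.1, np.zeros(C_q)

X = rng.normal(size=(4, D))
q = softmax(h(X) @ W3 + b3)                  # q(y | x, theta_q), shape (4, C_q)
```

Note that `theta1` is reused unchanged by both heads; only the final linear map is replaced.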

After performing this “model surgery”, we can fine-tune the new model with parameters $\boldsymbol{\theta}_q=\left(\boldsymbol{\theta}_1, \boldsymbol{\theta}_3\right)$, where $\boldsymbol{\theta}_1$ parameterizes the feature extractor, and $\boldsymbol{\theta}_3=\left(\mathbf{W}_3, \boldsymbol{b}_3\right)$ parameterizes the final linear layer that maps features to the new set of labels. If we treat $\boldsymbol{\theta}_1$ as “frozen parameters”, then the resulting model $q\left(y \mid \boldsymbol{x}, \boldsymbol{\theta}_q\right)$ is linear in its parameters, so we have a convex optimization problem for which many simple and efficient fitting methods exist (see Part II). This is particularly helpful in the long-tail setting, where some classes are very rare [Kan+20]. However, a linear “decoder” may be too limiting, so we can also allow $\boldsymbol{\theta}_1$ to be fine-tuned, but with a lower learning rate, to prevent its values from moving too far from those estimated on $\mathcal{D}_p$.
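A minimal sketch of the frozen-$\boldsymbol{\theta}_1$ case: with the feature extractor fixed, the features $\boldsymbol{h}(\boldsymbol{x}; \boldsymbol{\theta}_1)$ can be precomputed once, and fitting $(\mathbf{W}_3, \boldsymbol{b}_3)$ reduces to convex softmax regression, here solved with plain gradient descent. All shapes, hyperparameters, and the synthetic data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

# Frozen feature extractor: theta_1 is fixed, so features are precomputed once.
D, H, C = 5, 12, 3
theta1 = rng.normal(size=(D, H)) * 0.5
X = rng.normal(size=(200, D))
Phi = np.tanh(X @ theta1)                     # h(x; theta_1), never updated

# Synthetic labels that are realizable from the frozen features, for the demo.
w_true = rng.normal(size=(H, C))
y = (Phi @ w_true).argmax(axis=1)

W3, b3 = np.zeros((H, C)), np.zeros(C)
for _ in range(500):                          # gradient descent on the head only
    P = softmax(Phi @ W3 + b3)
    G = P.copy()
    G[np.arange(len(y)), y] -= 1.0            # gradient of NLL w.r.t. logits
    W3 -= 0.1 * Phi.T @ G / len(y)
    b3 -= 0.1 * G.mean(axis=0)

acc = (softmax(Phi @ W3 + b3).argmax(axis=1) == y).mean()
```

To fine-tune $\boldsymbol{\theta}_1$ as well, one would additionally take gradient steps on `theta1` (recomputing `Phi` each iteration) with a smaller step size than the head's, e.g. a tenth of it, matching the lower-learning-rate advice above.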
