## Low-rank and kernel methods

In this section, we discuss methods that approximate attention using low-rank matrices. [She+18; Kat+20] approximate the attention matrix $\mathbf{A}$ directly by a low-rank matrix, so that
$$A_{i j}=\phi\left(\boldsymbol{q}_i\right)^{\top} \phi\left(\boldsymbol{k}_j\right)$$
where $\phi(\boldsymbol{x}) \in \mathbb{R}^M$ is some finite-dimensional vector with $M<D$. One can leverage this structure to compute $\mathbf{A} \mathbf{V}$ in $O(N)$ time, where $N$ is the sequence length. Unfortunately, for softmax attention, $\mathbf{A}$ is not low rank.
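
The linear-time claim can be illustrated with a minimal NumPy sketch. The feature map `phi` and its projection matrix `W` are hypothetical choices made only for illustration (they are not the maps used in the cited papers), and the row normalization of attention is omitted. The point is only that, once $\mathbf{A}$ factorizes, the product can be evaluated right to left without ever forming the $N \times N$ matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M = 512, 64, 16            # sequence length, model dim, feature dim (M < D)

Q = rng.normal(size=(N, D))
K = rng.normal(size=(N, D))
V = rng.normal(size=(N, D))
W = rng.normal(size=(D, M)) / np.sqrt(D)   # hypothetical projection used by phi

def phi(X):
    # Illustrative positive feature map (assumption, not from any specific paper).
    return np.maximum(X @ W, 0.0) + 1e-3

def attention_quadratic(Q, K, V):
    A = phi(Q) @ phi(K).T        # (N, N): O(N^2 M) time, O(N^2) memory
    return A @ V                 # O(N^2 D) time

def attention_linear(Q, K, V):
    KV = phi(K).T @ V            # (M, D): O(N M D) time
    return phi(Q) @ KV           # (N, D): O(N M D) time, no N x N matrix is formed

out_quadratic = attention_quadratic(Q, K, V)
out_linear = attention_linear(Q, K, V)
max_abs_diff = np.max(np.abs(out_quadratic - out_linear))
```

Both functions compute the same quantity up to floating-point error; only the order of the matrix products differs, which is what changes the cost from quadratic to linear in $N$.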

Linformer [Wan+20a] instead transforms the keys and values via random Gaussian projections, and then applies the theory of the Johnson-Lindenstrauss Transform [AL13] to approximate softmax attention in this lower-dimensional space.
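
As a rough sketch of this idea (not the reference implementation), the code below projects the keys and values along the sequence axis with random Gaussian matrices before applying softmax attention. The names `E`, `F`, and `projected_attention`, the use of separate matrices for keys and values, and the $1/\sqrt{k}$ scaling are assumptions made for illustration.

```python
import numpy as np

def softmax(X, axis=-1):
    X = X - X.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    expX = np.exp(X)
    return expX / expX.sum(axis=axis, keepdims=True)

def projected_attention(Q, K, V, E, F):
    """Softmax attention over k projected keys/values instead of N originals."""
    D = Q.shape[-1]
    K_proj = E @ K                            # (k, D): keys compressed along the sequence axis
    V_proj = F @ V                            # (k, D): values compressed the same way
    scores = Q @ K_proj.T / np.sqrt(D)        # (N, k) rather than (N, N)
    return softmax(scores, axis=-1) @ V_proj  # (N, D)

rng = np.random.default_rng(0)
N, D, k = 256, 32, 64                         # k < N is the projected length (toy values)
Q, K, V = (rng.normal(size=(N, D)) for _ in range(3))
E = rng.normal(size=(k, N)) / np.sqrt(k)      # random Gaussian projection for keys
F = rng.normal(size=(k, N)) / np.sqrt(k)      # random Gaussian projection for values

out = projected_attention(Q, K, V, E, F)
```

The score matrix has shape $N \times k$, so for fixed $k$ the cost grows linearly with the sequence length.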

In Performer [Cho+20a; Cho+20b], they show that the attention matrix can be computed using a (positive definite) kernel function. We define kernel functions in Section 17.1, but the basic idea is that $\mathcal{K}(\boldsymbol{q}, \boldsymbol{k}) \geq 0$ is some measure of similarity between $\boldsymbol{q} \in \mathbb{R}^D$ and $\boldsymbol{k} \in \mathbb{R}^D$. For example, the Gaussian kernel, also called the radial basis function kernel, has the form
$$\mathcal{K}_{\text{gauss}}(\boldsymbol{q}, \boldsymbol{k})=\exp \left(-\frac{1}{2 \sigma^2}\|\boldsymbol{q}-\boldsymbol{k}\|_2^2\right)$$

To see how this can be used to compute an attention matrix, note that [Cho+20a] show the following:
$$A_{i, j}=\exp \left(\frac{\boldsymbol{q}_i^{\top} \boldsymbol{k}_j}{\sqrt{D}}\right)=\exp \left(\frac{-\left\|\boldsymbol{q}_i-\boldsymbol{k}_j\right\|_2^2}{2 \sqrt{D}}\right) \times \exp \left(\frac{\left\|\boldsymbol{q}_i\right\|_2^2}{2 \sqrt{D}}\right) \times \exp \left(\frac{\left\|\boldsymbol{k}_j\right\|_2^2}{2 \sqrt{D}}\right)$$
The first term in the above expression is equal to $\mathcal{K}_{\text{gauss}}\left(\boldsymbol{q}_i D^{-1/4}, \boldsymbol{k}_j D^{-1/4}\right)$ with $\sigma=1$, and the other two terms are just independent scaling factors.
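
The decomposition follows from expanding $\|\boldsymbol{q}-\boldsymbol{k}\|_2^2=\|\boldsymbol{q}\|_2^2+\|\boldsymbol{k}\|_2^2-2 \boldsymbol{q}^{\top} \boldsymbol{k}$. A small numerical check, with a hypothetical helper `gauss_kernel` and arbitrary random vectors, is sketched below.

```python
import numpy as np

def gauss_kernel(q, k, sigma=1.0):
    # Gaussian (RBF) kernel: exp(-||q - k||^2 / (2 sigma^2))
    return np.exp(-np.sum((q - k) ** 2) / (2.0 * sigma**2))

rng = np.random.default_rng(0)
D = 16
q = rng.normal(size=D)
k = rng.normal(size=D)

# Left-hand side: unnormalized softmax attention score.
lhs = np.exp(q @ k / np.sqrt(D))

# Right-hand side: Gaussian kernel on rescaled inputs times two scaling factors.
scale = D ** -0.25
rhs = (
    gauss_kernel(q * scale, k * scale, sigma=1.0)
    * np.exp(q @ q / (2.0 * np.sqrt(D)))
    * np.exp(k @ k / (2.0 * np.sqrt(D)))
)
```

The two sides agree up to floating-point error, and each scaling factor depends on only one of $\boldsymbol{q}_i$ or $\boldsymbol{k}_j$.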

So far we have not gained anything computationally. However, we will show in Section 17.2.9.3 that the Gaussian kernel can be written as the expectation of a set of random features:
$$\mathcal{K}_{\text{gauss}}(\boldsymbol{x}, \boldsymbol{y})=\mathbb{E}\left[\boldsymbol{\eta}(\boldsymbol{x})^{\top} \boldsymbol{\eta}(\boldsymbol{y})\right]$$
where $\boldsymbol{\eta}(\boldsymbol{x}) \in \mathbb{R}^M$ is a random feature vector derived from $\boldsymbol{x}$, based on either trigonometric functions (Equation (17.60)) or exponential functions (Equation (17.61)).
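
To make the expectation concrete, here is a minimal sketch using the standard random Fourier feature construction with sines and cosines. It is an assumption that this matches the trigonometric form in the referenced equation in every detail, and the names `trig_features` and `num_freqs` are hypothetical. With `num_freqs` random frequencies the feature vector has length $M = 2 \times$ `num_freqs`.

```python
import numpy as np

def trig_features(X, Omega):
    """Map rows of X (n, D) to trigonometric random features of length 2 * num_freqs."""
    proj = X @ Omega.T                               # (n, num_freqs)
    num_freqs = Omega.shape[0]
    feats = np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)
    return feats / np.sqrt(num_freqs)                # so the inner product is a sample mean

rng = np.random.default_rng(0)
D, num_freqs, sigma = 8, 20000, 1.0
Omega = rng.normal(size=(num_freqs, D)) / sigma      # rows drawn from N(0, I / sigma^2)

x = 0.5 * rng.normal(size=(1, D))
y = 0.5 * rng.normal(size=(1, D))

exact = np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma**2))
approx = (trig_features(x, Omega) @ trig_features(y, Omega).T).item()
```

The inner product equals the average of $\cos(\boldsymbol{\omega}^{\top}(\boldsymbol{x}-\boldsymbol{y}))$ over the sampled frequencies, so `approx` converges to `exact` as `num_freqs` grows.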

## Language models and unsupervised representation learning

We have discussed how RNNs and autoregressive (decoder-only) transformers can be used as language models, which are generative sequence models of the form $p\left(x_1, \ldots, x_T\right)=\prod_{t=1}^T p\left(x_t \mid \boldsymbol{x}_{1: t-1}\right)$, where each $x_t$ is a discrete token, such as a word or wordpiece. (See Section 1.5.4 for a discussion of text preprocessing methods.) The latent state of these models can then be used as a continuous vector representation of the text. That is, instead of using the one-hot vector $\boldsymbol{x}_t$, or a learned embedding of it (such as those discussed in Section 20.5), we use the hidden state $\boldsymbol{h}_t$, which depends on all the previous words in the sentence. These vectors can then be used as contextual word embeddings, for purposes such as text classification or seq2seq tasks (see e.g., [LKB20] for a review). The advantage of this approach is that we can pre-train the language model in an unsupervised way on a large corpus of text, and then fine-tune the model in a supervised way on a small labeled task-specific dataset. (This general approach is called transfer learning; see Section 19.2 for details.)
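
The factorization and the role of the hidden state can be sketched with a toy recurrent model. The weights are random and untrained, and all names (`W_xh`, `W_hh`, `W_hy`, `run`) and sizes are hypothetical. The sketch only illustrates how the log-probability accumulates term by term and how the same token receives different hidden states in different contexts.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
V, H = 4, 8                                   # vocabulary size, hidden size (toy values)
W_xh = 0.5 * rng.normal(size=(V, H))          # input weights (row lookup = one-hot product)
W_hh = 0.5 * rng.normal(size=(H, H))          # recurrent weights
W_hy = 0.5 * rng.normal(size=(H, V))          # output weights

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def run(tokens):
    """Return (log p(x_1, ..., x_T), hidden states) for a list of token ids."""
    h = np.zeros(H)                           # summarizes the tokens seen so far
    log_prob, states = 0.0, []
    for x_t in tokens:
        log_prob += log_softmax(h @ W_hy)[x_t]    # log p(x_t | x_{1:t-1})
        h = np.tanh(W_xh[x_t] + h @ W_hh)         # update the state with x_t
        states.append(h)
    return log_prob, np.stack(states)

log_p, states = run([2, 0, 3])

# Summing the joint probability over all length-3 sequences should give 1.
total = sum(np.exp(run(list(s))[0]) for s in itertools.product(range(V), repeat=3))

# The same token (id 2) gets a different hidden state after different prefixes.
_, states_a = run([1, 2])
_, states_b = run([3, 2])
```

Here `states_a[1]` and `states_b[1]` both correspond to token 2 but differ because of the preceding token, which is the sense in which the hidden states act as contextual embeddings.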

If our primary goal is to compute useful representations for transfer learning, as opposed to generating text, we can replace the generative sequence model with non-causal models that can compute a representation of a sentence, but cannot generate it. These models have the advantage that now the hidden state $\boldsymbol{h}_t$ can depend on the past, $\boldsymbol{y}_{1: t-1}$, present, $\boldsymbol{y}_t$, and future, $\boldsymbol{y}_{t+1: T}$. This can sometimes result in better representations, since it takes into account more context.
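
The difference between causal and non-causal models can be illustrated with a toy single-head self-attention layer. Using identity query, key, and value projections is a simplifying assumption, and the function name `self_attention` is hypothetical. The sketch perturbs only the final token and measures how much the state at an earlier position changes.

```python
import numpy as np

def self_attention(X, causal):
    """Toy single-head self-attention with identity projections."""
    T, D = X.shape
    scores = X @ X.T / np.sqrt(D)
    if causal:
        future = np.triu(np.ones((T, T), dtype=bool), k=1)   # True where j > t
        scores = np.where(future, -np.inf, scores)           # hide future positions
    scores = scores - scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ X

rng = np.random.default_rng(0)
T, D = 6, 4
Y = rng.normal(size=(T, D))
Y_changed = Y.copy()
Y_changed[-1] += 1.0                       # perturb only the final token y_T

t = 2                                      # inspect an earlier position
causal_shift = np.abs(
    self_attention(Y, True)[t] - self_attention(Y_changed, True)[t]
).max()
bidir_shift = np.abs(
    self_attention(Y, False)[t] - self_attention(Y_changed, False)[t]
).max()
```

With the causal mask the state at position `t` is unaffected by the change to the future token, whereas without the mask it shifts, reflecting the extra context available to non-causal models.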

In the sections below, we briefly discuss some unsupervised models for representation learning on text, using both causal and non-causal models.

