## Using and Evaluating Classifiers for Ranking

Often, the goal of training a classifier is not merely to generate exhaustive sets of ‘true’ and ‘false’ instances. For example, if we wanted to identify relevant webpages in response to a query, or to recommend items that a user is likely to purchase, in practice it may not matter whether we can identify all relevant webpages or products; rather, we might care more about whether we can surface some relevant items among the first page of results returned to a user.
Note that the type of classifiers we have developed so far can straightforwardly be used for ranking. That is, in addition to outputting a predicted label ($\delta(x_i \cdot \theta > 0)$ in the case of logistic regression), they can also output confidence scores (i.e., $x_i \cdot \theta$, or $p_\theta(y_i = 1 \mid x_i)$). Thus, in the context of finding relevant webpages or products above, our goal might be to maximize the number of relevant items returned among the few most confident predictions. Furthermore, we might be interested in how the model’s accuracy changes as a function of confidence; for example, even if the model’s accuracy is low overall, is it accurate for the top $1\%$, $5\%$, or $10\%$ of most confident predictions?
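To make the idea of "accuracy among the most confident predictions" concrete, the following is a minimal sketch in plain Python. The function name `precision_at_k` and the scores and labels are hypothetical values made up for illustration; in practice the scores would come from a trained model (e.g., $x_i \cdot \theta$).

```python
def precision_at_k(scores, labels, k):
    """Fraction of the k highest-scoring items whose true label is positive."""
    # Rank item indices by confidence score, most confident first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_k = ranked[:k]
    return sum(1 for i in top_k if labels[i]) / k

# Hypothetical confidence scores and true labels (not from any real dataset).
scores = [2.1, -0.3, 1.7, 0.4, -1.2, 0.9, -0.8, 0.1]
labels = [True, False, True, False, False, True, True, False]

for k in (1, 3, 5):
    print(k, precision_at_k(scores, labels, k))
# prints:
# 1 1.0
# 3 1.0
# 5 0.6
```

In this toy example the three most confident predictions are all relevant, even though the classifier makes mistakes further down the ranking.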

Note that neither precision nor recall is particularly meaningful if reported in isolation. For instance, it is trivial to achieve a recall of $1.0$ simply by using a classifier that returns ‘true’ for every item (in which case, all relevant documents are returned); such a classifier would of course have low precision. Likewise, a precision close to $1.0$ can often be achieved by returning ‘true’ only for a few items about which we are extremely confident; such a classifier would have low recall.
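The two degenerate classifiers described here can be checked with a short sketch. It assumes the standard definitions (precision is the fraction of predicted positives that are truly positive; recall is the fraction of true positives that are predicted positive); the labels and the helper name `precision_recall` are made up for illustration.

```python
def precision_recall(predictions, labels):
    """Return (precision, recall) for boolean predictions and labels."""
    true_pos = sum(1 for p, y in zip(predictions, labels) if p and y)
    predicted_pos = sum(1 for p in predictions if p)
    actual_pos = sum(1 for y in labels if y)
    # Guard against division by zero when nothing is predicted or relevant.
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall

# Hypothetical labels: 3 relevant items out of 10.
labels = [True, False, False, True, False, False, False, True, False, False]

always_true = [True] * len(labels)                # returns 'true' for every item
cautious = [True] + [False] * (len(labels) - 1)   # returns 'true' for one sure item

print(precision_recall(always_true, labels))  # recall 1.0, precision only 0.3
print(precision_recall(cautious, labels))     # precision 1.0, recall only 1/3
```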

As such, to evaluate a classifier in terms of precision and recall, we likely want a metric that considers both, or otherwise to place additional constraints on our classifier (as we see below).

The $F_\beta$ score achieves this by taking a weighted harmonic mean of the two quantities:
$$F_\beta=\left(1+\beta^2\right) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision}+\text{recall}} . \tag{3.27}$$
In the case of $\beta=1$ (which is normally called simply the ‘$F$-score’), Equation (3.27) computes the harmonic mean of precision and recall, which is low if either precision or recall is low.

Otherwise, if $\beta \neq 1$, the $F_\beta$ score reflects a situation where one cares about recall over precision by a factor of $\beta$.
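As a quick check of how $\beta$ shifts the balance, the formula for $F_\beta$ can be evaluated directly. The sketch below is illustrative only; the function name `f_beta` and the precision and recall values are hypothetical.

```python
def f_beta(precision, recall, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0  # convention chosen here when both quantities are zero
    return (1 + beta ** 2) * precision * recall / denom

# Hypothetical classifier with high precision but low recall.
precision, recall = 0.9, 0.3

for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_beta(precision, recall, beta), 3))
# prints:
# 0.5 0.643
# 1.0 0.45
# 2.0 0.346
```

With $\beta = 2$ the score is pulled toward the (low) recall, while with $\beta = 0.5$ it is pulled toward the (high) precision.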

## Generalization, Overfitting, and Underfitting

So far, when discussing model evaluation in Section 3.3 (and earlier in Section 2.2), we have considered training a model to predict labels $y$ from a dataset $X$; we have then evaluated the model by comparing the predictions $f\left(x_i\right)$ to the labels $y_i$. Critically, we’re using the same data to train the model as we’re using to evaluate it.

The risk in doing so is that our model may not generalize well to new data. For example, when fitting a model relating review length to ratings (as in Figures 2.4 and 2.8), we considered fitting the data with linear, quadratic, and cubic functions. Increasing the degree of the polynomial would continue to lower the errors of the predictor; alternatively, we could have modeled review length using a one-hot encoding (so that there was a different predicted value for every length). Such models could fit the data very closely (in terms of their MSE), but it is unclear whether they would capture meaningful trends in the data or simply ‘memorize’ it.

To consider an extreme case, imagine fitting a vector $y$ using only random features. Suppose we fit a vector of fifty observations using 1, 10, 25, and 50 random features, and compute the $R^2$ coefficient of each model. Here, the $R^2$ coefficients take values of 0.07, 0.25, 0.35, and 1.0: once we include fifty random features, we can fit the data perfectly. Of course, given that our features were random, this ‘fit’ is not meaningful, and the model has merely discovered random correlations between the observed data and labels.
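The random-feature experiment can be sketched with NumPy. This is a minimal reconstruction under stated assumptions, not the original listing: the seed, the Gaussian labels and features, the nested feature subsets, and the added intercept column are all choices made here, so the printed values will differ from those quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed, for repeatability
n_obs = 50
y = rng.normal(size=n_obs)                # random 'labels'
features = rng.normal(size=(n_obs, 50))   # 50 columns of pure noise

def r_squared(X, y):
    """R^2 of a least-squares fit of y on X plus an intercept column."""
    X1 = np.column_stack([np.ones(len(y)), X])
    theta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    ss_res = np.sum((y - X1 @ theta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Use the first k noise columns, so each model contains the previous one.
r2_by_k = {k: r_squared(features[:, :k], y) for k in (1, 10, 25, 50)}
for k, r2 in r2_by_k.items():
    print(k, round(r2, 3))
```

Because each model nests the previous one, the training $R^2$ can only increase as features are added, and it reaches 1 once the number of parameters matches the number of observations.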

These arguments point to two issues that need to be addressed when training a model:
(i) We should not evaluate a model on the same data that was used to train it. Rather, we should use a held-out dataset (i.e., a test set).
(ii) Features that improve performance on the training data will not necessarily improve performance on the held-out data.

Evaluating a model on held-out data gives us a sense of how well we can expect that model to work ‘in the wild.’ This held-out data, known as a test set, measures how well our model can be expected to generalize to new data.
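A held-out evaluation of this kind can be sketched as follows. The data is synthetic (a linear trend plus noise), and the seed, split sizes, and polynomial degrees are assumptions chosen for illustration; the point is only that models are fit on the training portion and scored on both portions.

```python
import numpy as np

rng = np.random.default_rng(1)   # arbitrary seed

# Synthetic data (an assumption for illustration): linear trend plus noise.
x = rng.uniform(-1, 1, size=60)
y = 2.0 * x + rng.normal(scale=0.5, size=60)

# Hold out the last third of the (already randomly ordered) samples.
x_train, y_train = x[:40], y[:40]
x_test, y_test = x[40:], y[40:]

def mse(coeffs, x, y):
    """Mean squared error of a polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 2, 3, 8):
    coeffs = np.polyfit(x_train, y_train, degree)   # fit on training data only
    results[degree] = (mse(coeffs, x_train, y_train),
                       mse(coeffs, x_test, y_test))
    print(degree, results[degree])   # (training MSE, test MSE)
```

The training MSE cannot increase as the degree grows, since each polynomial family contains the previous one; the test MSE carries no such guarantee, which is why it is the more honest estimate of performance on new data.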

