# Machine Learning (ENGG3300): The Loss

## The Loss

Every ML method uses a (more or less explicit) hypothesis space $\mathcal{H}$ which consists of all computationally feasible predictor maps $h$. Which predictor map $h$ out of all the maps in the hypothesis space $\mathcal{H}$ is the best for the ML problem at hand? To answer this question, ML methods use the concept of a loss function. Formally, a loss function is a map
$$L: \mathcal{X} \times \mathcal{Y} \times \mathcal{H} \rightarrow \mathbb{R}_{+}:((\mathbf{x}, y), h) \mapsto L((\mathbf{x}, y), h)$$
which assigns to each pair consisting of a data point, with features $\mathbf{x}$ and label $y$, and a hypothesis $h \in \mathcal{H}$ a non-negative real number $L((\mathbf{x}, y), h)$.

The loss value $L((\mathbf{x}, y), h)$ quantifies the discrepancy between the true label $y$ and the predicted label $h(\mathbf{x})$. A small (close to zero) value $L((\mathbf{x}, y), h)$ indicates a low discrepancy between the predicted label and the true label of a data point. Figure 2.11 depicts a loss function for a given data point, with features $\mathbf{x}$ and label $y$, as a function of the hypothesis $h \in \mathcal{H}$. The basic principle of ML methods can then be formulated as: Learn (find) a hypothesis out of a given hypothesis space $\mathcal{H}$ that incurs a minimum loss $L((\mathbf{x}, y), h)$ for any data point (see Chap. 4).
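As a rough illustration of this principle, the following sketch searches a tiny, finite hypothesis space for the predictor map with the smallest loss. Everything concrete in it is an assumption made for the example: the scalar features, the three candidate maps, the toy data points, the squared difference as the loss, and the use of the average loss over the data points as the quantity to minimize.

```python
from typing import Callable, Dict, List, Tuple

DataPoint = Tuple[float, float]        # (feature x, label y); scalar feature for simplicity
Hypothesis = Callable[[float], float]  # predictor map h: x -> predicted label


def loss(data_point: DataPoint, h: Hypothesis) -> float:
    """Loss L((x, y), h); the squared difference is one possible choice."""
    x, y = data_point
    return (y - h(x)) ** 2


def average_loss(data: List[DataPoint], h: Hypothesis) -> float:
    """Average of the loss over all data points (an assumed aggregation)."""
    return sum(loss(dp, h) for dp in data) / len(data)


# Hypothetical hypothesis space with three linear predictor maps.
hypothesis_space: Dict[str, Hypothesis] = {
    "h(x) = 0.5x": lambda x: 0.5 * x,
    "h(x) = 1.0x": lambda x: 1.0 * x,
    "h(x) = 2.0x": lambda x: 2.0 * x,
}

# Toy data points whose labels are roughly twice the feature value.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

# Pick the hypothesis that incurs the smallest average loss.
best_name = min(
    hypothesis_space,
    key=lambda name: average_loss(data, hypothesis_space[name]),
)
print(best_name)
```

With this toy data the search selects the map `h(x) = 2.0x`, since the labels are close to twice the feature values. Realistic hypothesis spaces are far too large to enumerate like this, which is why the computational cost of the search matters.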

Like the choice of the hypothesis space $\mathcal{H}$ used in an ML method, the loss function is a design choice. We will discuss some widely used examples of loss functions in Sects. 2.3.1 and 2.3.2. The choice of the loss function should take into account the computational complexity of searching the hypothesis space for a hypothesis with minimum loss. Consider an ML method that uses a hypothesis space parametrized by a weight vector and a loss function that is a convex and differentiable (smooth) function of the weight vector. In this case, searching for a hypothesis with small loss can be done efficiently using the gradient-based methods discussed in Chap. 5. The minimization of a loss function that is either non-convex or non-differentiable is typically computationally much more difficult. Section 4.2 discusses the computational complexities of different types of loss functions in more detail.
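To make the role of convexity and differentiability concrete, here is a minimal sketch of a gradient-based search, assuming a hypothesis space of scalar linear maps $h(x) = w x$ and the average squared difference between label and prediction as the loss. The function names, step size, number of steps, and data are illustrative choices, not taken from the text.

```python
from typing import List, Tuple


def average_squared_loss(w: float, data: List[Tuple[float, float]]) -> float:
    """Average of (y - w*x)^2 over the data; convex and differentiable in w."""
    return sum((y - w * x) ** 2 for x, y in data) / len(data)


def gradient(w: float, data: List[Tuple[float, float]]) -> float:
    """Derivative of the average squared loss with respect to w."""
    return -2.0 * sum(x * (y - w * x) for x, y in data) / len(data)


def gradient_descent(
    data: List[Tuple[float, float]],
    w0: float = 0.0,
    step_size: float = 0.05,
    num_steps: int = 200,
) -> float:
    """Repeatedly move the weight against the gradient of the loss."""
    w = w0
    for _ in range(num_steps):
        w -= step_size * gradient(w, data)
    return w


# Toy data generated by the map y = 2x, so the loss is minimized at w = 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w_hat = gradient_descent(data)
print(w_hat, average_squared_loss(w_hat, data))
```

Because the loss in this sketch is a convex quadratic in `w`, a sufficiently small constant step size drives the iterates to the minimizer. For a non-convex or non-differentiable loss, the same loop offers no such guarantee.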

## Loss Functions for Numeric Labels

For ML problems involving data points with numeric labels $y \in \mathbb{R}$, i.e., for regression problems (see Sect. 2.1.2), a widely used first choice for the loss function is the squared error loss
$$L((\mathbf{x}, y), h):=(y-\underbrace{h(\mathbf{x})}_{=\hat{y}})^2 . \tag{2.8}$$
The squared error loss (2.8) depends on the features $\mathbf{x}$ only via the predicted label value $\hat{y}=h(\mathbf{x})$. We can therefore evaluate it using only the prediction $h(\mathbf{x})$ and the true label value $y$; no other properties of the features $\mathbf{x}$ are required. We will slightly abuse notation and use the shorthand $L(y, \hat{y})$ for any loss function that depends on the features $\mathbf{x}$ only via the predicted label $\hat{y}=h(\mathbf{x})$. Figure 2.13 depicts the squared error loss as a function of the prediction error $y-\hat{y}$.
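The shorthand $L(y, \hat{y})$ translates directly into a function of two numbers. The sketch below, which uses a hypothetical predictor `h` (the sum of the features) and made-up feature vectors, shows that two data points with different features but the same prediction and the same label incur the same squared error loss.

```python
from typing import List


def squared_error_loss(y: float, y_hat: float) -> float:
    """Squared error loss L(y, y_hat) = (y - y_hat)^2."""
    return (y - y_hat) ** 2


def h(x: List[float]) -> float:
    """Hypothetical predictor map used only for this example: sum of the features."""
    return sum(x)


# Two different feature vectors that lead to the same prediction (3.0).
x_a = [1.0, 2.0]
x_b = [0.0, 3.0]
y = 4.0  # common true label

loss_a = squared_error_loss(y, h(x_a))
loss_b = squared_error_loss(y, h(x_b))
print(loss_a, loss_b)  # both are 1.0
```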

The squared error loss (2.8) has appealing computational and statistical properties. For linear predictor maps $h(\mathbf{x})=\mathbf{w}^T \mathbf{x}$, the squared error loss is a convex and differentiable function of the weight vector $\mathbf{w}$. This, in turn, allows us to search for the optimal linear predictor using efficient iterative optimization methods (see Chap. 5). The squared error loss also has a useful interpretation in terms of a probabilistic model for the features and labels. Minimizing the squared error loss is equivalent to maximum likelihood estimation within a linear Gaussian model [28, Sect. 2.6.3].
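The link to maximum likelihood can be sketched in a few lines. The following is one common way to set up a linear Gaussian model and is an assumption here, not necessarily the exact formulation of the cited reference: the labels of $m$ independent data points $(\mathbf{x}^{(i)}, y^{(i)})$ are modelled as $y^{(i)} = \mathbf{w}^T \mathbf{x}^{(i)} + \varepsilon^{(i)}$ with noise $\varepsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$ of fixed variance $\sigma^2$. The symbols $m$, $\sigma^2$, $\varepsilon^{(i)}$ and the superscript indexing are introduced only for this sketch. The log-likelihood of the weight vector is then
$$\log p\left(y^{(1)}, \ldots, y^{(m)} \mid \mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(m)} ; \mathbf{w}\right)=-\frac{m}{2} \log \left(2 \pi \sigma^2\right)-\frac{1}{2 \sigma^2} \sum_{i=1}^m\left(y^{(i)}-\mathbf{w}^T \mathbf{x}^{(i)}\right)^2 .$$
The first term does not depend on $\mathbf{w}$ and the factor $1 /\left(2 \sigma^2\right)$ is positive, so maximizing the log-likelihood over $\mathbf{w}$ is the same as minimizing the sum of squared error losses:
$$\underset{\mathbf{w}}{\operatorname{argmax}} \log p\left(y^{(1)}, \ldots, y^{(m)} \mid \mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(m)} ; \mathbf{w}\right)=\underset{\mathbf{w}}{\operatorname{argmin}} \sum_{i=1}^m\left(y^{(i)}-\mathbf{w}^T \mathbf{x}^{(i)}\right)^2 .$$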

Another loss function used in regression problems is the absolute error loss $|\hat{y}-y|$. Using this loss function to guide the learning of a predictor results in methods that are robust against a few outliers in the training set (see Sect. 3.3). However, this improved robustness comes at the expense of a higher computational complexity: minimizing the (non-differentiable) absolute error loss is harder than minimizing the (differentiable) squared error loss (2.8).
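The robustness claim can be illustrated in the simplest possible setting, assumed here purely for the example: a constant predictor $h(\mathbf{x}) = c$ fitted to a handful of made-up labels, one of which is an outlier. For constant predictors, the total squared error loss is minimized by the mean of the labels and the total absolute error loss by their median, which the sketch uses directly.

```python
from statistics import mean, median
from typing import List


def total_squared_loss(c: float, labels: List[float]) -> float:
    """Sum of squared error losses for the constant prediction c."""
    return sum((y - c) ** 2 for y in labels)


def total_absolute_loss(c: float, labels: List[float]) -> float:
    """Sum of absolute error losses for the constant prediction c."""
    return sum(abs(y - c) for y in labels)


# Made-up labels; the last one is an outlier.
labels = [1.0, 2.0, 3.0, 4.0, 100.0]

c_squared = mean(labels)     # minimizer of the total squared error loss
c_absolute = median(labels)  # minimizer of the total absolute error loss

print(c_squared, c_absolute)  # 22.0 3.0
```

The single outlier pulls the squared-error fit to 22.0, far from the bulk of the labels, while the absolute-error fit stays at 3.0.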


