## Deep Learning

Another example of a hypothesis space uses a signal-flow representation of a hypothesis map $h: \mathbb{R}^n \rightarrow \mathbb{R}$. This signal-flow representation is referred to as an artificial neural network. Figure 3.8 depicts an example of an artificial neural network that is used to represent a (parameterized) hypothesis $h^{(\mathbf{w})}: \mathbb{R}^n \rightarrow \mathbb{R}$. A feature vector $\mathbf{x} \in \mathbb{R}^n$ is fed into the input units, each of which reads in one single feature $x_j \in \mathbb{R}$. The features $x_j$ are then multiplied by the weights $w_{j, j^{\prime}}$ associated with the link between the $j$-th input node (“neuron”) and the $j^{\prime}$-th node in the middle (hidden) layer. The output of the $j^{\prime}$-th node in the hidden layer is given by $s_{j^{\prime}}=f\left(\sum_{j=1}^n w_{j, j^{\prime}} x_j\right)$ with some (typically non-linear) activation function $f: \mathbb{R} \rightarrow \mathbb{R}$. More generally, the input argument to the activation function of a node is the weighted combination $\sum_{j} w_{j, j^{\prime}} s_{j}$ of the outputs $s_j$ of the nodes in the previous layer. For the artificial neural network depicted in Fig. 3.11, the output $s_1$ of the first hidden neuron is $f(z)$ with $z=w_{1,1} x_1+w_{2,1} x_2$.
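The hidden-layer computation described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `hidden_layer`, the weight values, the layer sizes, and the choice of `np.tanh` as the activation function are all hypothetical and not taken from the text. The output layer is left out because the text does not specify it.

```python
import numpy as np

def hidden_layer(x, W, f):
    """Return the hidden-layer outputs s[j'] = f(sum_j W[j, j'] * x[j])."""
    z = W.T @ x  # z[j'] is the weighted combination feeding hidden node j'
    return f(z)

# Hypothetical example: n = 2 features and 3 hidden nodes.
x = np.array([1.0, 2.0])
W = np.array([[0.50, -1.0, 0.0],   # W[j, j'] links input node j to hidden node j'
              [0.25,  0.5, 1.0]])
s = hidden_layer(x, W, np.tanh)    # tanh is just one possible activation function
print(s)
```

Storing the weights as a matrix with one column per hidden node turns the sum over $j$ into a single matrix-vector product.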

Two popular choices for the activation function used within artificial neural networks are the sigmoid function $f(z)=\frac{1}{1+\exp (-z)}$ and the rectified linear unit (ReLU) $f(z)=\max \{0, z\}$. An artificial neural network using many, say 10, hidden layers is often referred to as a deep net. ML methods using hypothesis spaces obtained from deep nets are known as deep learning methods [7].
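As a minimal sketch, the two activation functions, the sigmoid $1/(1+\exp(-z))$ and $\max\{0, z\}$ (commonly called the rectified linear unit, ReLU), can be written with NumPy as follows. The function names and the sample inputs are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation f(z) = 1 / (1 + exp(-z)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Activation f(z) = max{0, z}, applied elementwise."""
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
sig = sigmoid(z)   # values strictly between 0 and 1, equal to 0.5 at z = 0
rel = relu(z)      # negative inputs are clipped to 0
print(sig, rel)
```

The sigmoid squashes every input into the interval $(0, 1)$, whereas $\max\{0, z\}$ passes positive inputs through unchanged and maps negative inputs to zero.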

Remarkably, using some simple non-linear activation function $f(z)$ as the building block for artificial neural networks allows us to represent an extremely large class of predictor maps $h^{(\mathbf{w})}: \mathbb{R}^n \rightarrow \mathbb{R}$. The hypothesis space generated by a given artificial neural network structure, i.e., the set of all predictor maps which can be implemented by a given artificial neural network and suitable weights $\mathbf{w}$, tends to be much larger than the hypothesis space (2.4) of linear predictors using weight vectors $\mathbf{w}$ of the same length [7, Chap. 6.4.1.]. It can be shown that an artificial neural network with only one single (but arbitrarily large) hidden layer can approximate any given map $h: \mathcal{X} \rightarrow \mathcal{Y}=\mathbb{R}$ to any desired accuracy [8]. However, a key insight which underlies many deep learning methods is that using several layers with few neurons, instead of one single layer containing many neurons, is computationally favourable [9].
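A small example may help to illustrate why non-linear activation functions enlarge the hypothesis space. For a single feature $x \in \mathbb{R}$, no linear predictor $h(x)=w x$ equals the absolute value $|x|$, yet a network with two hidden nodes using the activation $\max\{0, z\}$ represents it exactly. The sketch below assumes an output node that simply sums the hidden outputs, and the name `h_abs` is hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def h_abs(x):
    """Two hidden nodes with weights +1 and -1; the (assumed) output node sums them."""
    return relu(1.0 * x) + relu(-1.0 * x)

xs = np.linspace(-3.0, 3.0, 13)
print(np.allclose(h_abs(xs), np.abs(xs)))
```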

## Maximum Likelihood

For many applications it is useful to model the observed datapoints $\mathbf{z}^{(i)}$, with $i=1, \ldots, m$, as i.i.d. realizations of a random variable $\mathbf{z}$ with probability distribution $p(\mathbf{z} ; \mathbf{w})$. This probability distribution is parameterized in the sense of depending on a weight vector $\mathbf{w} \in \mathbb{R}^n$. A principled approach to estimating the vector $\mathbf{w}$ based on a set of i.i.d. realizations $\mathbf{z}^{(1)}, \ldots, \mathbf{z}^{(m)} \sim p(\mathbf{z} ; \mathbf{w})$ is maximum likelihood estimation [10].

Maximum likelihood estimation can be interpreted as an ML problem with a hypothesis space parameterized by the weight vector $\mathbf{w}$, i.e., each element $h^{(\mathbf{w})}$ of the hypothesis space $\mathcal{H}$ corresponds to one particular choice for the weight vector $\mathbf{w}$, and the loss function
$$L\left(\mathbf{z}, h^{(\mathbf{w})}\right):=-\log p(\mathbf{z} ; \mathbf{w}) .$$
A widely used choice for the probability distribution $p(\mathbf{z} ; \mathbf{w})$ is a multivariate normal (Gaussian) distribution with mean $\boldsymbol{\mu}$ and covariance matrix $\Sigma$, which together constitute the weight vector $\mathbf{w}=(\boldsymbol{\mu}, \Sigma)$ (after suitably reshaping the matrix $\Sigma$ into vector form). Given the i.i.d. realizations $\mathbf{z}^{(1)}, \ldots, \mathbf{z}^{(m)} \sim p(\mathbf{z} ; \mathbf{w})$, the maximum likelihood estimates $\hat{\boldsymbol{\mu}}, \widehat{\Sigma}$ of the mean vector and the covariance matrix are obtained via
$$\hat{\boldsymbol{\mu}}, \widehat{\Sigma}=\underset{\boldsymbol{\mu} \in \mathbb{R}^n, \Sigma \in \mathbb{S}_{+}^n}{\operatorname{argmin}}\ (1 / m) \sum_{i=1}^m-\log p\left(\mathbf{z}^{(i)} ;(\boldsymbol{\mu}, \Sigma)\right) .$$
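For the Gaussian model, the minimization above is known to have a closed-form solution: the sample mean and the sample covariance matrix normalized by $1/m$ (provided the latter is invertible). The following NumPy sketch computes these estimates and evaluates the average loss $-\log p(\mathbf{z} ; \mathbf{w})$, so that the estimates can be compared against perturbed parameters. The function names, the synthetic data, and the "true" parameters used to generate it are assumptions made for illustration only.

```python
import numpy as np

def gaussian_mle(Z):
    """Closed-form Gaussian ML estimates; Z has shape (m, n), one datapoint per row."""
    m = Z.shape[0]
    mu_hat = Z.mean(axis=0)
    centered = Z - mu_hat
    Sigma_hat = centered.T @ centered / m   # normalized by 1/m, not 1/(m-1)
    return mu_hat, Sigma_hat

def avg_neg_log_lik(Z, mu, Sigma):
    """Average of -log p(z; (mu, Sigma)) over the rows of Z."""
    n = Z.shape[1]
    centered = Z - mu
    _, logdet = np.linalg.slogdet(Sigma)
    # Quadratic forms (z - mu)^T Sigma^{-1} (z - mu), one per datapoint.
    quad = np.einsum("ij,ij->i", centered, np.linalg.solve(Sigma, centered.T).T)
    return 0.5 * (n * np.log(2.0 * np.pi) + logdet + quad.mean())

# Synthetic data from a hypothetical two-dimensional Gaussian.
rng = np.random.default_rng(0)
Z = rng.multivariate_normal(mean=[1.0, -1.0], cov=[[2.0, 0.5], [0.5, 1.0]], size=500)

mu_hat, Sigma_hat = gaussian_mle(Z)
loss_at_mle = avg_neg_log_lik(Z, mu_hat, Sigma_hat)
loss_shifted = avg_neg_log_lik(Z, mu_hat + 0.1, Sigma_hat)  # perturbed mean
print(loss_at_mle <= loss_shifted)
```

Any other choice of mean or covariance yields an average loss at least as large as the one at the estimates, which is what the final comparison checks for one perturbed mean.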

