## Learning

Given a data set $\mathbf{y}_{1: N}$, where each data point is assumed to be drawn independently from the model, we learn the model parameters, $\theta$, by minimizing the negative log-likelihood of the data:
$$\begin{aligned} \mathcal{L}(\theta) & =-\ln p\left(\mathbf{y}_{1: N} \mid \theta\right) \\ & =-\sum_i \ln p\left(\mathbf{y}_i \mid \theta\right) \end{aligned}$$
Note that this is a constrained optimization, since we require $a_j \geq 0$ and $\sum_j a_j=1$. Furthermore, each $\mathbf{K}_j$ must be a symmetric, positive-definite matrix in order to be a valid covariance matrix. Unfortunately, this optimization cannot be performed in closed form.
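As a rough illustration of this objective, the following sketch evaluates the negative log-likelihood of a one-dimensional mixture of Gaussians. The restriction to one dimension (scalar variances standing in for the covariances $\mathbf{K}_j$) and the names `gaussian_pdf` and `mog_nll` are assumptions made for brevity; they are not taken from the text.

```python
import math

def gaussian_pdf(y, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mog_nll(data, weights, means, variances):
    """Negative log-likelihood: minus the sum over data points of ln p(y_i | theta).

    Assumes independent data points and a 1-D mixture; evaluated directly,
    with no protection against floating-point underflow.
    """
    nll = 0.0
    for y in data:
        # p(y | theta) = sum_j a_j * N(y; mean_j, var_j)
        p = sum(a * gaussian_pdf(y, m, v)
                for a, m, v in zip(weights, means, variances))
        nll -= math.log(p)
    return nll
```

Because the data points are independent, the likelihood factorizes and the loss is a plain sum over points, which is what the loop accumulates.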

One approach is to optimize by gradient descent. There are a few issues associated with doing so. First, some care is required to avoid numerical issues, as discussed below. Second, the learning problem is constrained, because of the constraints on the values of the $a_j$'s. One solution is to project onto the constraints during optimization: at each gradient descent step (and inside the line search loop), we clamp all negative $a_j$ values to zero and renormalize the $a_j$'s so that they sum to one. Another option is to reparameterize the problem to be unconstrained. Specifically, we define new variables $\beta_j$, and define the $a_j$'s as functions of the $\beta_j$'s, e.g.,
$$a_j(\beta)=\frac{e^{\beta_j}}{\sum_{k=1}^K e^{\beta_k}}$$
This guarantees $a_j>0$ and $\sum_j a_j=1$ for any real-valued $\beta$, so gradient descent can be applied to the $\beta_j$'s directly.
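The two ways of handling the constraints on the mixing weights can be sketched as follows. The function names are hypothetical, and the uniform fallback used when every weight is clamped to zero is an assumption of this sketch; the text does not say how to handle that case.

```python
import math

def project_onto_constraints(a):
    """Clamp negative weights to zero, then renormalize so they sum to one."""
    clamped = [max(0.0, x) for x in a]
    total = sum(clamped)
    if total == 0.0:
        # Assumption: if every weight was clamped to zero, fall back to uniform.
        return [1.0 / len(a)] * len(a)
    return [x / total for x in clamped]

def weights_from_beta(beta):
    """Reparameterization: a_j = exp(beta_j) / sum_k exp(beta_k).

    Evaluated directly, which is adequate for moderate beta values only.
    """
    exps = [math.exp(b) for b in beta]
    total = sum(exps)
    return [e / total for e in exps]
```

With the projection, the optimizer works on the weights themselves and repairs them after each step. With the reparameterization, the optimizer works on unconstrained values and the weights are valid by construction.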

## Numerical issues

Exponentiating negative numbers of large magnitude often leads to underflow in floating-point arithmetic: e.g., $e^{-A}$ evaluates to zero for large $A$, and $\ln e^{-A}$ then gives an error (or $-\mathrm{Inf}$), whereas it should return $-A$. These issues will often cause machine learning algorithms to fail; MoG has several steps that are susceptible. Fortunately, there are some simple tricks that can be used.

Many computations can be performed directly in the log domain. For example, it may be more stable to compute
$$a e^b$$
as
$$e^{\ln a+b}$$

This avoids issues where $b$ is so negative that $e^b$ evaluates to zero in floating point, even though $a e^b$ itself is representable and much greater than zero.
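A small sketch of the difference, using Python's standard `math` module. The function names and the test values $a=10^{300}$, $b=-800$ are chosen here purely for illustration, and the log-domain form assumes $a>0$.

```python
import math

def scaled_exp_naive(a, b):
    """Compute a * e^b directly; e^b may underflow to zero first."""
    return a * math.exp(b)

def scaled_exp_log_domain(a, b):
    """Compute a * e^b as e^(ln a + b); requires a > 0."""
    return math.exp(math.log(a) + b)

a, b = 1e300, -800.0
naive = scaled_exp_naive(a, b)        # e^-800 underflows, so the product is lost
stable = scaled_exp_log_domain(a, b)  # the exponent ln(a) + b is about -109
```

Here the naive product collapses to zero because $e^{-800}$ is below the smallest representable double, while the combined exponent is small enough in magnitude to evaluate normally.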

When computing an expression of the form
$$\frac{e^{-\beta_j}}{\sum_k e^{-\beta_k}}$$
large values of $\beta$ can cause every exponential to underflow, so that the expression evaluates to zero (or to an undefined $0/0$) for all $j$, even though it must sum to one over $j$. This may arise, for example, when computing the $\gamma$ updates, which have the above form. The solution is to make use of the identity
$$\frac{e^{-\beta_j}}{\sum_k e^{-\beta_k}}=\frac{e^{-\beta_j+C}}{\sum_k e^{-\beta_k+C}}$$
which holds for any value of $C$, since the factor $e^C$ cancels between numerator and denominator. We can choose $C$ to prevent underflow; a suitable choice is $C=\min _j \beta_j$, which makes the largest term in the sum exactly $e^0=1$.
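The shifted form can be sketched in a few lines. The function name `normalized_neg_exp` is hypothetical; the choice $C=\min_j \beta_j$ follows the text.

```python
import math

def normalized_neg_exp(beta):
    """Return exp(-beta_j) / sum_k exp(-beta_k) for each j, shifted by C = min(beta).

    After the shift every exponent -beta_j + C is <= 0 and the largest term
    is exp(0) = 1, so the denominator is at least 1 and cannot underflow.
    """
    c = min(beta)
    shifted = [math.exp(-b + c) for b in beta]
    total = sum(shifted)
    return [s / total for s in shifted]

# Evaluated directly, every exp(-beta_j) here would underflow to zero.
gamma = normalized_neg_exp([1000.0, 1001.0, 1002.0])
```

Terms that are far smaller than the largest one may still underflow individually, but they contribute negligibly to the sum, so the result remains accurate.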

Underflow can also occur when evaluating
$$\ln \sum_j e^{-\beta_j}$$
which can be fixed by using the identity
$$\ln \sum_j e^{-\beta_j}=\left(\ln \sum_j e^{-\beta_j+C}\right)-C$$
again with a suitable choice of $C$, such as $C=\min _j \beta_j$.
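A minimal sketch of this identity follows. The function name `log_sum_neg_exp` is hypothetical, and using $C=\min_j \beta_j$ here is an assumption carried over from the normalization case above.

```python
import math

def log_sum_neg_exp(beta):
    """Compute ln(sum_j exp(-beta_j)) as ln(sum_j exp(-beta_j + C)) - C."""
    c = min(beta)  # assumed choice of C; the largest shifted term is exp(0) = 1
    return math.log(sum(math.exp(-b + c) for b in beta)) - c

# Evaluated directly, the sum would underflow to zero and the log would fail.
value = log_sum_neg_exp([1000.0, 1000.0])  # mathematically ln(2) - 1000
```

Since the shifted sum is at least 1, the logarithm is always taken of a well-behaved number, and the large offset is restored by the final subtraction.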

