# The Dirichlet Distribution and Sparsity

A symmetric Dirichlet distribution (Section 2.2.1) is parametrized by a single hyperparameter $\alpha>0$. It is the special case of the Dirichlet distribution in which every entry of the hyperparameter vector equals $\alpha$. When $\alpha$ is chosen such that $\alpha<1$, a point $x \in \mathbb{R}^K$ drawn from the respective Dirichlet tends to have most of its coordinates close to 0, with only a few coordinates significantly larger than zero.

The intuition behind this property of the symmetric Dirichlet distribution can be understood by inspecting the main term in the density of the Dirichlet distribution: $\prod_{i=1}^K \theta_i^{\alpha-1}$. When $\alpha<1$, this product becomes $\frac{1}{\prod_{i=1}^K \theta_i^\beta}$ for $0<\beta=1-\alpha$. Clearly, this product becomes very large if one of the $\theta_i$ is close to 0. If many of the $\theta_i$ are close to 0, the effect compounds, which makes the product even larger. Most of the density of the symmetric Dirichlet with $\alpha<1$ is therefore concentrated around points in the probability simplex where the majority of the $\theta_i$ are close to 0.
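The effect of the exponent $\alpha-1$ can be checked with a small calculation. The following is a minimal sketch, not part of the original text: the helper name `log_main_term` and the two example points are illustrative assumptions. It evaluates the logarithm of the unnormalized term $\prod_{i=1}^K \theta_i^{\alpha-1}$ at a uniform point and at a nearly sparse point of the simplex.

```python
import math

def log_main_term(theta, alpha):
    """Log of prod_i theta_i^(alpha - 1), the unnormalized main term
    of the symmetric Dirichlet density (normalizing constant omitted)."""
    return (alpha - 1.0) * sum(math.log(t) for t in theta)

K = 5
uniform_point = [1.0 / K] * K                   # every coordinate equals 0.2
sparse_point = [0.96, 0.01, 0.01, 0.01, 0.01]   # four coordinates close to 0

# Sparse regime (alpha < 1): the nearly sparse point scores higher.
alpha_sparse = 0.1
log_uniform = log_main_term(uniform_point, alpha_sparse)
log_sparse = log_main_term(sparse_point, alpha_sparse)
print("alpha=0.1:", log_uniform, log_sparse)

# Dense regime (alpha > 1): the ordering flips.
alpha_dense = 2.0
print("alpha=2.0:",
      log_main_term(uniform_point, alpha_dense),
      log_main_term(sparse_point, alpha_dense))
```

Working in log space avoids overflow when some $\theta_i$ are very small. Because the normalizing constant is the same for both points, it does not affect the comparison.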

This property of the symmetric Dirichlet has been exploited consistently in the Bayesian NLP literature. For example, Goldwater and Griffiths (2007) defined a Bayesian part-of-speech tagging model based on hidden Markov models (Chapter 8), in which they placed a Dirichlet prior over the multinomials for the transition probabilities and emission probabilities of a trigram hidden Markov model.

For the first set of experiments, Goldwater and Griffiths used a fixed sparse hyperparameter for all transition probabilities and a fixed, different hyperparameter for all emission probabilities. Their findings show that choosing a small value for the transition hyperparameter (0.03) together with a hyperparameter of 1 for the emission probabilities achieves the best prediction accuracy for the part-of-speech tags. This means that the optimal transition multinomials are likely to be very sparse. This is not surprising, since only a small number of part-of-speech tags can appear in a given context. However, an emission hyperparameter of 1 means that the Dirichlet distribution is simply a uniform distribution. The authors argued that a sparse prior was not very useful for the emission probabilities because all emission probabilities shared the same hyperparameter.

## Gamma Representation of the Dirichlet

The Dirichlet distribution can be reduced to a representation in terms of the Gamma distribution. This representation does not contribute directly to better modeling, but it helps to demonstrate the limitations of the Dirichlet distribution and suggests alternatives to it (such as the one described in the next section).

Let $\mu_i \sim \Gamma\left(\alpha_i, 1\right)$, for $i \in\{1, \ldots, K\}$, be $K$ independent random variables, each distributed according to the Gamma distribution with shape $\alpha_i>0$ and scale 1 (see also Appendix B). Then, the definition
$$\theta_i=\frac{\mu_i}{\sum_{j=1}^K \mu_j}, \tag{3.12}$$
for $i \in\{1, \ldots, K\}$ yields a random vector $\theta$ from the probability simplex of dimension $K-1$, such that $\theta$ is distributed according to the Dirichlet distribution with hyperparameters $\alpha=\left(\alpha_1, \ldots, \alpha_K\right)$.
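The normalization above also gives a simple way to sample from a Dirichlet distribution. The following is a minimal sketch using only the Python standard library, where `random.gammavariate(alpha, beta)` takes a shape and a scale. The function names, the threshold of 0.01, and the sample sizes are illustrative assumptions, not values from the text. The sketch also estimates how often a coordinate lands near 0 for a sparse ($\alpha<1$) and a dense ($\alpha>1$) symmetric Dirichlet.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw theta ~ Dirichlet(alphas) by normalizing independent
    Gamma(alpha_i, scale=1) draws."""
    mus = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(mus)
    return [m / total for m in mus]

def fraction_near_zero(alpha, K=10, n_samples=1000, threshold=0.01, seed=0):
    """Estimate the fraction of coordinates below `threshold` across
    draws from a symmetric Dirichlet with hyperparameter alpha."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_samples):
        theta = sample_dirichlet([alpha] * K, rng)
        count += sum(1 for t in theta if t < threshold)
    return count / (n_samples * K)

sparse_fraction = fraction_near_zero(0.1)    # alpha < 1: many coordinates near 0
dense_fraction = fraction_near_zero(10.0)    # alpha > 1: almost none near 0
print(sparse_fraction, dense_fraction)
```

Under these assumptions, the first fraction should be large and the second close to zero, which matches the sparsity behavior described for $\alpha<1$.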

The representation of the Dirichlet as independent, normalized Gamma variables explains a limitation inherent to the Dirichlet distribution: there is no explicit parametrization of the rich structure of relationships between the coordinates of $\theta$. For example, given $i \neq j$, the ratio $\theta_i / \theta_j$, when treated as a random variable, is independent of any other ratio $\theta_k / \theta_{\ell}$ calculated from two other coordinates, $k \neq \ell$. (This is evident from Equation 3.12: the ratio $\theta_i / \theta_j$ is $\mu_i / \mu_j$, where all $\mu_i$ for $i \in\{1, \ldots, K\}$ are independent.) Therefore, the Dirichlet distribution is not a good modeling choice when the $\theta$ parameters are better modeled with even a weak degree of dependence.
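The independence of ratios over disjoint coordinate pairs can be probed empirically. The following is a rough sketch under stated assumptions: the hyperparameters `[2.0, 3.0, 4.0, 5.0]`, the sample size, and the helper names are illustrative choices. It estimates the Pearson correlation between $\log(\theta_1/\theta_2)$ and $\log(\theta_3/\theta_4)$. A near-zero correlation is consistent with independence but does not prove it.

```python
import math
import random

def sample_dirichlet(alphas, rng):
    """Draw theta ~ Dirichlet(alphas) by normalizing independent
    Gamma(alpha_i, scale=1) draws."""
    mus = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(mus)
    return [m / total for m in mus]

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(0)
alphas = [2.0, 3.0, 4.0, 5.0]   # illustrative hyperparameters
log_r12, log_r34 = [], []
for _ in range(20000):
    theta = sample_dirichlet(alphas, rng)
    # theta_i / theta_j equals mu_i / mu_j: the normalizer cancels.
    log_r12.append(math.log(theta[0] / theta[1]))
    log_r34.append(math.log(theta[2] / theta[3]))

corr = pearson(log_r12, log_r34)
print(corr)
```

Logarithms of the ratios are used only to tame heavy tails in the estimate. A model that needs these ratios to co-vary cannot express that through the Dirichlet hyperparameters alone.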


