
# Multivariate Statistical Analysis | STAT5610 Detecting Outliers and Cleaning Data


## Detecting Outliers and Cleaning Data

Most data sets contain one or a few unusual observations that do not seem to belong to the pattern of variability produced by the other observations. With data on a single characteristic, unusual observations are those that are either very large or very small relative to the others. The situation can be more complicated with multivariate data. Before we address the issue of identifying these outliers, we must emphasize that not all outliers are wrong numbers. They may, justifiably, be part of the group and may lead to a better understanding of the phenomena being studied.
Outliers are best detected visually whenever this is possible. When the number of observations $n$ is large, dot plots are not feasible. When the number of characteristics $p$ is large, the large number of scatter plots $p(p-1) / 2$ may prevent viewing them all. Even so, we suggest first visually inspecting the data whenever possible.
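
As a rough illustration of how quickly the number of scatter plots grows, the short sketch below enumerates the variable pairs. The variable names are hypothetical placeholders, not taken from the text.

```python
from itertools import combinations

def scatter_plot_pairs(variable_names):
    """Return every unordered pair of variables, one pair per scatter plot."""
    return list(combinations(variable_names, 2))

# Hypothetical variable names, used only for illustration.
names = ["x1", "x2", "x3", "x4", "x5"]
pairs = scatter_plot_pairs(names)
print(len(pairs), "scatter plots for p =", len(names))

# The count p(p-1)/2 grows quadratically in the number of variables p.
for p in (5, 10, 30):
    print(p, p * (p - 1) // 2)
```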

What should we look for? For a single random variable, the problem is one dimensional, and we look for observations that are far from the others. For instance, the dot diagram (not reproduced here) reveals a single large observation.
In the bivariate case, the situation is more complicated. Figure 4.10 on page 201 shows a situation with two unusual observations.

The data point circled in the upper right corner of the figure is removed from the pattern, and its second coordinate is large relative to the rest of the $x_{2}$ measurements, as shown by the vertical dot diagram. The second outlier, also circled, is far from the elliptical pattern of the rest of the points, but, separately, each of its components has a typical value. This outlier cannot be detected by inspecting the marginal dot diagrams.

In higher dimensions, there can be outliers that cannot be detected from the univariate plots or even the bivariate scatter plots. Here a large value of $\left(\mathbf{x}_{j}-\overline{\mathbf{x}}\right)^{\prime} \mathbf{S}^{-1}\left(\mathbf{x}_{j}-\overline{\mathbf{x}}\right)$ will suggest an unusual observation, even though it cannot be seen visually.
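
To see why such a point can hide from the marginal plots, it may help to write the distance out for $p=2$. Writing $z_{1}, z_{2}$ for the standardized components of an observation and $r$ for the sample correlation (notation introduced here for illustration), inverting the $2 \times 2$ matrix $\mathbf{S}$ gives

$$
\left(\mathbf{x}-\overline{\mathbf{x}}\right)^{\prime} \mathbf{S}^{-1}\left(\mathbf{x}-\overline{\mathbf{x}}\right)=\frac{z_{1}^{2}-2 r z_{1} z_{2}+z_{2}^{2}}{1-r^{2}} .
$$

With made-up values $r=0.9$, $z_{1}=1.5$, and $z_{2}=-1.5$, the distance is $(2.25+4.05+2.25) / 0.19=45$. Each component lies only 1.5 standard deviations from its mean, yet the point runs against the strong positive correlation, so its generalized distance is very large.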

## Steps for Detecting Outliers

1. Make a dot plot for each variable.
2. Make a scatter plot for each pair of variables.
3. Calculate the standardized values $z_{j k}=\left(x_{j k}-\bar{x}_{k}\right) / \sqrt{s_{k k}}$ for $j=1,2, \ldots, n$ and each column $k=1,2, \ldots, p$. Examine these standardized values for large or small values.
4. Calculate the generalized squared distances $d_{j}^{2}=\left(\mathbf{x}_{j}-\overline{\mathbf{x}}\right)^{\prime} \mathbf{S}^{-1}\left(\mathbf{x}_{j}-\overline{\mathbf{x}}\right)$. Examine these distances for unusually large values. In a chi-square plot, these would be the points farthest from the origin.
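
Steps 3 and 4 can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function names and the small two-variable data set are invented for the example, and $\mathbf{S}$ is taken to be the sample covariance matrix with divisor $n-1$. The last row of the made-up data departs from the pattern of the other rows.

```python
from statistics import mean

def column_means(data):
    """Sample mean of each column of an n x p list of rows."""
    return [mean(col) for col in zip(*data)]

def sample_covariance(data):
    """Sample covariance matrix S (divisor n - 1)."""
    n, xbar = len(data), column_means(data)
    p = len(xbar)
    S = [[0.0] * p for _ in range(p)]
    for row in data:
        dev = [x - m for x, m in zip(row, xbar)]
        for i in range(p):
            for k in range(p):
                S[i][k] += dev[i] * dev[k] / (n - 1)
    return S

def solve(A, b):
    """Solve A y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * y[k] for k in range(r + 1, n))
        y[r] = (M[r][n] - tail) / M[r][r]
    return y

def standardized_values(data):
    """Step 3: z_jk = (x_jk - xbar_k) / sqrt(s_kk)."""
    xbar, S = column_means(data), sample_covariance(data)
    return [[(x - xbar[k]) / S[k][k] ** 0.5 for k, x in enumerate(row)]
            for row in data]

def squared_distances(data):
    """Step 4: d_j^2 = (x_j - xbar)' S^{-1} (x_j - xbar)."""
    xbar, S = column_means(data), sample_covariance(data)
    out = []
    for row in data:
        dev = [x - m for x, m in zip(row, xbar)]
        y = solve(S, dev)  # y = S^{-1} (x_j - xbar), without forming S^{-1}
        out.append(sum(d * v for d, v in zip(dev, y)))
    return out

# Made-up data: nine points near the line x2 = x1, plus one point off it.
data = [(1, 1.5), (2, 1.8), (3, 3.4), (4, 3.9), (5, 5.2),
        (6, 5.8), (7, 7.3), (8, 7.9), (9, 9.1), (7, 3.0)]
z = standardized_values(data)
d2 = squared_distances(data)
for j, (zj, dj) in enumerate(zip(z, d2), start=1):
    print(j, [round(v, 2) for v in zj], round(dj, 2))
```

In this toy example the last observation has unremarkable standardized values in both columns but the largest squared distance, which is the situation step 4 is meant to catch. Solving the linear system rather than inverting $\mathbf{S}$ explicitly is a design choice of the sketch, not something the text prescribes.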

In step 3, “large” must be interpreted relative to the sample size and number of variables. There are $n \times p$ standardized values. When $n=100$ and $p=5$, there are 500 values. You expect 1 or 2 of these to exceed 3 or be less than $-3$, even if the data came from a multivariate distribution that is exactly normal. As a guideline, 3.5 might be considered large for moderate sample sizes.
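
The "1 or 2" figure can be checked with a short calculation. The sketch below treats each standardized value as exactly standard normal, which is only an approximation for sample-standardized data, and the function name is made up for the example.

```python
from statistics import NormalDist

def expected_exceedances(n, p, cutoff):
    """Expected number of the n*p standardized values with |z| > cutoff,
    assuming each value behaves like a standard normal draw."""
    two_sided_tail = 2 * (1 - NormalDist().cdf(cutoff))
    return n * p * two_sided_tail

for cutoff in (3.0, 3.5):
    print(cutoff, round(expected_exceedances(100, 5, cutoff), 2))
```

Under this approximation, roughly 1.3 of the 500 values are expected beyond $\pm 3$, while well under one value is expected beyond $\pm 3.5$, which is consistent with using 3.5 as the guideline.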

In step 4, “large” is measured by an appropriate percentile of the chi-square distribution with $p$ degrees of freedom. If the sample size is $n=100$, we would expect 5 observations to have values of $d_{j}^{2}$ that exceed the upper fifth percentile of the chi-square distribution. A more extreme percentile must serve to determine observations that do not fit the pattern of the remaining data.
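
A cutoff for step 4 can be computed without external libraries, as in the sketch below. It evaluates the chi-square distribution function through the standard series for the regularized incomplete gamma function and finds the percentile by bisection. The function names, the choice $p=5$, and the more extreme level 0.005 are assumptions made for illustration. In practice a statistics library would normally supply the percentile.

```python
import math

def chi2_cdf(x, df):
    """Chi-square CDF via the series for the regularized lower
    incomplete gamma function P(df/2, x/2)."""
    if x <= 0:
        return 0.0
    a, half = df / 2.0, x / 2.0
    term, total, k = 1.0, 1.0, 1
    while term > 1e-15 * total:
        term *= half / (a + k)
        total += term
        k += 1
    log_front = a * math.log(half) - half - math.lgamma(a + 1.0)
    return math.exp(log_front) * total

def chi2_upper_percentile(alpha, df):
    """Value exceeded with probability alpha, found by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < 1 - alpha:
        hi *= 2
    for _ in range(200):
        mid = (lo + hi) / 2
        if chi2_cdf(mid, df) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

n, p = 100, 5  # illustrative sample size and number of variables
for alpha in (0.05, 0.005):
    cutoff = chi2_upper_percentile(alpha, p)
    print(alpha, round(cutoff, 3), "expected exceedances:", n * alpha)
```

With $n=100$, about 5 distances are expected above the upper 5% point even when nothing is wrong, whereas only about 0.5 are expected above the upper 0.5% point, so a distance beyond the more extreme cutoff is a more credible signal.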

The data we presented in Table 4.3 concerning lumber have already been cleaned up somewhat. Similar data sets from the same study also contained data on $x_{5}=$ tensile strength. Nine observation vectors, out of the total of 112, are given as rows in the following table (not reproduced here), along with their standardized values.

