## Detecting Outliers and Cleaning Data

Most data sets contain one or a few unusual observations that do not seem to belong to the pattern of variability produced by the other observations. With data on a single characteristic, unusual observations are those that are either very large or very small relative to the others. The situation can be more complicated with multivariate data. Before we address the issue of identifying these outliers, we must emphasize that not all outliers are wrong numbers. They may, justifiably, be part of the group and may lead to a better understanding of the phenomena being studied.
Outliers are best detected visually whenever this is possible. When the number of observations $n$ is large, dot plots are not feasible. When the number of characteristics $p$ is large, the large number of scatter plots $p(p-1) / 2$ may prevent viewing them all. Even so, we suggest first visually inspecting the data whenever possible.

What should we look for? For a single random variable, the problem is one dimensional, and we look for observations that are far from the others. For instance, the dot diagram reveals a single large observation.

In the bivariate case, the situation is more complicated. Figure 4.10 on page 201 shows a situation with two unusual observations.

The data point circled in the upper right corner of the figure is removed from the pattern, and its second coordinate is large relative to the rest of the $x_{2}$ measurements, as shown by the vertical dot diagram. The second outlier, also circled, is far from the elliptical pattern of the rest of the points, but, separately, each of its components has a typical value. This outlier cannot be detected by inspecting the marginal dot diagrams.

In higher dimensions, there can be outliers that cannot be detected from the univariate plots or even the bivariate scatter plots. Here a large value of $\left(\mathbf{x}_{j}-\overline{\mathbf{x}}\right)^{\prime} \mathbf{S}^{-1}\left(\mathbf{x}_{j}-\overline{\mathbf{x}}\right)$ will suggest an unusual observation, even though it cannot be seen visually.
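To see why this distance can flag a point whose individual coordinates look typical, it may help to write it out for $p=2$. The symbols $z_{1}$, $z_{2}$ (the standardized coordinates of the point) and $r$ (the sample correlation) are shorthand introduced here for illustration, and the numbers are made up:

$$
\left(\mathbf{x}-\overline{\mathbf{x}}\right)^{\prime} \mathbf{S}^{-1}\left(\mathbf{x}-\overline{\mathbf{x}}\right)=\frac{z_{1}^{2}-2 r z_{1} z_{2}+z_{2}^{2}}{1-r^{2}}, \qquad z_{k}=\frac{x_{k}-\bar{x}_{k}}{\sqrt{s_{k k}}}, \quad r=\frac{s_{12}}{\sqrt{s_{11} s_{22}}}
$$

With $r=0.9$, a point with $z_{1}=1.5$ and $z_{2}=-1.5$ has unremarkable marginal values, yet its squared distance is $(2.25+4.05+2.25) / 0.19=45$. The cross term is what penalizes a point that goes against the direction of the correlation.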

## Steps for Detecting Outliers

1. Make a dot plot for each variable.
2. Make a scatter plot for each pair of variables.
3. Calculate the standardized values $z_{j k}=\left(x_{j k}-\bar{x}_{k}\right) / \sqrt{s_{k k}}$ for $j=1,2, \ldots, n$ and each column $k=1,2, \ldots, p$. Examine these standardized values for large or small values.
4. Calculate the generalized squared distances $\left(\mathbf{x}_{j}-\overline{\mathbf{x}}\right)^{\prime} \mathbf{S}^{-1}\left(\mathbf{x}_{j}-\overline{\mathbf{x}}\right)$. Examine these distances for unusually large values. In a chi-square plot, these would be the points farthest from the origin.
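Steps 3 and 4 can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the helper names (`standardized_values`, `generalized_sq_distances`, and so on) and the small data set `X` are invented for the example, and $\mathbf{S}$ is assumed to be the sample covariance matrix with divisor $n-1$. The last row of `X` is deliberately placed off the diagonal pattern of the other points while keeping both coordinates well inside their ranges.

```python
import math

def column_means(X):
    n = len(X)
    return [sum(row[k] for row in X) / n for k in range(len(X[0]))]

def sample_covariance(X):
    """Sample covariance matrix S (divisor n - 1, an assumption of this sketch)."""
    n, p = len(X), len(X[0])
    xbar = column_means(X)
    S = [[0.0] * p for _ in range(p)]
    for row in X:
        d = [row[k] - xbar[k] for k in range(p)]
        for i in range(p):
            for k in range(p):
                S[i][k] += d[i] * d[k] / (n - 1)
    return S

def solve(A, b):
    """Solve A y = b by Gaussian elimination with partial pivoting."""
    p = len(b)
    M = [list(A[i]) + [b[i]] for i in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, p):
            f = M[r][c] / M[c][c]
            for k in range(c, p + 1):
                M[r][k] -= f * M[c][k]
    y = [0.0] * p
    for i in reversed(range(p)):
        tail = sum(M[i][k] * y[k] for k in range(i + 1, p))
        y[i] = (M[i][p] - tail) / M[i][i]
    return y

def standardized_values(X):
    """Step 3: z_jk = (x_jk - xbar_k) / sqrt(s_kk)."""
    xbar = column_means(X)
    S = sample_covariance(X)
    p = len(xbar)
    return [[(row[k] - xbar[k]) / math.sqrt(S[k][k]) for k in range(p)]
            for row in X]

def generalized_sq_distances(X):
    """Step 4: d_j^2 = (x_j - xbar)' S^{-1} (x_j - xbar)."""
    xbar = column_means(X)
    S = sample_covariance(X)
    out = []
    for row in X:
        d = [row[k] - xbar[k] for k in range(len(xbar))]
        y = solve(S, d)  # y = S^{-1} d, without forming the inverse
        out.append(sum(di * yi for di, yi in zip(d, y)))
    return out

# Made-up bivariate data: ten points near the line x2 = x1, plus one
# point (the last row) that lies off that pattern.
X = [[1, 1.2], [2, 1.8], [3, 3.1], [4, 3.9], [5, 5.2], [6, 5.8],
     [7, 7.1], [8, 7.9], [9, 9.2], [10, 9.8], [7, 4.0]]

Z = standardized_values(X)
d2 = generalized_sq_distances(X)
most_unusual = max(range(len(X)), key=lambda j: d2[j])

print("standardized values of last row:", [round(z, 2) for z in Z[-1]])
print("index with largest squared distance:", most_unusual)
```

In this toy data set the last row has standardized values well inside $\pm 1$, so step 3 alone would not flag it, yet it has the largest generalized squared distance. A useful sanity check on any implementation is that the distances sum to $(n-1) p$ when $\mathbf{S}$ uses the divisor $n-1$.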

In step 3, “large” must be interpreted relative to the sample size and number of variables. There are $n \times p$ standardized values. When $n=100$ and $p=5$, there are 500 values. You expect 1 or 2 of these to exceed 3 or be less than $-3$, even if the data came from a multivariate distribution that is exactly normal. As a guideline, $3.5$ might be considered large for moderate sample sizes.
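The "1 or 2" figure can be checked with a short calculation. The sketch below treats the $n \times p$ standardized values as if each were exactly standard normal, which is only an approximation because they are computed from the sample mean and sample variance; the function name `two_sided_normal_tail` is introduced here for illustration.

```python
import math

def two_sided_normal_tail(c):
    """P(|Z| > c) for a standard normal Z, via the complementary error function."""
    return math.erfc(c / math.sqrt(2.0))

n, p = 100, 5
n_values = n * p  # 500 standardized values

# Expected number of values beyond each threshold under exact normality.
expected_beyond_3 = n_values * two_sided_normal_tail(3.0)
expected_beyond_3_5 = n_values * two_sided_normal_tail(3.5)

print(f"expected |z| > 3.0: {expected_beyond_3:.2f}")
print(f"expected |z| > 3.5: {expected_beyond_3_5:.2f}")
```

Under these assumptions about 1.35 of the 500 values are expected beyond $\pm 3$, but only about 0.23 beyond $\pm 3.5$, which is consistent with treating 3.5 as the more informative threshold.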

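In step 4, “large” is measured by an appropriate percentile of the chi-square distribution with $p$ degrees of freedom. If the sample size is $n=100$, we would expect about 5 observations to have generalized squared distances $d_{j}^{2}=\left(\mathbf{x}_{j}-\overline{\mathbf{x}}\right)^{\prime} \mathbf{S}^{-1}\left(\mathbf{x}_{j}-\overline{\mathbf{x}}\right)$ that exceed the upper fifth percentile of the chi-square distribution, even when nothing is wrong with the data. A more extreme percentile must therefore serve to identify observations that do not fit the pattern of the remaining data.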
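As a rough illustration of how such cutoffs might be computed, the sketch below evaluates the chi-square upper tail for integer degrees of freedom using only the standard library and inverts it by bisection. The function names are invented for this example, and the choices $p=5$ and an upper 0.5% point as the "more extreme percentile" are assumptions made for illustration, not values given in the text.

```python
import math

def chi2_sf(x, p):
    """Upper-tail probability P(chi2_p > x) for integer degrees of freedom p >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed-form starting values: p = 1 (odd case) or p = 2 (even case).
    if p % 2 == 1:
        q, k = math.erfc(math.sqrt(half)), 1
    else:
        q, k = math.exp(-half), 2
    # Recurrence: Q(x; k + 2) = Q(x; k) + (x/2)^(k/2) e^(-x/2) / Gamma(k/2 + 1).
    while k < p:
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q

def chi2_upper_quantile(alpha, p, tol=1e-10):
    """Value c with P(chi2_p > c) = alpha, found by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_sf(hi, p) > alpha:
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if chi2_sf(mid, p) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

n, p = 100, 5  # n from the text; p = 5 is an illustrative assumption

cutoff_05 = chi2_upper_quantile(0.05, p)    # upper 5% point
cutoff_005 = chi2_upper_quantile(0.005, p)  # a more extreme, illustrative choice

print(f"upper 5% point:   {cutoff_05:.2f}  (expect about {n * 0.05:.1f} exceedances)")
print(f"upper 0.5% point: {cutoff_005:.2f}  (expect about {n * 0.005:.1f} exceedances)")
```

For $p=5$ the upper 5% point is about 11.07 and the upper 0.5% point about 16.75. With $n=100$, exceeding the first is routine, whereas exceeding the second is expected for roughly one observation in every two such samples, so it is a more useful signal.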

The data we presented in Table 4.3 concerning lumber have already been cleaned up somewhat. Similar data sets from the same study also contained data on $x_{5}$ = tensile strength. Nine observation vectors, out of the total of 112, are given as rows in the following table, along with their standardized values.

