If you had to put a number (say, between 0 and 1) on the strength of the linear association between house prices and sizes in Figure 4.1, what would it be? Your measure shouldn’t depend on the choice of units for the variables. Zillow could have reported the house sizes in square meters and the price in thousands of dollars, but regardless of the units, the scatterplot would look the same. When we change units, the direction, form, and strength won’t change, so neither should our measure of the association’s (linear) strength.

We saw a way to remove the units in the previous chapter. We can standardize each of the variables, finding $z_{x}=\frac{x-\bar{x}}{s_{x}}$ and $z_{y}=\frac{y-\bar{y}}{s_{y}}$. With these, we can compute a measure of strength that you’ve probably heard of, the correlation coefficient:
$$r=\frac{\sum z_{x} z_{y}}{n-1}$$
Keep in mind that the $x$’s and $y$’s are paired. For each house we have a price and a living area. To find the correlation we multiply each standardized value by the standardized value it is paired with and add up those cross products. We divide the total by the number of pairs minus one, $n-1$.
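The calculation just described can be sketched in a few lines of Python. The house sizes and prices below are made-up values for illustration only (they are not the Zillow data behind Figure 4.1), and the function names `standardize` and `correlation` are our own. The sketch also recomputes $r$ after converting square feet to square meters and dollars to thousands of dollars, to illustrate that a change of units should leave the correlation unchanged.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (divides by n - 1)
    return [(v - m) / s for v in values]


def correlation(xs, ys):
    """Correlation as the sum of paired z-score products over n - 1."""
    if len(xs) != len(ys):
        raise ValueError("x and y must be paired: same number of values")
    zx = standardize(xs)
    zy = standardize(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)


# Hypothetical data: living area (square feet) and price (dollars).
sizes_sqft = [1100, 1500, 1800, 2400, 3000]
prices_usd = [150000, 210000, 230000, 330000, 390000]

r = correlation(sizes_sqft, prices_usd)

# Same houses, different units: square meters and thousands of dollars.
sizes_sqm = [s * 0.092903 for s in sizes_sqft]
prices_k = [p / 1000 for p in prices_usd]

r_converted = correlation(sizes_sqm, prices_k)

print(f"r in original units:  {r:.4f}")
print(f"r in converted units: {r_converted:.4f}")
```

The two printed values should agree up to floating-point rounding, because standardizing removes the units before the cross products are formed.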

There are alternative formulas for the correlation in terms of the variables $x$ and $y$. Here are two of the more common:
$$r=\frac{\sum(x-\bar{x})(y-\bar{y})}{\sqrt{\sum(x-\bar{x})^{2} \sum(y-\bar{y})^{2}}}=\frac{\sum(x-\bar{x})(y-\bar{y})}{(n-1) s_{x} s_{y}}$$
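To see why these forms agree with the $z$-score definition, here is a short derivation. It assumes that $s_{x}$ and $s_{y}$ are the sample standard deviations computed with $n-1$ in the denominator, so that $s_{x}=\sqrt{\sum(x-\bar{x})^{2} /(n-1)}$ and likewise for $s_{y}$. Substituting the definitions of $z_{x}$ and $z_{y}$ gives the second form:
$$r=\frac{\sum z_{x} z_{y}}{n-1}=\frac{1}{n-1} \sum \frac{(x-\bar{x})}{s_{x}} \cdot \frac{(y-\bar{y})}{s_{y}}=\frac{\sum(x-\bar{x})(y-\bar{y})}{(n-1) s_{x} s_{y}}$$
Replacing $s_{x}$ and $s_{y}$ in the denominator with their definitions then gives the first form:
$$(n-1) s_{x} s_{y}=(n-1) \sqrt{\frac{\sum(x-\bar{x})^{2}}{n-1}} \sqrt{\frac{\sum(y-\bar{y})^{2}}{n-1}}=\sqrt{\sum(x-\bar{x})^{2} \sum(y-\bar{y})^{2}}$$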

Correlation measures the strength of the linear association between two quantitative variables. Before you use correlation, you must check three conditions:

• Quantitative Variables Condition: Correlation applies only to quantitative variables. Don’t apply correlation to categorical data masquerading as quantitative. Check that you know the variables’ units and what they measure.
• Linearity Condition: Sure, you can calculate a correlation coefficient for any pair of variables. But correlation measures the strength only of the linear association and will be misleading if the relationship is not straight enough. What is “straight enough”? This question may sound too informal for a statistical condition, but that’s really the point. We can’t verify whether a relationship is linear or not. Very few relationships between variables are perfectly linear, even in theory, and scatterplots of real data are never perfectly straight. How nonlinear looking would the scatterplot have to be to fail the condition? This is a judgment call that you just have to think about. Do you think that the underlying relationship is curved? If so, then summarizing its strength with a correlation would be misleading.
• Outlier Condition: Unusual observations can distort the correlation and can make an otherwise small correlation look big or, on the other hand, hide a large correlation. An outlier can even give an otherwise positive association a negative correlation coefficient (and vice versa). When you see an outlier, it’s often a good idea to report the correlation both with and without the point.
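A small sketch can make the last two conditions concrete. The numbers below are invented purely for illustration, and the helper `correlation` is our own, written from the sums-of-squares form of the formula. The first part computes $r$ for a perfectly curved relationship; the second computes $r$ with and without a single unusual point.

```python
from math import sqrt


def correlation(xs, ys):
    """Correlation from sums of deviations (the sums-of-squares form)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Linearity Condition: y is completely determined by x, but the
# relationship is curved (y = x squared), not straight.
x_curve = [-3, -2, -1, 0, 1, 2, 3]
y_curve = [x ** 2 for x in x_curve]
r_curve = correlation(x_curve, y_curve)

# Outlier Condition: six points with a strong positive association,
# then the same six points plus one unusual observation.
x_clean = [1, 2, 3, 4, 5, 6]
y_clean = [2.1, 2.9, 4.2, 4.8, 6.1, 7.0]
r_without = correlation(x_clean, y_clean)

x_outlier = x_clean + [20]
y_outlier = y_clean + [-5]
r_with = correlation(x_outlier, y_outlier)

print(f"curved relationship:   r = {r_curve:.3f}")
print(f"without the outlier:   r = {r_without:.3f}")
print(f"with the outlier:      r = {r_with:.3f}")
```

In this made-up example the curved data give a correlation of zero even though $y$ depends entirely on $x$, and the single added point turns a strong positive correlation into a negative one. Neither number would be a fair summary without the scatterplot.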

Each of these conditions is easy to check with a scatterplot. Many correlations are reported without supporting data or plots. You should still think about the conditions. You should be cautious in interpreting (or accepting others’ interpretations of) the correlation when you can’t check the conditions for yourself.

Throughout this course, you’ll see that doing statistics right means selecting the proper methods. That means you have to think about the situation at hand. An important first step is to check that the type of analysis you plan is appropriate. These conditions are just the first of many such checks.


