# At What Level to Cluster?

A practical question that arises in the context of cluster-robust inference is “At what level should we cluster?” In some examples you could cluster at a very fine level, such as families or classrooms, or at higher levels of aggregation, such as neighborhoods, schools, towns, counties, or states. What is the correct level at which to cluster? Rules of thumb have been advocated by practitioners, but at present there is little formal analysis to provide useful guidance. What do we know?

First, suppose cluster dependence is ignored or imposed at too fine a level (e.g., clustering by households instead of villages). Then variance estimators will be biased, as they will omit covariance terms between observations that are in fact dependent. As correlation is typically positive, this suggests that standard errors will be too small, giving rise to spurious indications of significance and precision.

Second, suppose cluster dependence is imposed at too aggregate a level (e.g., clustering by states rather than villages). This does not cause bias. But the variance estimators will contain many extra components, so the precision of the covariance matrix estimator will be poor. This means that reported standard errors will be less precise (more random) than if clustering had been less aggregate.
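The first point can be illustrated with a small simulation. The following is a minimal sketch, not a definitive implementation: the helper `cluster_robust_se`, the village/household layout, and the data-generating process are all assumptions made for illustration, and no finite-sample adjustment is applied to the variance estimator.

```python
import numpy as np


def cluster_robust_se(X, y, groups):
    """OLS coefficients and cluster-robust standard errors.

    Hypothetical helper for illustration; no small-sample correction.
    """
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    scores = X * resid[:, None]          # row i is x_i * e_i
    # Sum the scores within each cluster, then form the sandwich "meat".
    _, idx = np.unique(groups, return_inverse=True)
    sums = np.zeros((idx.max() + 1, X.shape[1]))
    np.add.at(sums, idx, scores)
    V = XtX_inv @ (sums.T @ sums) @ XtX_inv
    return beta, np.sqrt(np.diag(V))


# Assumed design: 50 villages, 4 households each, 5 people per household.
rng = np.random.default_rng(0)
n_villages, hh_per_village, n_per_hh = 50, 4, 5
household = np.repeat(np.arange(n_villages * hh_per_village), n_per_hh)
village = household // hh_per_village
n = household.size

# Regressor and error both contain a village-level component,
# so errors are positively correlated within villages.
x = rng.normal(size=n_villages)[village]
e = 2.0 * rng.normal(size=n_villages)[village] + rng.normal(size=n)
y = 1.0 + 0.5 * x + e
X = np.column_stack([np.ones(n), x])

beta, se_individual = cluster_robust_se(X, y, np.arange(n))
_, se_household = cluster_robust_se(X, y, household)
_, se_village = cluster_robust_se(X, y, village)

print("slope estimate:          ", beta[1])
print("SE, no clustering:       ", se_individual[1])
print("SE, household clustering:", se_household[1])
print("SE, village clustering:  ", se_village[1])
```

In this design the dependence lives at the village level, so clustering by individual or by household drops the cross-household covariance terms and reports a smaller slope standard error than clustering by village. Treating each observation as its own cluster reduces the estimator to the usual heteroskedasticity-robust (HC0) form.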

**Proof of Theorem 4.6.** The proof technique is to calculate the Cramér-Rao bound from a carefully crafted parametric model. (For the Cramér-Rao Theorem, see, for example, Chapter 10 of Introduction to Econometrics.) We use a conditional version of the Cramér-Rao Theorem: If $f(y \mid \boldsymbol{x}, \boldsymbol{\theta})$ is a correctly specified probability model which depends on a finite dimensional parameter $\boldsymbol{\theta} \in \boldsymbol{\Theta}$, the support of $y$ does not depend on $\boldsymbol{\theta}$, $\boldsymbol{\theta}$ lies in the interior of $\boldsymbol{\Theta}$, and $\tilde{\boldsymbol{\theta}}$ is an unbiased estimator of $\boldsymbol{\theta}$ based on a sample of size $n$, then $\operatorname{var}[\tilde{\boldsymbol{\theta}} \mid \boldsymbol{X}] \geq\left(\sum_{i=1}^n \mathscr{I}_{\boldsymbol{\theta}}\left(\boldsymbol{x}_i\right)\right)^{-1}$, where $\mathscr{I}_{\boldsymbol{\theta}}(\boldsymbol{x})$ is the information matrix for the model $f(y \mid \boldsymbol{x}, \boldsymbol{\theta})$.

For ease of exposition we focus on the case where $e_i$ has a conditional density $f(e \mid \boldsymbol{x})$. (The same argument applies to the discrete case using instead the probability mass function.)

The idea is as follows. The Cramér-Rao Theorem shows that within a parametric model an unbiased estimator cannot have lower variance than the inverse information matrix. This is true for any correctly-specified parametric model, meaning any parametric model which includes the true distribution as a special case. Thus any correctly-specified parametric model produces a valid variance lower bound. The best bound is the supremum across these variance lower bounds. Rather than computing that directly, we recognize that our goal is to produce a model with the specific variance lower bound $\left(\boldsymbol{X}^{\prime} \boldsymbol{D}^{-1} \boldsymbol{X}\right)^{-1}$. This is achieved if the information matrix equals $\boldsymbol{X}^{\prime} \boldsymbol{D}^{-1} \boldsymbol{X}$, which in turn holds if the model has the likelihood score $\boldsymbol{x}_i e_i \sigma_i^{-2}$. This suggests the parametric model for the error $e_i$
$$f(e \mid \boldsymbol{x}, \boldsymbol{\theta})=f(e \mid \boldsymbol{x})\left(1+\frac{\boldsymbol{\theta}^{\prime} \boldsymbol{x} e}{\sigma^2(\boldsymbol{x})}\right)$$
where $f(e \mid \boldsymbol{x})$ is the true conditional density. This model does not quite work, however, since this density is not necessarily non-negative. Consequently we use a technically more detailed argument using trimming to ensure a non-negative density.
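
To see why this model is a natural candidate, the following sketch checks three facts, setting the non-negativity problem aside. It assumes $\mathbb{E}[e \mid \boldsymbol{x}]=0$ and $\mathbb{E}\left[e^2 \mid \boldsymbol{x}\right]=\sigma^2(\boldsymbol{x})$ under the true density, and evaluates the score at $\boldsymbol{\theta}=\mathbf{0}$, where the model coincides with the true density.

$$\begin{aligned}
\int f(e \mid \boldsymbol{x}, \boldsymbol{\theta})\, de &= 1+\frac{\boldsymbol{\theta}^{\prime} \boldsymbol{x}}{\sigma^2(\boldsymbol{x})}\, \mathbb{E}[e \mid \boldsymbol{x}] = 1, \\
\left.\frac{\partial}{\partial \boldsymbol{\theta}} \log f(e \mid \boldsymbol{x}, \boldsymbol{\theta})\right|_{\boldsymbol{\theta}=\mathbf{0}} &= \frac{\boldsymbol{x} e}{\sigma^2(\boldsymbol{x})}, \\
\mathscr{I}_{\mathbf{0}}(\boldsymbol{x}) = \mathbb{E}\left[\frac{\boldsymbol{x} \boldsymbol{x}^{\prime} e^2}{\sigma^4(\boldsymbol{x})} \,\middle|\, \boldsymbol{x}\right] &= \frac{\boldsymbol{x} \boldsymbol{x}^{\prime}}{\sigma^2(\boldsymbol{x})} .
\end{aligned}$$

Summing over observations gives $\sum_{i=1}^n \boldsymbol{x}_i \boldsymbol{x}_i^{\prime} \sigma_i^{-2}$, which equals $\boldsymbol{X}^{\prime} \boldsymbol{D}^{-1} \boldsymbol{X}$ if $\boldsymbol{D}=\operatorname{diag}\left(\sigma_1^2, \ldots, \sigma_n^2\right)$ (an assumption here, since $\boldsymbol{D}$ is not defined in this excerpt).
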
For some $0<c<\infty$ define
$$\bar{\sigma}^2(\boldsymbol{x})=\mathbb{E}\left[e_i^2 \mathbb{1}\left(\left|e_i\right| \leq c / 2\right) \mid \boldsymbol{x}_i=\boldsymbol{x}\right]$$
and $\bar{\sigma}_i^2=\bar{\sigma}^2\left(\boldsymbol{x}_i\right)$. Notice that as $c \rightarrow \infty$, $\bar{\sigma}_i^2 \rightarrow \sigma_i^2$ for each $i$. Set $\delta>0$. Pick $c$ sufficiently large so that $\bar{\sigma}_i^2 \geq \delta$ for all $i$. Let $M=\max_{i \leq n}\left|\boldsymbol{x}_i\right|$.
Define the trimmed error
$$u_i=e_i \mathbb{1}\left(\left|e_i\right| \leq c / 2\right)-\mathbb{E}\left[e_i \mathbb{1}\left(\left|e_i\right| \leq c / 2\right) \mid \boldsymbol{x}_i\right]$$
Notice that $u_i$ satisfies $\left|u_i\right| \leq c$, $\mathbb{E}\left[u_i \mid \boldsymbol{x}_i\right]=0$, and $\mathbb{E}\left[e_i u_i \mid \boldsymbol{x}_i\right]=\bar{\sigma}_i^2$.
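
These three properties can be checked directly from the definitions. The sketch below assumes $\mathbb{E}\left[e_i \mid \boldsymbol{x}_i\right]=0$ and uses the shorthand $\mathbb{1}_i=\mathbb{1}\left(\left|e_i\right| \leq c / 2\right)$, which is introduced here only for brevity.

$$\begin{aligned}
\left|u_i\right| &\leq\left|e_i \mathbb{1}_i\right|+\left|\mathbb{E}\left[e_i \mathbb{1}_i \mid \boldsymbol{x}_i\right]\right| \leq \frac{c}{2}+\frac{c}{2}=c, \\
\mathbb{E}\left[u_i \mid \boldsymbol{x}_i\right] &=\mathbb{E}\left[e_i \mathbb{1}_i \mid \boldsymbol{x}_i\right]-\mathbb{E}\left[e_i \mathbb{1}_i \mid \boldsymbol{x}_i\right]=0, \\
\mathbb{E}\left[e_i u_i \mid \boldsymbol{x}_i\right] &=\mathbb{E}\left[e_i^2 \mathbb{1}_i \mid \boldsymbol{x}_i\right]-\mathbb{E}\left[e_i \mid \boldsymbol{x}_i\right] \mathbb{E}\left[e_i \mathbb{1}_i \mid \boldsymbol{x}_i\right]=\bar{\sigma}_i^2 .
\end{aligned}$$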

