%% This is the Sweave file for %%"Cluster-robust standard errors using R (www.r-project.org)" %% Run this file in R: `Sweave("clustering.Rnw")' %% Author: Mahmood Arai 2009 02 10 %% revised 2009 05 23 %% added a function for group demeaning to use instead %% of using plm to obtain group-demeaned data. %% revised 2010 10 27 %% minor typos fixed. %% revised 20110131 %% the plm syntax modified to adapt to the changes in plm \documentclass[10pt,a4paper]{article} \usepackage[utf8x]{inputenc} \usepackage{color,amsmath,amsfonts,natbib} \usepackage{Sweave,fancyvrb,color,url,hyperref,url} \SweaveOpts{echo=TRUE, keep.source=TRUE} \setkeys{Gin}{width=\textwidth} \setlength{\parindent}{0pt} \setlength{\parskip}{0.2cm} \newcommand{\R}{\textbf{R}} \newcommand{\proglang}[1]{{\upshape\mdseries\sffamily #1}} \newcommand{\pkg}[1]{{\upshape\bfseries\rmfamily #1}} \definecolor{Red}{rgb}{0.5,0,0} \definecolor{Blue}{rgb}{0,0,0.5} \setlength{\textheight}{640pt} \setlength{\textwidth}{400pt} \hypersetup{% breaklinks = {true}, colorlinks = {true}, linkcolor = {Blue}, citecolor = {Blue}, urlcolor = {Red} } \begin{document} \title{Cluster-robust standard errors using \htmladdnormallink{R}{http://www.r-project.org}} \author{\htmladdnormallink{Mahmood Arai}{http://people.su.se/~ma} \\Department of Economics and SULCIS \\Stockholm University} \maketitle \begin{quote} \section{Introduction} This note deals with estimating cluster-robust standard errors on one and two dimensions using R (see \citet{R-Core}). Cluster-robust standard errors are an issue when the errors are correlated within groups of observations. For discussion of robust inference under within groups correlated errors, see \citet{Wooldridge03}, \citet{Cameron06}, and \citet{Petersen06} and the references therein. Two data sets are used. The first data set is panel data from Introduction to Econometrics by \citet{StockWatson06BOOK}, chapter 10. The second data set is the Mitchell Petersen's test data for two-way clustering. The first part of this note deals with estimation of fixed-effects model using the Fatality data. The second part deals with cluster-robust standard errors. You need to install package \verb+lmtest+ by Torsten Hothorn, Achim Zeileis, Giovanni Millo and David Mitchell, package \verb+sandwich+ by Thomas Lumley and Achim Zeileis, package \verb+plm+ by Yves Croissant and Giovanni Millo and \verb+Ecdat+ by Yves Croissant. The function \verb+robcov+ in the package \verb+Design+ by Frank E. Harrell Jr can be used for clustering in one dimension in case of an \verb+ols-fit+. The function \verb+plm+ can be used for obtaining one-way clustered standard errors. \section{Estimating fixed-effects model} The data set \verb+Fatality+ in the package \verb+Ecdat+ cover data for 48 US states over 7 years. One way to estimate such a model is to include fixed group intercepts in the model. This is an example estimating a two-way fixed effects model. <>= library(Ecdat) data(Fatality) LSDV <- lm(mrall ~ beertax + factor(year) + factor(state), data=Fatality) @ When the number of groups are large, we run into the incidental parameter problem, implying inconsistent parameter estimates, and will have computational problems inverting a large model-matrix. A solution is to use variables measured as deviation from group mean in estimation. Using such an transformation we have to correct the degree of freedom for the number of group means that are estimated using the transformation. Let us first we write a function that computes deviation from group means of our variables. This makes only sense for numeric variables. The following function takes a dataframe \verb+df1+ and yields a new data set including the original data and new variables computed as deviation from group means as defined by the argument \verb+group+. The group centered variables have the same names as the original variables with a prefix \verb+C.+. <>= gcenter <- function(df1,group) { variables <- paste( rep("C", ncol(df1)), colnames(df1), sep=".") copydf <- df1 for (i in 1:ncol(df1)) { copydf[,i] <- df1[,i] - ave(df1[,i], group,FUN=mean)} colnames(copydf) <- variables return(cbind(df1,copydf))} @ Now we use this function to obtain transformed data. <>= centerFatality <- gcenter(Fatality[,1:4], Fatality$state) @ We can then use this transformed data and run \verb+OLS+ using the same model as before. <>= fmlm <- lm(C.mrall ~ C.beertax + factor(year), data=centerFatality) @ The variance-covariance matrix must be reweighted with $dfcw=(N-k)/((N - k) - (M-1))$ where $N$ is the total number of observations, $M$ is the number of groups and $K$ is the model rank. <>= library(sandwich) M <- length(unique(Fatality$state)) dfcw <- fmlm$df / (fmlm$df - (M -1)) library(lmtest) coeftest(fmlm, dfcw*vcov(fmlm)) @ The second argument reweights the variance-covariance matrix correcting the degrees of freedom accounting for the fact that data are centered around group means. The standard errors are valid under constant error variance. However, in fixed-effects models you should use cluster-robust standard errors as described in the next section -- See \citet{arellano87}, \citet{wooldridge02} and \citet{StockWatson06NBER}. The package \verb+plm+ can be used to compute one-way cluster-robust standard errors. <>= library(plm) fmplm <- plm(mrall~ beertax + factor(year), data=Fatality) @ The degree-of-freedom of \verb+arellano+ in \verb+plm+ using \verb+HC1+ is $N/(N-K)$. In the next section we use a slightly different degree-of-freedom correction in order to replicate \citet{StockWatson06BOOK} and \citet{Petersen06}. \section{Cluster-robust standard errors} The file \verb+http://people.su.se/~ma/clmclx.R+ includes two functions: \verb+clx+ and \verb+mclx+ for one and two-way clustering. These functions are shown below. The functions have the following arguments: \begin{itemize} \item The fitted model \verb+fm+ \item A factor for the degree of freedom correction when we have estimated on deviation from group mean data, \verb+dfcw+. Set this argument to \verb+1+ when such a degree of freedom correction is not necessary. \item The cluster variable or variables, \verb+cluster+, \verb+ cluster1+, \verb+cluster2+. \end{itemize} The functions have two parts. For $N$ observations, $M$ clusters, and $K= $rank($\mathbf{X}$) where $\mathbf{X}$ is the matrix of regressors, $(M/(M-1))*((N-1)/(N-K))$ is computed as degree of freedom correction. The second part of the function computes: \begin{equation} \mathbf{X'X}^ {-1} \mathbf{u}^\prime \mathbf{u} \mathbf{X'X}^{-1} \nonumber \end{equation} where $\mathbf{u}$ is a $M \times K$ matrix with rows $u_j$. Each row is the per cluster sum of $\mathbf{X}_j * \mathbf{e}_j$ over all individuals within each cluster. Denoting the number of observations in cluster $j$ as $N_j$, $\mathbf{X}_j$ is a $N_j \times K $ matrix of regressors for cluster $j$, the star $*$ denotes element by elements multiplication and $\mathbf{e}_j$ is a $N_j \times 1$ vector of residuals. The $\mathbf{X}_j * \mathbf{e}_j$ is estimated using the function \verb+estfun+. Summing over observations per cluster using \verb+apply(...)+ yields $\mathbf{u}$. This is then used as the argument in the function \verb+sandwich+ to obtain the variance covariance matrix (\citet{Zeileis06}). The function \verb+mclx+ has the same structure repeated over clusters and the overlap between clusters and finally summarized as summing up over clusters minus the overlap. The functions \verb+clx+ for one-way clustering. <>= clx <- function(fm, dfcw, cluster){ library(sandwich) library(lmtest) M <- length(unique(cluster)) N <- length(cluster) dfc <- (M/(M-1))*((N-1)/(N-fm$rank)) u <- apply(estfun(fm),2, function(x) tapply(x, cluster, sum)) vcovCL <- dfc*sandwich(fm, meat=crossprod(u)/N)*dfcw coeftest(fm, vcovCL) } @ Clustered on state, replicating Stock and Watson <>= clx(fmlm, dfcw, Fatality$state) @ The function \verb+mclx+ for two-way clustering. <>= mclx <- function(fm, dfcw, cluster1, cluster2){ library(sandwich) library(lmtest) cluster12 = paste(cluster1,cluster2, sep="") M1 <- length(unique(cluster1)) M2 <- length(unique(cluster2)) M12 <- length(unique(cluster12)) N <- length(cluster1) K <- fm$rank dfc1 <- (M1/(M1-1))*((N-1)/(N-K)) dfc2 <- (M2/(M2-1))*((N-1)/(N-K)) dfc12 <- (M12/(M12-1))*((N-1)/(N-K)) u1 <- apply(estfun(fm), 2, function(x) tapply(x, cluster1, sum)) u2 <- apply(estfun(fm), 2, function(x) tapply(x, cluster2, sum)) u12 <- apply(estfun(fm), 2, function(x) tapply(x, cluster12, sum)) vc1 <- dfc1*sandwich(fm, meat=crossprod(u1)/N ) vc2 <- dfc2*sandwich(fm, meat=crossprod(u2)/N ) vc12 <- dfc12*sandwich(fm, meat=crossprod(u12)/N) vcovMCL <- (vc1 + vc2 - vc12)*dfcw coeftest(fm, vcovMCL)} @ The following applies the clustering functions on Mitchell Petersen's test-data. Download the \verb+test_data.txt+ from Petersens se-programming page and create a \verb+lm+ object by running \verb+y+ on \verb+x+ using the data \verb+test+. <>= SITE <- "http://www.kellogg.northwestern.edu/faculty/petersen/" URLdata <- paste(SITE, "/htm/papers/se/test_data.txt", sep="") VarNames <- c("firmid", "year", "x", "y") test <- read.table(file=URLdata, col.names=VarNames) fm <- lm(y ~ x, data=test) @ To cluster on firm, the arguments are the fitted model \verb+fm+, we need no degree of freedom correction since we have estimated the model on the original location (no transformation) implying that the second argument is \verb+1+.The third argument specify the cluster variable. The variable for firm indicator is \verb+firmid+ in the data \verb+test+. <>= clx(fm,1,test$firmid) @ The following yields clustering on year. <>= clx(fm,1, test$year) @ For clustering on firm and year we use the function \verb+mclx+. <>= mclx(fm,1, test$firmid, test$year) @ These results were obtained on a \verb+\Sexpr{sessionInfo()$R.version$platform}+ platform using \proglang \Sexpr{sessionInfo()$R.version$version.string} \citep{R-Core} with packages \pkg{\Sexpr{sessionInfo()$otherPkgs$lmtest$Package}} \Sexpr{sessionInfo()$otherPkgs$lmtest$Version} (\Sexpr{sessionInfo()$otherPkgs$lmtest$Date}), \pkg{\Sexpr{sessionInfo()$otherPkgs$sandwich$Package}} \Sexpr{sessionInfo()$otherPkgs$sandwich$Version} (\Sexpr{sessionInfo()$otherPkgs$sandwich$Date}), \pkg{\Sexpr{sessionInfo()$otherPkgs$plm$Package}} \Sexpr{sessionInfo()$otherPkgs$plm$Version} (\Sexpr{sessionInfo()$otherPkgs$plm$Date}) and \pkg{\Sexpr{sessionInfo()$otherPkgs$Ecdat$Package}} \Sexpr{sessionInfo()$otherPkgs$Ecdat$Version} (\Sexpr{sessionInfo()$otherPkgs$Ecdat$Date}). \bibliography{LittBib} \bibliographystyle{abbrvnat} \end{quote} \end{document} <>= write(paste( "@Article{arellano87, author={Arellano, M}, title={Computing Robust Standard Errors for Within-Groups Estimators}, journal={Oxford Bulletin of Economics and Statistics}, year=1987, volume={49}, number={4}, pages={431-34}, month={November}, url={http://ideas.repec.org/a/bla/obuest/v49y1987i4p431-34.html} }", "@Article{Cameron06, title = {Robust Inference with Multiway Clustering }, author = {Colin A. Cameron and Jonah B. Gelbach and Douglas L. Miller}, journal= {NBER Working Paper}, year = {2006}, volume = {T0327}, URL = {http://www.nber.org/papers/t0327} }", "@Article{CroissantMillo08, title = {Panel Data Econometrics in R: The plm Package}, author = {Yves Croissant and Giovanni Millo}, journal= {Journal of Statistical Software}, year = {2008}, volume = {27}, URL = {http://www.jstatsoft.org/v27/i02} }", "@Article{LongEtAl00, title = {Using Heteroscedasticity Consistent Standard Errors in the Linear Regression Model}, author = {{Long, J.S and L.H. Ervin}}, journal= {The American Statistician}, year = {2000}, volume= {54}, pages={217-224} }", "@Article{MacKinnonWhite85, title = {Some heteroskedasticity consistent covariance matrix estimators with improved finite sample properties}, author = {{MacKinnon, J.G. and H. White}}, journal= {Journal of Econometrics}, year = {1985}, volume= {29}, pages={53-57} }", "@TechReport{Petersen06, author={Mitchell A. Petersen}, title={Estimating Standard Errors in Finance Panel Data Sets: Comparing Approaches}, year=2005, month=Apr, institution={National Bureau of Economic Research, Inc}, type={NBER Working Papers}, url={http://ideas.repec.org/p/nbr/nberwo/11280.html}, number={11280} }", "@Manual{R-Core, title = {R: A Language and Environment for Statistical Computing}, author = {{R Development Core Team}}, organization = {R Foundation for Statistical Computing}, address = {Vienna, Austria}, year = {2007}, note = {{ISBN} 3-900051-07-0}, url = {http://www.R-project.org} }", "@TechReport{StockWatson06NBER, author ={James H. Stock and Mark W. Watson}, title = {Heteroskedasticity-Robust Standard Errors for Fixed Effects Panel Data Regression}, year = 2006, month = Jun, institution={National Bureau of Economic Research, Inc}, type = {NBER Technical Working Papers}, url = {http://ideas.repec.org/p/nbr/nberte/0323.html}, number={0323} }", "@Book{StockWatson06BOOK, title = {Introduction to Econometrics}, author = {James H. Stock and Mark W. Watson}, publisher= {Addison Wesley}, year = {2006} }", "@Article{White80, title = {A heteroskedastic-consistent covariance matrix estimator and a direct test of heteroskedasticity}, author = {Harel White}, journal = {Econometrica}, year = {1980}, volume = {48}, pages = {817-838} }", "@Book{Wooldridge02, title = {Econometric Analysis of Cross Section and Panel Data}, author = {Jeffrey M. Wooldridge}, publisher= {Cambridge, MA: MIT Press}, year = {2002}, }", "@Article{Wooldridge03, title = {Cluster-sample Methods in Applied Econometrics }, author = {Jeffrey M. Wooldridge}, journal = {American Economic Review}, year = {2003}, volume = {93}, pages ={133-138} }", "@Article{Zeileis06, title = {Object-oriented Computation of Sandwich Estimators }, author = {Achim Zeileis}, journal = {Journal of Statistical Software}, year = {2006}, volume = {i09}, url = {http://www.jstatsoft.org/v16/i09} }", sep="\n"), file="LittBib.bib") @ <>= system("bibtex clustering")