%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% Mahmood Arai 2009 02 01 %% ma@ne.su.se %% This is the Sweave file %%"for A Brief Guide to R for Beginners in Econometrics" %% Run this file in R: `Sweave("R_intro.Rnw")' %% See http://www.stat.uni-muenchen.de/~leisch/Sweave/ %% for Sweave (written by Friedrich Leisch) %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% Preamble %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \documentclass[10pt,a4paper]{article} \usepackage[latin1]{inputenc} \usepackage{html,graphics,color,amsmath,amsfonts,natbib} \usepackage{Sweave,fancyvrb,color,url,hyperref} \SweaveOpts{echo=TRUE} \setkeys{Gin}{width=\textwidth} \setlength{\parindent}{0pt} \setlength{\parskip}{0.2cm} \newcommand{\R}{\textbf{R}} \definecolor{Red}{rgb}{0.5,0,0} \definecolor{Blue}{rgb}{0,0,0.5} \hypersetup{% breaklinks = {true}, colorlinks = {true}, linkcolor = {Blue}, citecolor = {Blue}, urlcolor = {Red} } \newcommand{\ma}{\htmladdnormallink{Mahmood Arai}{http://people.su.se/~ma}} \newcommand{\Rhome}{\htmladdnormallink{\R\ Homepage} {http://www.r-project.org/}} \newcommand{\CRAN}{\htmladdnormallink{CRAN-mirror} {http://cran.r-project.org/}} \newcommand{\CRANTV}{\htmladdnormallink{CRAN Task View: Computational Econometrics} {http://ftp.sunet.se/pub/lang/CRAN/web/views/Econometrics.html}} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \begin{document} \title{A Brief Guide to \R\ for Beginners in Econometrics} \author{\ma \\Department of Economics, Stockholm University\\ {\small First Version: 2002-11-05, This Version: 2009-09-02}} \date{} \maketitle \section{Introduction} \subsection{About \R} \R\ is published under the GPL (GNU Public License) and exists for all major platforms. \R\ is described on the \Rhome\ as follows: \begin{quotation} \textit{"R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS. To download R, please choose your preferred \CRAN."} \end{quotation} See \Rhome\ for manuals and documentations. There are a number of books on \R. See \url{http://www.R-project.org/doc/bib/R.bib} for a bibliography of the R-related publications. \citet{Dalgaard} and \citet{Fox} are nice introductory books. For an advanced book see \citet{VR} which is a classic reference. \citet{AER} covers econometrics. See also \CRANTV.\ For weaving \R\ and \LaTeX see Sweave \url{http://www.stat.uni-muenchen.de/~leisch/Sweave/}. For reproducible research using \R\ see \citet{Koenker07}. To cite \R\ in publications you can refer to: \citet{Rcore} \subsection{About these pages} This is a brief manual for beginners in Econometrics. For latest version see \url{http://people.su.se/~ma/R_intro/R_intro.pdf}. For the Sweave file producing these pages see \url{http://people.su.se/~ma/R_intro/R_intro.Rnw}. The symbol \verb+#+ is used for comments. Thus all text after \verb+ # + in a line is a comment. Lines following \verb+ > + are \R-commands executed at the \R\ prompt which as standard looks like \verb+ >+. This is an example: <>= myexample <- "example" myexample @ R-codes including comments of codes that are not executed are indented as follows: {\small \begin{verbatim} myexample <- "example" # creates an object named myexample \end{verbatim}} The characters within \verb+< >+ refer to verbatim names of files, functions etc. when it is necessary for clarity. The names \verb++ such as \verb+,+ are used to refer to a general dataframe, object etc. \subsection{Objects and files} \R\ regards things as \textit{objects}. A dataset, vector, matrix, results of a regression, a plot etc. are all objects. One or several objects can be saved in a file. A file containing \R-data is not an object but a set of objects. Basically all commands you use are \textit{functions}. A command: something(object), does something on an object. This means that you are going to write lots of parentheses. Check that they are there and check that they are in the right place. \section{First things} \subsection{Installation} \R\ exists for several platforms and can be downloaded from [\CRAN]. \\ \subsection{Working with \R} It is a good idea to create a directory for a project and start \R\ from there. This makes it easy to save your work and find it in later sessions. If you want R to start in a certain directory in \textit{MS-Windows}, you have to specify the \verb++ to be your working directory. This is done by changing the \verb++ by clicking on the right button of the mouse while pointing at your R-icon, and then going to \verb++. \textit{Displaying the working directory within \R:} <>= getwd() @ Changing the working directory to an existing directory \verb++ {\small \begin{verbatim} setwd("/home/ma/project1") \end{verbatim}} \subsection{Naming in \R} Do not name an object as \verb++ or \verb++ use instead \verb++. Notice that in \R\ \verb++ and \verb++ are two different names. Names starting with a digit (\verb+<1a>+) is not accepted. You can instead use \verb++) You should not use names of variables in a data-frame as names of objects. If you do so, the object will shadow the variable with the same name in another object. The problem is then that when you call this variable you will get the object -- the object shadows the variable / the variable will be masked by the object with the same name. To avoid this problem: 1- Do not give a name to an object that is identical to the name of a variable in your data frames. 2- If you are not able to follow this rule, refer to variables by referring to the variable and the dataset that includes the variable. For example the variable \verb++ in the data frame \verb++ is called by: {\small \begin{verbatim} df1$wage. \end{verbatim}} The problem of "shadowing" concerns \R\ functions as well. Do not use object names that are the same as \R\ functions. \verb++ checks whether an object you have created conflicts with another object in the \R\ packages and lists them. You should only care about those that are listed under \verb+<.GlobalEnv>+ -- objects in your workspace. All objects listed under \verb+<.GlobalEnv>+ shadows objects in \R\ packages and should be removed in order to be able to use the objects in the \R\ packages. The following example creates \verb++ that should be avoided (since \verb++ stands for ), checks conflicts and resolves the conflict by removing \verb++. {\small \begin{verbatim} T <- "time" conflicts(detail=TRUE) rm(T) conflicts(detail=TRUE) \end{verbatim}} You should avoid using the following one-letter words \verb++ as names. They have special meanings in \R. \textit{Extensions for files} It is a good practice to use the extension \verb++ for your files including R-codes. A file \verb++ is then a text-file including R-codes. The extension \verb++ is appropriate for work images (i.e files created by \verb++). The file \verb++ is then a file including R-objects. The default name for the saved work image is \verb+<.RData>+. Be careful not to name a file as \verb+<.RData>+ when you use \verb++ as extension, since you will then overwrite the \verb+<.Rdata>+ file. \subsection{Saving and loading objects and images of working spaces} Download the file \verb++ \url{http://people.su.se/~ma/R_intro/data/}. You can read the file \verb++ containing the data frames \verb++ and \verb++ as follows. {\small \begin{verbatim} load("DataWageMacro.rda") ls() # lists the objects \end{verbatim}} The following command saves the object \verb++ in a file \verb++. {\small \begin{verbatim} save(lnu, file="mydata.rda") \end{verbatim}} To save an image of the your workspace that will be automatically loaded when you next time start \R\ in the same directory. {\small \begin{verbatim} save.image() \end{verbatim}} You can also save your working image by answering \verb++ when you quit and are asked \\ \verb++. In this way the image of your workspace is saved in the hidden file \verb+<.RData>+. You can save an image of the current workspace and give it a name \verb++. {\small \begin{verbatim} save.image("myimage.rda") \end{verbatim}} \subsection{Overall options} \verb++ can be used to set a number of options that governs various aspects of computations and displaying results. Here are some useful options. We start by setting the line with to 60 characters. <>= options(width=60) @ {\small \begin{verbatim} options(prompt=" R> ") # changes the prompt to < R> >. options(scipen=3) # From R version 1.8. This option # tells R to display numbers in fixed format instead of # in exponential form, for example <1446257064291> instead of # <1.446257e+12> as the result of . options() # displays the options. \end{verbatim}} \subsection{Getting Help} {\small \begin{verbatim} help.start() # invokes the help pages. help(lm) # help on , linear model. ?lm # same as above. \end{verbatim}} \section{Elementary commands} {\small \begin{verbatim} ls() # Lists all objects. ls.str() # Lists details of all objects str(myobject) # Lists details of . list.files() # Lists all files in the current directory. dir() # Lists all files in the current directory. myobject # Prints simply the object. rm(myobject) # removes the object . rm(list=ls()) # removes all the objects in the working space. save(myobject, file="myobject.rda") # saves the object in a file . load("mywork.rda")# loads "mywork.rda" into memory. summary(mydata) # Prints the simple statistics for . hist(x,freq=TRUE) # Prints a histogram of the object . # yields frequency and # yields probabilities. q() # Quits R. \end{verbatim}} The output of a command can be directed in an object by using \verb+< <- >+ , an object is then assigned a value. The first line in the following code chunk creates vector named \verb++ with a values $1$,$2$ and $3$. The second line creates an object named \verb++ and prints the contents of the object \verb++. <>= VV <- c(1,2,3) (VV <- 1:2) @ \section{Data management} \subsection{Reading data in plain text format:} \textit{Data in columns} The data in this example are from a text file: \texttt{\htmladdnormallink{}{ http://people.su.se/~ma/R_intro/data/tmp.txt}}, containing the variable names in the first line (separated with a space) and the values of these variables (separated with a space) in the following lines. The following reads the contents of the file \verb++ and assigns it to an object named \verb++. <>= FILE <- "http://people.su.se/~ma/R_intro/data/tmp.txt" dat <- read.table( file= FILE, header = TRUE) dat @ The argument \verb+
+ indicates that the first line includes the names of the variables. The object \verb++ is a data-frame as it is called in \R. If the columns of the data in the file \verb++ were separated by \verb+<,>+, the syntax would be: {\small \begin{verbatim} read.table("tmp.txt", header = TRUE, sep=",") \end{verbatim}} Note that if your decimal character is not \verb+<.>+ you should specify it. If the decimal character is \verb+<,>+, you can use \verb++ and specify the following argument in the function \verb++.\\ \subsection{Non-available and delimiters in tabular data} We have a file \verb++ with the following contents: {\small \begin{verbatim} 1 . 9 6 3 2 \end{verbatim}} where the first observation on the second column (variable) is a missing value coded as \verb+<.>+. To tell \R\ that \verb+ <.>+ is a missing value, you use the argument: \verb++ <>= FILE <- "http://people.su.se/~ma/R_intro/data/data1.txt" read.table(file=FILE ,na.strings="." ) @ Sometimes columns are separated by other separators than spaces. The separator might for example be \verb+ <,>+ in which case we have to use the argument \verb+ +. Be aware that if the columns are separated by \verb+ <,>+ and there are spaces in some columns like the case below the \verb+ + does not work. The NA is actually coded as two spaces, a point and two spaces, and should be indicated as: \verb+ +. {\small \begin{verbatim} 1, . ,9 6, 3 ,2 \end{verbatim}} Sometimes missing value is simply as follows. {\small \begin{verbatim} 1 9 6 3 2 \end{verbatim}} Notice that there are two spaces between 1 and 9 in the first line implying that the value in the second column is blank. This is a missing value. Here it is important to specify \verb++ along with\verb+ + . \subsection{Reading and writing data in other formats} Attach the library \verb++ in order to read data in various standard packages data formats. Examples are SAS, SPSS, STATA, etc. {\small \begin{verbatim} library(foreign) # reads the data and put it in the object lnu <- read.dta(file="wage.dta") \end{verbatim}} \verb++ , \verb++ etc. are other commands in the foreign package for reading data in SAS and SPSS format. \\ It is also easy to write data in a foreign format. The following codes writes the object to stata-file . {\small \begin{verbatim} library(foreign) write.dta(lnu,"lnunew.dta") \end{verbatim}} \subsection{Examining the contents of a data-frame object} Here we use data from \htmladdnormallink{Swedish Level of Living Surveys}{http://www2.sofi.su.se/LNU2000/english.htm} LNU 1991. <>= FILE <-"http://people.su.se/~ma/R_intro/data/lnu91.txt" lnu <- read.table(file=FILE, header = TRUE) @ Attaching the \verb++ data by \verb++ allows you to access the contents of the dataset \verb++ by referring to the variable names in the \verb++. If you have not attached the \verb++ you can use \verb++ to refer to the variable \verb++ in the data frame \verb++. When you do not need to have the data attached anymore, you can undo the \verb++ by \verb++ \textit{A description of the contents of the data frame lnu.} <>= str(lnu) # Description of the data structure summary(lnu) # A summary description of the data @ \subsection{Creating and removing variables in a data frame} Here we create a variable \verb++ as the logarithm of \verb++. Then we remove the variable. <>= lnu$logwage <- log(lnu$wage) lnu$logwage <- NULL @ Notice that you do not need to create variables that are simple transformations of the original variables. You can do the transformation directly in your computations and estimations. \subsection{Choosing a subset of variables in a data frame} {\small \begin{verbatim} # Read a of variables (wage,female) in lnu. lnu.female <- subset(lnu, select=c(wage,female)) # Putting together two objects (or variables) in a data frame. attach(lnu) lnu.female <- data.frame(wage,female) # Read all variables in lnu but female. lnux <- subset(lnu, select=-female) # The following keeps all variables from wage to public as listed above lnuxx <- subset(lnu, select=wage:public) \end{verbatim}} \subsection{Choosing a subset of observations in a dataset} {\small \begin{verbatim} attach(lnu) # Deleting observations that include missing value in a variable lnu <- na.omit(lnu) # Keeping observations for female only. fem.data <- subset(lnu, female==1) # Keeping observations for female and public employees only. fem.public.data <- subset(lnu, female==1 & public==1) # Choosing all observations where wage > 90 highwage <- subset(lnu, wage > 90) \end{verbatim}} \subsection{Replacing values of variables} We create a variable indicating whether the individual has university education or not by replacing the values in the schooling variable. Copy the schooling variable. <>= lnu$university <- lnu$school @ Replace university value with 0 if years of schooling is less than 13 years. <>= lnu$university <- replace(lnu$university, lnu$university<13, 0) @ Replace university value with 1 if years of schooling is greater than 12 years <>= lnu$university <- replace(lnu$university, lnu$university>12, 1) @ The variable \verb++ is now a dummy for university education. Remember to re-attach the data set after recoding. For creating category variables you can use \verb++. See further the section on below. <>= attach(lnu, warn.conflicts=FALSE) table(university) @ To create a dummy we could simply proceed as follows: <>= university <- school > 12 table(university) @ However, we usually do not need to create dummies. We can compute on \verb+ 12>+ directly, <>= table(school > 12) @ \subsection{Replacing missing values} We create a vector. Recode one value as missing value. And Then replace the missing with the original value. {\small \begin{verbatim} a <- c(1,2,3,4) # creates a vector is.na(a) <- a ==2 # recode a==2 as NA a <- replace(a, is.na(a), 2)# replaces the NA with 2 # or a[is.na(a)] <- 2 \end{verbatim}} \subsection{Factors} Sometimes our variable has to be redefined to be used as a category variable with appropriate levels that corresponds to various intervals. We might wish to have schooling categories that corresponds to schooling up to 9 years, 10 to 12 years and above 12 years. This could be coded by using \verb++. To include the lowest category we use the argument \verb+ +. <>= SchoolLevel <- cut(school, c(9,12, max(school), include.lowest=TRUE)) table(SchoolLevel) @ Labels can be set for each level. Consider the university variable created in the previous section. <>= SchoolLevel <- factor(SchoolLevel, labels=c("basic","gymnasium","university")) table(SchoolLevel) @ The factor defined as above can for example be used in a regression model. The reference category is the level with the lowest value. The lowest value is $1$ that corresponds to verb++ and the column for is not included in the contrast matrix. Changing the base category will remove another column instead of this column. This is demonstrated in the following example: <>= contrasts(SchoolLevel) contrasts(SchoolLevel) <- contr.treatment(levels(SchoolLevel),base=3) contrasts(SchoolLevel) @ The following redefines \verb++ as a numeric variable. <>= lnu$school <- as.numeric(lnu$school) @ \subsection{Aggregating data by group} Let us create a simple dataset consisting of 3 variables V1, V2 and V3. V1 is the group identity and V2 and V3 are two numeric variables. <>= (df1 <- data.frame(V1=1:3, V2=1:9, V3=11:19)) @ By using the command \verb++ we can create a new data.frame consisting of group characteristics such as \verb++ , \verb++ etc. Here the function sum is applied to \verb++ that is the second and third columns of \verb++ by the group identity \verb++. <>= (aggregate.sum.df1 <- aggregate(df1[,2:3],list(df1$V1),sum ) ) (aggregate.mean.df1 <- aggregate(df1[,2:3],list(df1$V1),mean)) @ The variable is a factor that identifies groups. The following is an example of using the function aggregate. Assume that you have a data set \verb++ including a unit-identifier \verb++. The units are observed repeatedly over time indicated by a variable \verb+dat$Time+. <>= (dat <- data.frame(id=rep(11:12,each=2), Time=1:2, x=2:3, y =5:6)) @ This computes group means for all variables in the data frame and drops the variable \verb++. \subsection{Tabulation} \htmladdnormallink{Read a dataset}{http://people.su.se/~ma/R_intro/data/} first. Cross Tabulation <>= attach(lnu,warn.conflicts=FALSE) table(female,public) # yields frequencies (ftable.row <- cbind(table(female,public), total=table(female))) (ftable.col <- rbind(table(female,public), total=table(public))) @ {\small \begin{verbatim} # Try this: # relative freq. by rows: female ftable.row/c(table(female)) # relative freq. by columns: public ftable.col/rep(table(public),each=3) # rep(table(public),each=3) repeats #each value in table(public) 3 times \end{verbatim}} Creating various statistics by category. The following yields average wage for males and females. <>= tapply(wage, female, mean) @ Using \verb+, , +, etc yields number of observations, minimum, maximum etc for males and females. <>= tapply(wage, female, length) @ The following example yields average wage for males and females in the private and public sector. <>= tapply(wage, list(female,public), mean) @ The following computes the average by group creating a vector of the same length. Same length implies that for the group statistics is retained for all members of each group. Average wage for males and females: <>= attach(lnu, warn.conflicts=FALSE) lnu$wage.by.sex<- ave(wage,female,FUN=mean) @ The function \verb++ can be substituted with \verb+, , + etc. yielding group-wise minimum, maximum, number of observations, etc. \section{Matrixes} In R we define a matrix as follows (see ?matrix in R): A matrix with 3 rows and 4 columns with elements 1 to 12 filled by columns. <>= matrix(1:12,3,4) @ A matrix with 3 rows and 4 columns with elements 1,2,3, ..., 12 filled by rows: <>= (A <- matrix(1:12,3,4,byrow=TRUE)) dim(A) # Dimension of a matrix nrow(A) # Number of rows, same as dim(A)[1] ncol(A) # Number of columns, same as dim(A)[2] @ \subsection{Indexation} The elements of a matrix can be extracted by using brackets after the matrix name and referring to rows and columns separated by a comma. You can use the indexation in a similar way to extract elements of other types of objects. \begin{verbatim} A[3,] # Extracting the third row A[,3] # Extracting the third column A[3,3] # the third row and the third column A[-1,] # the matrix except the first row A[,-2] # the matrix except the second column \end{verbatim} Evaluating some condition on all elements of a matrix <>= A>3 # Elements greater than 3 A==3 # Elements equal to 3 @ Listing the elements fulfilling some condition <>= A[A>6] # all elements greater than 6 @ \subsection{Scalar Matrix} A special type of matrix is a scalar matrix which is a square matrix with the same number of rows and columns, all off-diagonal elements equal to zero and the same element in all diagonal positions. The following exercises demonstrates some matrix facilities regarding the diagonals of matrixes. See also \verb+?upper.tri+ and \verb+?lower.tri+. <>= diag(2,3,3) diag(diag(2,3,3)) @ \subsection{Matrix operators} \textbf{Transpose of a matrix} Interchanging the rows and columns of a matrix yields the transpose of a matrix. <>= t(matrix(1:6,2,3)) # @ Try matrix(1:6,2,3) and matrix(1:6,3,2, byrow=T). \textbf{Addition and subtraction} Addition and subtraction can be applied on matrixes of the same dimensions or a scalar and a matrix. {\small \begin{verbatim} # Try this A <- matrix(1:12,3,4) B <- matrix(-1:-12,3,4) C1 <- A+B D1 <- A-B \end{verbatim}} \textbf{Scalar multiplication} {\small \begin{verbatim} # Try this A <- matrix(1:12,3,4); TwoTimesA = 2*A c(2,2,2)*A c(1,2,3)*A c(1,10)*A \end{verbatim}} \textbf{Matrix multiplication} For multiplying matrixes R uses <$\%*\%$> and this works only when the matrixes are conform. {\small \begin{verbatim} E <- matrix(1:9,3,3) crossproduct.of.E <- t(E)%*%E # Or another and more efficient way of obtaining crossproducts is: crossproduct.of.E <- crossprod(E) \end{verbatim}} \textbf{Matrix inversion} The inverse of a square matrix $\textbf{A}$ denoted as $\textbf{A}^{-1}$ is defined as a matrix that when multiplied with $\textbf{A}$ results in an Identity matrix (1's in the diagonal and 0's in all off-diagonal elements.) \begin{equation} \textbf{A}\textbf{A}^{-1}=\textbf{A}^{-1}\textbf{A}=\textbf{I}\nonumber \end{equation} {\small \begin{verbatim} FF <- matrix((1:9),3,3) detFF<- det(FF) # we check the determinant B <- matrix((1:9)^2,3,3) # create an invertible matrix Binverse <- solve(B) Identity.matrix <- B%*%Binverse \end{verbatim}} \section{Ordinary Least Squares} The function for running a linear regression model using OLS is \verb++. In the following example the dependent variable is \verb++ and the explanatory variables are \verb++ and \verb++. An intercept is included by default. Notice that we do not have to specify the data since the data frame \verb++ containing these variables is attached. The result of the regression is assigned to the object named \verb++. This object includes a number of interesting regression results that can be extracted as illustrated further below after some examples for using \verb++. \htmladdnormallink{Read a dataset} {http://people.su.se/~ma/R_intro/lnu91.txt} first. <>= reg.model <- lm (log(wage) ~ school + female) summary (reg.model) @ Sometimes we wish to run the regression on a subset of our data. {\small \begin{verbatim} lm (log(wage) ~ school + female, subset=wage>100) \end{verbatim}} Sometimes we need to use transformed values of the variables in the model. The transformation should be given as the in the function . I() means Identity function. \verb++ is \verb++ squared. {\small \begin{verbatim} lm (log(wage) ~ school + female + expr + I(expr^2)) \end{verbatim}} Interacting variables: \verb++, \verb++ {\small \begin{verbatim} lm (log(wage) ~ female*school, data=lnu) \end{verbatim}} Same as: {\small \begin{verbatim} lm (log(wage) ~ female + school + female:school, data= lnu) \end{verbatim}} A model with no intercept. {\small \begin{verbatim} reg.no.intercept <- lm (log(wage) ~ female - 1) \end{verbatim}} A model with only an intercept. {\small \begin{verbatim} reg.only.intercept <- lm (log(wage) ~ 1 ) \end{verbatim}} The following example runs \verb++ for females and males in the private and public sector separately as defined by the variables \verb++ and \verb++. The data \verb++ are split into four cells: males in private (0,0), females in private (1,0), males in public (0,1) and females in public (1,1). The obtained object \verb++ is a list and to display the \verb++ of each element we use \verb++ (list apply). {\small \begin{verbatim} by.reg <- by(lnu, list(female,public), function(x) lm(log(wage) ~ school, data=x)) # summary of the separate regressions lapply(by.reg, summary) # summary for the second element in the #list i.e females in private sector. summary(by.reg[[2]]) \end{verbatim}} The following lists mean of variables for male and female workers (the first line), creates a list named \verb+by.female.lnu+ of two data sets (the second line) and runs regressions for male and female workers (the third and fourth lines). {\small \begin{verbatim} by(lnu, list(female), mean) by.female.lnu <- by(lnu, list(female), function(x) x); str(by.female.lnu) summary(lm(log(wage) ~ school, data=by.female.lnu[[1]])) summary(lm(log(wage) ~ school, data=by.female.lnu[[2]])) \end{verbatim}} \subsection{Extracting the model formula and results} The model formula {\small \begin{verbatim} (equation1 <- formula(reg.model)) log(wage) ~ school + female \end{verbatim}} The estimated coefficients <>= coefficients(reg.model) # can be abbreviated as @ The standard errors <>= coef(summary(reg.model))[,2] @ \verb++ yields both \verb++ and \verb++ The t-values <>= coef(summary(reg.model))[,3] @ Try also \verb++. Analogously you can extract other elements of the lm-object by: The variance-covariance matrix: \verb++ : Residual degrees of freedom:\\ \verb++ \\ The residual sum of squares:\\ \verb++ And other components:\\ \verb++\\ \verb++\\ \verb++\\ \verb++\\ \verb++\\ \verb++\\ \subsection{White's heteroskedasticity corrected standard errors} The package \verb++ and \verb+ + and \verb+ + has predefined functions for computing robust standard errors. There are different weighting options. The White's correction <>= library(car) f1 <- formula(log(wage) ~ female +school) sqrt(diag(hccm(lm(f1),type="hc0"))) @ Using the library . <>= library(sandwich) library(lmtest) coeftest(lm(f1), vcov=(vcovHC(lm(f1), "HC0"))) @ \verb++ in library and \verb++ in library sandwich use the original White formula. The \verb++ \verb++ multiply the variances with $\frac{N}{N-k}$. Using library . <>= library(Design, war, warn.conflicts = FALSE) f1 <- formula(log(wage) ~ female +school) fm1 <- robcov(ols(f1, x=TRUE, y=TRUE)) @ \subsection{Group-wise non-constant error variance} This uses the library and account for non-constant error variance when data is clustered across sectors indicated by the variable <>= robcov(ols(f1, x=TRUE, y=TRUE), cluster=industry) @ When the number of groups M are small, you can correct the standard errors as follows: <>= library(Design) f1 <- formula(log(wage) ~ female +school) M <- length(unique(industry)) N <- length(industry) K <- lm(f1)$rank cl <- (M/(M-1))*((N-1)/(N-K)) fm1 <- robcov(ols(f1, x=TRUE, y=TRUE), cluster=industry) fm1$var <- fm1$var*cl coeftest(fm1) @ \subsection{F-test} Estimate the restricted (restricting some (or all) of slope coefficients to be zero) and the unrestricted model (allowing non-zero as well as zero coefficients). You can then use \verb+anova()+ to test the joint hypotheses defined as in the restricted model. <>= mod.restricted <- lm(log(wage) ~ 1) mod.unrestricted <- lm(log(wage) ~ female + school) anova(mod.restricted,mod.unrestricted) @ Under non-constant error variance, we use the White variance-Covariance matrix and the model F-value is as follows. The \verb+<-1>+ in the codes below remove related row/column for the intercept <>= library(car) COV <- hccm(mod.unrestricted, "hc1")[-1,-1] beta <- matrix(coef(mod.unrestricted, ,1))[-1,] t(beta)%*%solve(COV)%*%beta/(lm(f1)$rank -1) @ \newpage \section{Time-series} \label{TimeSeries} \textit{Data in rows} Time-series data often appear in a form where series are in rows. As an example I use data from the Swedish Consumer Surveys by the \htmladdnormallink{National Institute of Economic Research}{http://www.konj.se} containing three series: consumer confidence index, a macro index and a micro index. First I saved the original data in a file in text-format. Before using the data as input in \R\ I used a text editor and kept only the values for the first three series separated with spaces. The series are in rows. The values of the three series are listed in separate rows without variable names. To read this file \texttt{\htmladdnormallink{} {http://people.su.se/~ma/R_intro/data/macro.txt}}, the following code puts the data in a matrix of 3 columns and 129 rows with help of \verb++ and \verb++ before defining a time-series object starting in 1993 with frequency 12 (monthly). The series are named as \verb+,+ and \verb++. Notice that \verb++ by default fills in data by columns. <>= FILE <- "http://people.su.se/~ma/R_intro/macro.txt" macro <-ts( matrix( scan(FILE),129,3) , start=1993,frequency=12, names=c("cci","macro.index", "micro.index")) @ Here I give an example for creating lag values of a variable and adding it to a time-series data set. See also \verb++ for computing differences. Let us create a new time series data set, with the series in the data frame \verb++ adding lagged \verb++ (lagged by 1 month). The function \verb++ puts together the series keeping all observations while \verb++ would keep only the overlapping part of the series. <>= macro2 <- ts.union( macro, l.cci = lag(macro[,1],-1)) @ You can use the function aggregate to change the frequency of you time-series data. The following example converts the frequency data. \verb+nfrequency=1+ yields annual data. \verb+FUN=mean+ computes the average of the variables over time. The default is \verb++. <>= aggregate(macro,nfrequency=1,FUN=mean) @ \subsection{Durbin Watson} \verb++ in the package \verb++ and \verb++ in the package \verb++ can be used. See also \verb++ in the package \verb++ for Breusch-Godfrey test for higher order serial correlation. <>= # Fitting the model. mod1 <- lm(cci ~ macro.index, data=macro) library(lmtest) dwtest(mod1) @ \section{Graphics} \subsection{Save graphs in postscript} {\small \begin{verbatim} postscript("myfile.ps") hist(1:10) dev.off() \end{verbatim}} \subsection{Save graphs in pdf} {\small \begin{verbatim} pdf("myfile.pdf") hist(1:10) dev.off() \end{verbatim}} \newpage \subsection{Plotting the observations and the Regression line} Plot the data and the regression line. Plot \verb++ against \verb++ <>= X.LABEL= "Years of schooling" Y.LABEL= "Log hourly wage in SEK" TITLE <- "Figure 1: Scatterplot and the Regression line" SubTitle <- "Source: Level of Living Surveys, LNU, 1991" plot(school,log(wage), pch=".", main = TITLE, sub = SubTitle, xlab=X.LABEL, ylab=Y.LABEL) abline(lm(log(wage) ~ school )) abline(v = mean(school), col="red") abline(h = mean(log(wage)), col="red") @ \newpage \subsection{Plotting time series} Start by reading a time-series data set. For details see section \ref{TimeSeries}. Plot series in one diagram <>= TITLE <- "Figure 2: Consumer confidence index in Sweden" SubTitle <- "Source: Swedish Consumer Surveys" X.LABEL <- "Year" COLORS = c("red","blue","black") ts.plot(macro, col=COLORS, main = TITLE, sub = SubTitle, xlab=X.LABEL) legend("bottomright",legend=colnames(macro), lty=1,col=COLORS) @ \verb+ + plots series separately. \section{Writing functions} The syntax is: \verb+ myfunction <- function(x, a, ...) \{...\}+ The arguments for a function are the variables used in the operations as specified in the body of the function i.e. the codes within \{ \}. Once you have written a function and saved it, you can use this function to perform the operation as specified in \{ ...\} by referring to your function and using the arguments relevant for the actual computation. The following function computes the squared of mean of a variable. By defining the function \verb++ we can write \verb++ instead of \verb+<(mean(x))^2)>+ every time we want to compute the square of mean for a variable . <>= ms <- function(x) {(mean(x))^2} a <- 1:100 ms(a) @ \textit{The arguments of a function:} The following function has no arguments and prints the string of text, \verb++ <>= welc <- function() {print("Welcome")} welc() @ This function takes an argument \verb+x+. The arguments of the function must be supplied. <>= myprog.no.default <- function(x) print(paste("I use", x ,"for statistical computation.")) @ If a default value is specified, the default value is assumed when no arguments are supplied. <>= myprog <- function(x="R") {print(paste("I use", x ,"for statistical computation."))} myprog() myprog("R and sometimes something else") @ \subsection{A function for computing Clustered Standard Errors} Here follows a function for computing clustered-Standard Errors. (See also the function \verb+robcov+ in the library \verb+Design+ discussed above.) The arguments are a data frame \verb++, a model formula\verb++, and the cluster variable \verb++. {\small \begin{verbatim} clustered.standard.errors <- function(dat,f1, cluster){ attach(dat, warn.conflicts = FALSE) M <- length(unique(cluster)) N <- length(cluster) K <- lm(f1)$rank cl <- (M/(M-1))*((N-1)/(N-K)) X <- model.matrix(f1) invXpX <- solve(t(X) %*% X) ei <- resid(lm(f1)) uj <- as.matrix(aggregate(ei*X,list(cluster),FUN=sum)[-1]) sqrt(cl*diag(invXpX%*%t(uj)%*%uj%*%invXpX)) } \end{verbatim}} Notice that substituting the last line with\\ \verb+ sqrt( diag(invXpX %*%t(ei*X)%*%(X*ei)%*%invXpX) )+ \\ would yield White's standard errors. \section{Miscellaneous hints} \begin{tabular}{ll} Income Distribution & see \htmladdnormallink{ineq}{http://cran.at.r-project.org/src/ contrib/Descriptions/ineq.html}.\\ Logit & \verb+.+ \\ & See \verb+ & . +\\ Negative binomial & \verb++ in \htmladdnormallink{MASS, VR}{http://cran.at.r-project.org/src/contrib/Descriptions/VR .html}.\\ Poisson regression & \verb+.+\\ & See \verb+ & .+\\ Probit & \verb+.+ \\ & See \verb+ & . +\\ Simultaneous Equations & see \htmladdnormallink{ sem}{http://cran.at.r-project.org/src/contrib/Descriptions/ sem.html}, \htmladdnormallink{ systemfit}{http://cran.at.r-project.org/src/contrib/ Descriptions/systemfit.html}.\\ Time Series & see \verb++ \htmladdnormallink{tseries}{http://cran.at.r-project.org/src /contrib/Descriptions/tseries.html}, \htmladdnormallink{urca}{http://cran.at.r-project.org/src/ contrib/Descriptions/urca.html} and \htmladdnormallink{strucchange}{http://cran.at.r-project.org /src/contrib/Descriptions/strucchange.html}.\\ Tobit & \verb+see + in \htmladdnormallink{ survival}{http://cran.at.r-project.org/src/contrib/ Descriptions/survival.html}.\\ \end{tabular} \section{Acknowledgements} I am grateful to Michael Lundholm , Lena Nekby and Achim Zeileis for helpful comments. \newpage \bibliography{LittBibRintro} \bibliographystyle{plainnat} \end{document} <>= write(paste( "@BOOK{AER, AUTHOR = {Christian Kleiber and Achim Zeileis}, TITLE = {Applied Econometrics with R}, PUBLISHER = {Springer}, YEAR = 2008, ADDRESS = {New York}, NOTE = {ISBN 978-0-387-77316-2}, URL = {http://www.springer.com/978-0-387-77316-2}}" , "@Book{Dalgaard, author = {Peter Dalgaard}, title = {Introductory Statistics with {R}}, edition = {2nd}, year = 2008, publisher = {Springer}, note = {ISBN 978-0-387-79053-4}, pages = 380, url = {http://www.biostat.ku.dk/~pd/ISwR.html}, publisherurl = {http://www.springer.com/statistics/computational/book/978-0 -387-79053-4}}" , "@Book{FOX, author = {John Fox}, title = {An {R} and {S-Plus} Companion to Applied Regression}, publisher = {Sage Publications}, year = 2002, address = {Thousand Oaks, CA, USA}, note = {ISBN 0-761-92279-2}, url = {http://socserv.socsci.mcmaster.ca/jfox/Books/Companion/ index.html}}" , "@Manual{Rcore, title = {R: A Language and Environment for Statistical Computing}, author = {{R Development Core Team}}, organization = {R Foundation for Statistical Computing}, address = {Vienna, Austria}, year = {2008}, note = {{ISBN} 3-900051-07-0}, url = {http://www.R-project.org} }" , "@Unpublished{Koenker07, author = {Koenker, Roger and Zeileis, Achim}, year = {2007}, title = {Reproducible Econometric Research (A Critical Review of the State of the Art)}, note = {Report 60, Department of Statistics and Mathematics, Wirtschaftsuniversität Wien, Research Report Series}, url = {http://www.econ.uiuc.edu/~roger/research/repro/}}" , "@Book{VR, author = {William N. Venables and Brian D. Ripley}, title = {Modern Applied Statistics with {S}. Fourth Edition}, publisher = {Springer}, year = 2002, address = {New York}, note = {ISBN 0-387-95457-0}, url = {http://www.stats.ox.ac.uk/pub/MASS4/}, publisherurl = {http://www.springeronline.com/sgw/cda/frontpage/0,11855,4- 40109-22-1542120-0,00.html}}" , sep="\n"), file="LittBibRintro.bib") @