Chapter 7 Statistical models

In this chapter, we will not learn about every model out there that you may or may not need. Instead, I will show you how you can use what you have learned so far and apply these concepts to modeling. Also, as you read at the beginning of the book, R has many, many packages, so the model you need is most probably already implemented in some package.

7.1 Fitting a model to data

Suppose you have a variable y that you wish to explain using a set of other variables x1, x2, x3, etc. Let’s take a look at the Housing dataset from the Ecdat package:
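A minimal way to get the data into your session (assuming the Ecdat package is installed):

library(Ecdat)

data(Housing)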

You can read a description of the dataset by running:
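For instance:

?Housing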

Housing                 package:Ecdat                  R Documentation

Sales Prices of Houses in the City of Windsor

Description:

     a cross-section from 1987

     _number of observations_ : 546

     _observation_ : goods

     _country_ : Canada

Usage:

     data(Housing)

Format:

     A dataframe containing :

     price: sale price of a house

     lotsize: the lot size of a property in square feet

     bedrooms: number of bedrooms

     bathrms: number of full bathrooms

     stories: number of stories excluding basement

     driveway: does the house has a driveway ?

     recroom: does the house has a recreational room ?

     fullbase: does the house has a full finished basement ?

     gashw: does the house uses gas for hot water heating ?

     airco: does the house has central air conditioning ?

     garagepl: number of garage places

     prefarea: is the house located in the preferred neighbourhood of the city ?

Source:

     Anglin, P.M.  and R.  Gencay (1996) “Semiparametric estimation of
     a hedonic price function”, _Journal of Applied Econometrics_,
     *11(6)*, 633-648.

References:

     Verbeek, Marno (2004) _A Guide to Modern Econometrics_, John Wiley
     and Sons, chapter 3.

     Journal of Applied Econometrics data archive : <URL:
     http://qed.econ.queensu.ca/jae/>.

See Also:

     ‘Index.Source’, ‘Index.Economics’, ‘Index.Econometrics’,
     ‘Index.Observations’

or by looking for Housing in the help pane of RStudio. Usually, you would take a look at the data before doing any modeling:
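For example, with glimpse(), which comes with the tidyverse:

library(tidyverse)

glimpse(Housing)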

## Observations: 546
## Variables: 12
## $ price    <dbl> 42000, 38500, 49500, 60500, 61000, 66000, 66000, 6900...
## $ lotsize  <dbl> 5850, 4000, 3060, 6650, 6360, 4160, 3880, 4160, 4800,...
## $ bedrooms <dbl> 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 3, 2, 3, 3, 2, 2, 3, 4,...
## $ bathrms  <dbl> 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1,...
## $ stories  <dbl> 2, 1, 1, 2, 1, 1, 2, 3, 1, 4, 1, 1, 2, 1, 1, 1, 2, 3,...
## $ driveway <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes...
## $ recroom  <fct> no, no, no, yes, no, yes, no, no, yes, yes, no, no, n...
## $ fullbase <fct> yes, no, no, no, no, yes, yes, no, yes, no, yes, no, ...
## $ gashw    <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, n...
## $ airco    <fct> no, no, no, no, no, yes, no, no, no, yes, yes, no, no...
## $ garagepl <dbl> 1, 0, 0, 0, 0, 0, 2, 0, 0, 1, 3, 0, 0, 0, 0, 0, 1, 0,...
## $ prefarea <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, n...

Housing prices depend on a set of variables such as the number of bedrooms, the area in which the house is located, and so on. If you believe that housing prices depend linearly on a set of explanatory variables, you will want to estimate a linear model. To estimate a linear model, you will need to use the built-in lm() function:
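For example, a call along these lines (model1 is simply the name I give to the fitted object here):

model1 <- lm(price ~ lotsize + bedrooms, data = Housing)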

lm() takes a formula as an argument, which defines the model you want to estimate. In this case, I ran the following regression:

\[ \text{price} = \alpha + \beta_1 * \text{lotsize} + \beta_2 * \text{bedrooms} + \varepsilon \]

where \(\alpha\), \(\beta_1\) and \(\beta_2\) are three parameters to estimate. To take a look at the results, you can use the summary() method (not to be confused with dplyr::summarise()):
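Assuming the model was saved as model1 as above:

summary(model1)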

## 
## Call:
## lm(formula = price ~ lotsize + bedrooms, data = Housing)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -65665 -12498  -2075   8970  97205 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.613e+03  4.103e+03   1.368    0.172    
## lotsize     6.053e+00  4.243e-01  14.265  < 2e-16 ***
## bedrooms    1.057e+04  1.248e+03   8.470 2.31e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21230 on 543 degrees of freedom
## Multiple R-squared:  0.3703, Adjusted R-squared:  0.3679 
## F-statistic: 159.6 on 2 and 543 DF,  p-value: < 2.2e-16

If you wish to remove the intercept (\(\alpha\)) from your model, you can do so with -1 in the formula:
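For example:

model2 <- lm(price ~ -1 + lotsize + bedrooms, data = Housing)

summary(model2)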

## 
## Call:
## lm(formula = price ~ -1 + lotsize + bedrooms, data = Housing)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -67229 -12342  -1333   9627  95509 
## 
## Coefficients:
##           Estimate Std. Error t value Pr(>|t|)    
## lotsize      6.283      0.390   16.11   <2e-16 ***
## bedrooms 11968.362    713.194   16.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21250 on 544 degrees of freedom
## Multiple R-squared:  0.916,  Adjusted R-squared:  0.9157 
## F-statistic:  2965 on 2 and 544 DF,  p-value: < 2.2e-16

or if you want to use all the columns inside Housing:
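Here the . on the right-hand side of the formula stands for all the other columns:

model3 <- lm(price ~ ., data = Housing)

summary(model3)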

## 
## Call:
## lm(formula = price ~ ., data = Housing)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -41389  -9307   -591   7353  74875 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4038.3504  3409.4713  -1.184 0.236762    
## lotsize         3.5463     0.3503  10.124  < 2e-16 ***
## bedrooms     1832.0035  1047.0002   1.750 0.080733 .  
## bathrms     14335.5585  1489.9209   9.622  < 2e-16 ***
## stories      6556.9457   925.2899   7.086 4.37e-12 ***
## drivewayyes  6687.7789  2045.2458   3.270 0.001145 ** 
## recroomyes   4511.2838  1899.9577   2.374 0.017929 *  
## fullbaseyes  5452.3855  1588.0239   3.433 0.000642 ***
## gashwyes    12831.4063  3217.5971   3.988 7.60e-05 ***
## aircoyes    12632.8904  1555.0211   8.124 3.15e-15 ***
## garagepl     4244.8290   840.5442   5.050 6.07e-07 ***
## prefareayes  9369.5132  1669.0907   5.614 3.19e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15420 on 534 degrees of freedom
## Multiple R-squared:  0.6731, Adjusted R-squared:  0.6664 
## F-statistic: 99.97 on 11 and 534 DF,  p-value: < 2.2e-16

You can access different elements of model3 (for example) with $, because the result of lm() is a list:
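For example, to get the estimated coefficients:

model3$coefficients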

##  (Intercept)      lotsize     bedrooms      bathrms      stories 
## -4038.350425     3.546303  1832.003466 14335.558468  6556.945711 
##  drivewayyes   recroomyes  fullbaseyes     gashwyes     aircoyes 
##  6687.778890  4511.283826  5452.385539 12831.406266 12632.890405 
##     garagepl  prefareayes 
##  4244.829004  9369.513239

but I prefer to use the broom package, and more specifically the tidy() function, which converts model3 into a neat data.frame:
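A sketch of such a call (glimpse() is only there to print the result compactly):

library(broom)

tidy(model3) %>%
  glimpse()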

## Observations: 12
## Variables: 5
## $ term      <chr> "(Intercept)", "lotsize", "bedrooms", "bathrms", "st...
## $ estimate  <dbl> -4038.350425, 3.546303, 1832.003466, 14335.558468, 6...
## $ std.error <dbl> 3409.4713, 0.3503, 1047.0002, 1489.9209, 925.2899, 2...
## $ statistic <dbl> -1.184451, 10.123618, 1.749764, 9.621691, 7.086369, ...
## $ p.value   <dbl> 2.367616e-01, 3.732442e-22, 8.073341e-02, 2.570369e-...

This is useful because you can then work on the results easily. For example, if you wish to only keep the results that are significant at the 5% level:
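For example:

tidy(model3) %>%
  filter(p.value < 0.05)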

## # A tibble: 10 x 5
##    term        estimate std.error statistic  p.value
##    <chr>          <dbl>     <dbl>     <dbl>    <dbl>
##  1 lotsize         3.55     0.350     10.1  3.73e-22
##  2 bathrms     14336.    1490.         9.62 2.57e-20
##  3 stories      6557.     925.         7.09 4.37e-12
##  4 drivewayyes  6688.    2045.         3.27 1.15e- 3
##  5 recroomyes   4511.    1900.         2.37 1.79e- 2
##  6 fullbaseyes  5452.    1588.         3.43 6.42e- 4
##  7 gashwyes    12831.    3218.         3.99 7.60e- 5
##  8 aircoyes    12633.    1555.         8.12 3.15e-15
##  9 garagepl     4245.     841.         5.05 6.07e- 7
## 10 prefareayes  9370.    1669.         5.61 3.19e- 8

You can even add new columns, such as the confidence intervals:
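tidy() has a conf.int argument for this:

tidy(model3, conf.int = TRUE)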

## # A tibble: 12 x 7
##    term         estimate std.error statistic  p.value  conf.low conf.high
##    <chr>           <dbl>     <dbl>     <dbl>    <dbl>     <dbl>     <dbl>
##  1 (Intercept)  -4038.    3409.        -1.18 2.37e- 1 -10736.     2659.  
##  2 lotsize          3.55     0.350     10.1  3.73e-22      2.86      4.23
##  3 bedrooms      1832.    1047.         1.75 8.07e- 2   -225.     3889.  
##  4 bathrms      14336.    1490.         9.62 2.57e-20  11409.    17262.  
##  5 stories       6557.     925.         7.09 4.37e-12   4739.     8375.  
##  6 drivewayyes   6688.    2045.         3.27 1.15e- 3   2670.    10705.  
##  7 recroomyes    4511.    1900.         2.37 1.79e- 2    779.     8244.  
##  8 fullbaseyes   5452.    1588.         3.43 6.42e- 4   2333.     8572.  
##  9 gashwyes     12831.    3218.         3.99 7.60e- 5   6511.    19152.  
## 10 aircoyes     12633.    1555.         8.12 3.15e-15   9578.    15688.  
## 11 garagepl      4245.     841.         5.05 6.07e- 7   2594.     5896.  
## 12 prefareayes   9370.    1669.         5.61 3.19e- 8   6091.    12648.

Going back to model estimation, you can of course use lm() in a pipe workflow:
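A sketch of such a pipeline (I assume here that driveway and stories were dropped with a single select() call):

Housing %>%
  select(-driveway, -stories) %>%
  lm(price ~ ., data = .) %>%
  tidy()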

## # A tibble: 10 x 5
##    term        estimate std.error statistic  p.value
##    <chr>          <dbl>     <dbl>     <dbl>    <dbl>
##  1 (Intercept)  3025.    3263.        0.927 3.54e- 1
##  2 lotsize         3.67     0.363    10.1   4.52e-22
##  3 bedrooms     4140.    1036.        3.99  7.38e- 5
##  4 bathrms     16443.    1546.       10.6   4.29e-24
##  5 recroomyes   5660.    2010.        2.82  5.05e- 3
##  6 fullbaseyes  2241.    1618.        1.38  1.67e- 1
##  7 gashwyes    13568.    3411.        3.98  7.93e- 5
##  8 aircoyes    15578.    1597.        9.75  8.53e-21
##  9 garagepl     4232.     883.        4.79  2.12e- 6
## 10 prefareayes 10729.    1753.        6.12  1.81e- 9

The first . in the lm() call indicates that we wish to use all the remaining columns of Housing (minus driveway and stories, which I removed using select() and the - sign), and the second . indicates where the result of the preceding dplyr instructions should go, namely as the data argument of lm(). The picture below should help you understand.

You have to specify this because, by default, when using %>%, the left-hand side argument gets passed as the first argument of the function on the right-hand side.
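For example:

Housing %>% select(price)                    # equivalent to select(Housing, price)

Housing %>% lm(price ~ lotsize, data = .)    # the . places Housing as the data argument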

7.2 Diagnostics

Diagnostics are useful metrics to assess model fit. You can read some of these diagnostics, such as the \(R^2\), at the bottom of the summary (when running summary(my_model)), but if you want to do more than simply read them off the screen, you can put them in a data.frame too, using broom::glance():
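For example:

glance(model3)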

## # A tibble: 1 x 11
##   r.squared adj.r.squared  sigma statistic   p.value    df logLik    AIC
## *     <dbl>         <dbl>  <dbl>     <dbl>     <dbl> <int>  <dbl>  <dbl>
## 1     0.673         0.666 15423.     100.0 6.18e-122    12 -6034. 12094.
## # ... with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>

You can also produce the usual diagnostic plots using ggfortify::autoplot(), which uses the ggplot2 package under the hood:
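A minimal sketch:

library(ggfortify)

autoplot(model3, which = 1:6)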

which = 1:6 is an additional option that shows all six diagnostic plots. If you omit this option, you will only get four of them.

You can also get the residuals of the regression in two ways; either you grab them directly from the model fit:
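For instance (the object name is arbitrary):

housing_res <- residuals(model3)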

or you can augment the original data with a residuals column, using broom::augment():
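augment() returns the data the model was estimated on, with diagnostic columns added:

housing_aug <- augment(model3)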

Let’s take a look at housing_aug:
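Again with glimpse():

glimpse(housing_aug)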

## Observations: 546
## Variables: 19
## $ price      <dbl> 42000, 38500, 49500, 60500, 61000, 66000, 66000, 69...
## $ lotsize    <dbl> 5850, 4000, 3060, 6650, 6360, 4160, 3880, 4160, 480...
## $ bedrooms   <dbl> 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 3, 2, 3, 3, 2, 2, 3, ...
## $ bathrms    <dbl> 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, ...
## $ stories    <dbl> 2, 1, 1, 2, 1, 1, 2, 3, 1, 4, 1, 1, 2, 1, 1, 1, 2, ...
## $ driveway   <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ recroom    <fct> no, no, no, yes, no, yes, no, no, yes, yes, no, no,...
## $ fullbase   <fct> yes, no, no, no, no, yes, yes, no, yes, no, yes, no...
## $ gashw      <fct> no, no, no, no, no, no, no, no, no, no, no, no, no,...
## $ airco      <fct> no, no, no, no, no, yes, no, no, no, yes, yes, no, ...
## $ garagepl   <dbl> 1, 0, 0, 0, 0, 0, 2, 0, 0, 1, 3, 0, 0, 0, 0, 0, 1, ...
## $ prefarea   <fct> no, no, no, no, no, no, no, no, no, no, no, no, no,...
## $ .fitted    <dbl> 66037.98, 41391.15, 39889.63, 63689.09, 49760.43, 6...
## $ .se.fit    <dbl> 1790.507, 1406.500, 1534.102, 2262.056, 1567.689, 2...
## $ .resid     <dbl> -24037.9757, -2891.1515, 9610.3699, -3189.0873, 112...
## $ .hat       <dbl> 0.013477335, 0.008316321, 0.009893730, 0.021510891,...
## $ .sigma     <dbl> 15402.01, 15437.14, 15431.98, 15437.02, 15429.89, 1...
## $ .cooksd    <dbl> 2.803214e-03, 2.476265e-05, 3.265481e-04, 8.004787e...
## $ .std.resid <dbl> -1.56917096, -0.18823924, 0.62621736, -0.20903274, ...

A few columns have been added to the original data, among them .resid which contains the residuals. Let’s plot them:
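One way to do this, for instance as a density plot:

ggplot(housing_aug) +
  geom_density(aes(x = .resid))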

Fitted values are also added to the original data, under the variable .fitted. It would also have been possible to get the fitted values with:
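For instance with the fitted() accessor:

fitted(model3)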

but I prefer using augment(), because the new columns are merged with the original data, which makes it easier to find specific observations. For example, you might want to know for which housing units the model underestimates the price:
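A sketch of such a query:

housing_aug %>%
  filter(.resid > 0) %>%
  summarise(total = n())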

We find 261 housing units for which the residuals are positive. It is also easier to extract outliers:
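For instance, ranking observations by Cook's distance (the prank column is explained right below):

housing_aug %>%
  mutate(prank = cume_dist(.cooksd)) %>%
  filter(prank > 0.99) %>%
  glimpse()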

## Observations: 6
## Variables: 20
## $ price      <dbl> 163000, 125000, 132000, 175000, 190000, 174500
## $ lotsize    <dbl> 7420, 4320, 3500, 9960, 7420, 7500
## $ bedrooms   <dbl> 4, 3, 4, 3, 4, 4
## $ bathrms    <dbl> 1, 1, 2, 2, 2, 2
## $ stories    <dbl> 2, 2, 2, 2, 3, 2
## $ driveway   <fct> yes, yes, yes, yes, yes, yes
## $ recroom    <fct> yes, no, no, no, no, no
## $ fullbase   <fct> yes, yes, no, yes, no, yes
## $ gashw      <fct> no, yes, yes, no, no, no
## $ airco      <fct> yes, no, no, no, yes, yes
## $ garagepl   <dbl> 2, 2, 2, 2, 2, 3
## $ prefarea   <fct> no, no, no, yes, yes, yes
## $ .fitted    <dbl> 94826.68, 77688.37, 85495.58, 108563.18, 115125.03,...
## $ .se.fit    <dbl> 2520.691, 3551.954, 3544.961, 2589.680, 2185.603, 2...
## $ .resid     <dbl> 68173.32, 47311.63, 46504.42, 66436.82, 74874.97, 5...
## $ .hat       <dbl> 0.02671105, 0.05303793, 0.05282929, 0.02819317, 0.0...
## $ .sigma     <dbl> 15144.70, 15293.34, 15298.27, 15159.14, 15085.99, 1...
## $ .cooksd    <dbl> 0.04590995, 0.04637969, 0.04461464, 0.04616068, 0.0...
## $ .std.resid <dbl> 4.480428, 3.152300, 3.098176, 4.369631, 4.904193, 3...
## $ prank      <dbl> 0.9963370, 1.0000000, 0.9945055, 0.9981685, 0.99267...

prank is a variable I created with cume_dist(), a dplyr function that returns the proportion of all values less than or equal to the current rank. For example:
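The exact numbers fed to cume_dist() do not matter here; any strictly decreasing vector of seven distinct values reproduces the output below:

example <- c(5, 4.6, 2, 1, 0.8, 0, -1)

cume_dist(example)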

## [1] 1.0000000 0.8571429 0.7142857 0.5714286 0.4285714 0.2857143 0.1428571

By filtering on prank > 0.99 we get the top 1% of outliers according to Cook's distance.

7.3 Interpreting models

Model interpretation is essential in the social sciences. If one wants to know the effect of variable x on the dependent variable y, marginal effects have to be computed. This is easily done in R with the margins package, which aims to provide the same functionality as the margins command in Stata:
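A sketch of the workflow (effects_model3 is the name reused further down):

library(margins)

effects_model3 <- margins(model3)

summary(effects_model3)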

##       factor        AME        SE       z      p      lower      upper
##     aircoyes 12632.8904 1555.0329  8.1239 0.0000  9585.0819 15680.6989
##      bathrms 14335.5585 1482.9885  9.6667 0.0000 11428.9545 17242.1624
##     bedrooms  1832.0035 1045.6558  1.7520 0.0798  -217.4442  3881.4512
##  drivewayyes  6687.7789 2045.2636  3.2699 0.0011  2679.1359 10696.4219
##  fullbaseyes  5452.3855 1587.9782  3.4335 0.0006  2340.0054  8564.7657
##     garagepl  4244.8290  847.2173  5.0103 0.0000  2584.3136  5905.3444
##     gashwyes 12831.4063 3217.6211  3.9879 0.0001  6524.9848 19137.8277
##      lotsize     3.5463    0.3503 10.1250 0.0000     2.8598     4.2328
##  prefareayes  9369.5132 1669.1034  5.6135 0.0000  6098.1307 12640.8957
##   recroomyes  4511.2838 1899.9255  2.3745 0.0176   787.4982  8235.0694
##      stories  6556.9457  924.4211  7.0930 0.0000  4745.1137  8368.7777

It is also possible to plot the results:
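margins objects come with a plot() method:

plot(effects_model3)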

This uses base R's plotting capabilities, which is convenient because it only takes a simple call to plot(). But if you have been using ggplot2 and want this graph to have the same look and feel as the others made with ggplot2, you first need to save the summary in a variable. summary(effects_model3) is a data.frame with many more details. Let's overwrite effects_model3 with its summary:
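For example:

effects_model3 <- summary(effects_model3)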

And now it is possible to use ggplot2 to have the same plot:
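A sketch using the columns of the summary (factor, AME, lower and upper):

ggplot(data = effects_model3) +
  geom_point(aes(x = factor, y = AME)) +
  geom_errorbar(aes(x = factor, ymin = lower, ymax = upper))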

Of course for model3, the marginal effects are the same as the coefficients, so let’s estimate a logit model and compute the marginal effects. Logit models can be estimated using the glm() function. As an example, we are going to use the Participation data, also from the Ecdat package:
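The data ships with Ecdat, so it can be loaded and its help page read as before:

data(Participation)

?Participation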

Participation              package:Ecdat               R Documentation

Labor Force Participation

Description:

     a cross-section

     _number of observations_ : 872

     _observation_ : individuals

     _country_ : Switzerland

Usage:

     data(Participation)

Format:

     A dataframe containing :

     lfp labour force participation ?

     lnnlinc the log of nonlabour income

     age age in years divided by 10

     educ years of formal education

     nyc the number of young children (younger than 7)

     noc number of older children

     foreign foreigner ?

Source:

     Gerfin, Michael (1996) “Parametric and semiparametric estimation
     of the binary response”, _Journal of Applied Econometrics_,
     *11(3)*, 321-340.

References:

     Davidson, R.  and James G.  MacKinnon (2004) _Econometric Theory
     and Methods_, New York, Oxford University Press, <URL:
     http://www.econ.queensu.ca/ETM/>, chapter 11.

     Journal of Applied Econometrics data archive : <URL:
     http://qed.econ.queensu.ca/jae/>.

See Also:

     ‘Index.Source’, ‘Index.Economics’, ‘Index.Econometrics’,
     ‘Index.Observations’

The variable of interest is lfp: whether the individual participates in the labour force. To know which variables are relevant in the decision to participate in the labour force, one could estimate a logit model, using glm().
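A sketch of such a call (logit_participation is my name for the fitted object; tidy() turns the coefficients into the table below):

logit_participation <- glm(lfp ~ ., data = Participation, family = "binomial")

tidy(logit_participation)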

## # A tibble: 7 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  10.4       2.17       4.79  1.69e- 6
## 2 lnnlinc      -0.815     0.206     -3.97  7.31e- 5
## 3 age          -0.510     0.0905    -5.64  1.72e- 8
## 4 educ          0.0317    0.0290     1.09  2.75e- 1
## 5 nyc          -1.33      0.180     -7.39  1.51e-13
## 6 noc          -0.0220    0.0738    -0.298 7.66e- 1
## 7 foreignyes    1.31      0.200      6.56  5.38e-11

From the results above, one can only interpret the sign of the coefficients. To know how much a variable influences the labour force participation, one has to use margins():
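Assuming the fitted model was saved as logit_participation:

effects_logit_participation <- margins(logit_participation)

summary(effects_logit_participation)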

##      factor     AME     SE       z      p   lower   upper
##         age -0.1064 0.0176 -6.0494 0.0000 -0.1409 -0.0719
##        educ  0.0066 0.0060  1.0955 0.2733 -0.0052  0.0185
##  foreignyes  0.2834 0.0399  7.1102 0.0000  0.2053  0.3615
##     lnnlinc -0.1699 0.0415 -4.0994 0.0000 -0.2512 -0.0887
##         noc -0.0046 0.0154 -0.2981 0.7656 -0.0347  0.0256
##         nyc -0.2775 0.0333 -8.3433 0.0000 -0.3426 -0.2123

We can use the previous code to plot the marginal effects:
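A sketch, following the same steps as before (overwrite the margins object with its summary, then plot it with ggplot2):

effects_logit_participation <- summary(effects_logit_participation)

ggplot(data = effects_logit_participation) +
  geom_point(aes(x = factor, y = AME)) +
  geom_errorbar(aes(x = factor, ymin = lower, ymax = upper))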

So an infinitesimal increase of, say, 0.001 in non-labour income (lnnlinc) is associated with a decrease in the probability of labour force participation of about 0.17 × 0.001 = 0.00017, i.e. roughly 0.017 percentage points.

You can also extract the marginal effects of a single variable:
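For example, with dydx() from the margins package (head() keeps the output short):

head(dydx(Participation, logit_participation, "lnnlinc"))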

##   dydx_lnnlinc
## 1  -0.15667764
## 2  -0.20014487
## 3  -0.18495109
## 4  -0.05377262
## 5  -0.18710476
## 6  -0.19586986

This makes it possible to extract the effect for a list of individuals that you can create yourself:
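A sketch of how such a data set could be built; the values below are made up for illustration and are not the ones behind the output shown here:

simulated_data <- tribble(
  ~lfp,  ~lnnlinc, ~age, ~educ, ~nyc, ~noc, ~foreign,
  "yes", 10.8,     3.5,  12,    1,    1,    "yes",   # made-up individual
  "no",  10.5,     4.5,  8,     0,    2,    "no"     # made-up individual
)

dydx(simulated_data, logit_participation, "lnnlinc")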

##   dydx_lnnlinc
## 1  -0.09228119
## 2  -0.17953451

I used the tribble() function from the tibble package to create this test data set, row by row. Then, using dydx(), I get the marginal effect of variable lnnlinc for these two individuals.

7.4 Comparing models

Let’s estimate another model on the same data. Prices are only positive, so a linear regression might not be the best model, because it allows for negative prices. Let’s look at the distribution of prices:
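For instance with a density plot:

ggplot(Housing) +
  geom_density(aes(x = price))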

It looks like modeling the log of price might provide a better fit:
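A call along these lines (model_log is the name used in the rest of the chapter):

model_log <- lm(log(price) ~ ., data = Housing)

tidy(model_log)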

## # A tibble: 12 x 5
##    term          estimate  std.error statistic  p.value
##    <chr>            <dbl>      <dbl>     <dbl>    <dbl>
##  1 (Intercept) 10.0       0.0472        212.   0.      
##  2 lotsize      0.0000506 0.00000485     10.4  2.91e-23
##  3 bedrooms     0.0340    0.0145          2.34 1.94e- 2
##  4 bathrms      0.168     0.0206          8.13 3.10e-15
##  5 stories      0.0923    0.0128          7.20 2.10e-12
##  6 drivewayyes  0.131     0.0283          4.61 5.04e- 6
##  7 recroomyes   0.0735    0.0263          2.79 5.42e- 3
##  8 fullbaseyes  0.0994    0.0220          4.52 7.72e- 6
##  9 gashwyes     0.178     0.0446          4.00 7.22e- 5
## 10 aircoyes     0.178     0.0215          8.26 1.14e-15
## 11 garagepl     0.0508    0.0116          4.36 1.58e- 5
## 12 prefareayes  0.127     0.0231          5.50 6.02e- 8

Let’s take a look at the diagnostics:
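As before, with glance():

glance(model_log)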

## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC
## *     <dbl>         <dbl> <dbl>     <dbl>     <dbl> <int>  <dbl> <dbl>
## 1     0.677         0.670 0.214      102. 3.67e-123    12   73.9 -122.
## # ... with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>

Let’s compare these to the ones from the previous model:
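A sketch of how this can be done (the labels in the model column are my choice):

diag_lm <- glance(model3) %>%
  mutate(model = "lin-lin model")

diag_log <- glance(model_log) %>%
  mutate(model = "log-lin model")

full_join(diag_lm, diag_log)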

## Joining, by = c("r.squared", "adj.r.squared", "sigma", "statistic", "p.value", "df", "logLik", "AIC", "BIC", "deviance", "df.residual", "model")
## # A tibble: 2 x 12
##   r.squared adj.r.squared   sigma statistic   p.value    df  logLik     AIC
##       <dbl>         <dbl>   <dbl>     <dbl>     <dbl> <int>   <dbl>   <dbl>
## 1     0.673         0.666 1.54e+4     100.0 6.18e-122    12 -6034.   12094.
## 2     0.677         0.670 2.14e-1     102.  3.67e-123    12    73.9   -122.
## # ... with 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
## #   model <chr>

I saved the diagnostics in two different data.frames using the glance() function and added a model column to indicate which model each set of diagnostics comes from. Then I merged both data sets using full_join(), a dplyr function.

As you can see, the model with the logarithm of the prices as the explained variable has a higher likelihood (and thus lower AIC and BIC) than the simple linear model. Let’s take a look at the diagnostics plots:
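As before, with ggfortify:

autoplot(model_log, which = 1:6)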

7.5 Using a model for prediction

Once you have estimated a model, you might want to use it for prediction. This is easily done using the predict() function, which works with most models. Prediction is also useful as a way to test the accuracy of your model: split your data into a training set (used for estimation) and a testing set (used for the pseudo-prediction) and see if your model overfits the data. We are going to see how to do that in a later section; for now, let's just get acquainted with predict().

Let’s go back to the models we estimated in the previous section, model3 and model_log. Let’s also take a subsample of data, which we will be using for prediction:
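A sketch (the seed value is arbitrary; sample_n() comes from dplyr):

set.seed(1234)

pred_set <- Housing %>%
  sample_n(20)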

To make sure we always get the same pred_set, I set the random seed first. Let's take a look at the data:
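Simply printing it:

pred_set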

##      price lotsize bedrooms bathrms stories driveway recroom fullbase
## 63   52000    4280        2       1       1      yes      no       no
## 340  62500    3900        3       1       2      yes      no       no
## 332 175000    8960        4       4       4      yes      no       no
## 339 141000    8100        4       1       2      yes     yes      yes
## 467  54000    2856        3       1       3      yes      no       no
## 347  52000    4130        3       2       2      yes      no       no
## 6    66000    4160        3       1       1      yes     yes      yes
## 126  95000    4260        4       2       2      yes      no       no
## 359  97000   12090        4       2       2      yes      no       no
## 277  70000    6300        3       1       1      yes      no       no
## 372  85000    6420        3       1       1      yes      no      yes
## 292  39000    4000        3       1       2      yes      no       no
## 151  59900    3450        3       1       2      yes      no       no
## 493  53000    4050        2       1       1      yes      no       no
## 156  60000    2610        4       3       2       no      no       no
## 445  72500    5720        2       1       2      yes      no       no
## 152  35500    3000        3       1       2       no      no       no
## 142  40000    2650        3       1       2      yes      no      yes
## 99   35000    3500        2       1       1      yes     yes       no
## 123  37000    4400        2       1       1      yes      no       no
##     gashw airco garagepl prefarea
## 63     no   yes        2       no
## 340    no    no        0       no
## 332    no   yes        3       no
## 339    no   yes        2      yes
## 467    no    no        0      yes
## 347    no    no        2       no
## 6      no   yes        0       no
## 126   yes    no        0       no
## 359    no    no        2      yes
## 277    no   yes        2       no
## 372    no   yes        0      yes
## 292    no    no        1       no
## 151    no    no        1       no
## 493    no    no        0       no
## 156    no    no        0       no
## 445    no   yes        0      yes
## 152    no    no        0       no
## 142    no    no        1       no
## 99     no    no        0       no
## 123    no    no        0       no

If we wish to use it for prediction, this is easily done with predict():
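For example:

predict(model3, pred_set)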

##        63       340       332       339       467       347         6 
##  63506.66  49425.47 150689.71 106607.68  61649.59  73066.34  66387.12 
##       126       359       277       372       292       151       493 
##  79701.11 112496.42  72502.20  79260.00  54024.93  52074.46  41568.47 
##       156       445       152       142        99       123 
##  68666.08  76050.14  39546.02  54689.81  44129.28  42809.67

This returns a vector of predicted prices, which can then be used, for instance, to compute the root mean squared error. Let's do it within a tidyverse pipeline:
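A sketch of such a pipeline (not necessarily the exact computation behind the figure quoted below):

rmse <- pred_set %>%
  mutate(predictions = predict(model3, .)) %>%
  summarise(rmse = sqrt(sum((price - predictions)^2) / n())) %>%
  pull(rmse)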

The root mean square error of model3 is 1666.1312666.

I also use the n() function, which returns the number of observations in a group (or of the whole data set, if the data is not grouped). Let's compare model3's RMSE with the one from model_log:
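The same sketch for model_log, this time exponentiating the predictions:

rmse_log <- pred_set %>%
  mutate(predictions = exp(predict(model_log, .))) %>%
  summarise(rmse = sqrt(sum((price - predictions)^2) / n())) %>%
  pull(rmse)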

Don’t forget to exponentiate the predictions; remember you’re dealing with a log-linear model! model_log’s RMSE is 1359.0392252, which is lower than model3’s. However, keep in mind that the model was estimated on the whole data, and the prediction quality was then assessed on a subsample of the very data the model was estimated on... so we can't really say whether model_log’s predictions are actually useful. Of course, the same is true for model3. In a later section we are going to learn how to do cross-validation to avoid this issue.

Another problem with what I did above, unrelated to statistics per se, is that I wanted to compute the same quantity for two different models and did so by copying and pasting three lines of code. That's not much, but if I wanted to compare 10 models, copy-and-paste mistakes could have sneaked in. Instead, it would have been nicer to have a function that computes the RMSE and to then use it on my models. We are going to learn how to write our own functions and use them just as if they were built-in R functions.

7.6 Beyond linear regression

R has a lot of other built-in functions for regression, such as glm() (for generalized linear models) and nls() (for nonlinear least squares). There are also functions and additional packages for time series, panel data, machine learning, Bayesian and nonparametric methods. Presenting everything here would take too much space and would be pretty useless anyway, as you can find whatever you need using an internet search engine. What you have learned until now is quite general and should work with many types of models. To help you out, here is a list of methods and the recommended packages you can use:

Model | Package | Quick example
Robust Linear Regression | MASS | rlm(y ~ x, data = mydata)
Nonlinear Least Squares | stats [1] | nls(y ~ x1 / (1 + x2), data = mydata) [2]
Logit | stats | glm(y ~ x, data = mydata, family = "binomial")
Probit | stats | glm(y ~ x, data = mydata, family = binomial(link = "probit"))
K-Means | stats | kmeans(data, n) [3]
PCA | stats | prcomp(data, scale = TRUE, center = TRUE) [4]
Multinomial Logit | mlogit | Requires several steps of data pre-processing and formula definition; refer to the package's vignette for more details.
Cox PH | survival | coxph(Surv(y_time, y_status) ~ x, data = mydata) [5]
Time series | Several, depending on your needs | Time series in R is a vast subject that would require a very thick book to cover. You can get started with the series of blog articles Tidy time-series, part 1, part 2, part 3 and part 4.
Panel data | plm | plm(y ~ x, data = mydata, model = "within|random")
Neural Networks | Several, depending on your needs | R is a very popular programming language for machine learning. This blog post lists and compares some of the most useful packages for neural nets and deep learning.
Nonparametric regression | np | Several functions and options available; refer to the package's vignette for more details.

I put neural networks in the table, but you can also find packages for regression trees, naive Bayes, and pretty much any machine learning method out there! The same goes for Bayesian methods. Popular packages include rstan and rjags, which link R to Stan and JAGS (two other pieces of software that do the MCMC sampling for you), tools that allow you to fit very general models. It is also possible to estimate models using Bayesian inference without external tools, with the bayesm package, which estimates the usual microeconometric models. There really are a lot of packages available for Bayesian inference, and you can find them all in the related CRAN Task View.


  1. This package gets installed with R, no need to add it

  2. The formula in the example is shown for illustration purposes.

  3. data must only contain numeric values, and n is the number of clusters.

  4. data must only contain numeric values, or a formula can be provided.

  5. Surv(y_time, y_status) creates a survival object, where y_time is the time to event y_status. It is possible to create more complex survival objects depending on exactly which data you are dealing with.