Preface

The following provides a brief introduction to generalized additive models and some thoughts on getting started within the R environment. It doesn’t assume much more than a basic exposure to regression, and maybe a general idea of R, though not necessarily any particular expertise. The presentation is of a very applied nature, such that topics build upon the familiar and generalize to the less so, with the hope that one can bring the concepts they are comfortable with to the new material. The audience in mind is a researcher with typical applied science training.

As this document is more conceptual, a basic familiarity with R is all that is needed to follow the code, though there is much to be gained from simple web browsing on R if one needs it. And while it wasn’t the intention starting out, this document could be seen as a vignette for the mgcv package, which is highly recommended.

This document was created with Rstudio and rmarkdown. Last modified 2017-03-25. Original draft August, 2012.

R Info: R version 3.3.3 (2017-03-06) Another Canoe

Introduction

Beyond the General Linear Model I

General Linear Model

Let us begin by considering the standard linear regression model (SLiM) estimated via ordinary least squares (OLS). We have some response variable we wish to study, and believe it to be some function of other variables. In the margin to the right, \(y\) is the variable of interest, \[y\sim \mathcal{N}(\mu,\sigma^2) \\ \mu = b_0+b_1X_1+b_2X_2...+b_pX_p\] assumed to be normally distributed with mean \(\mu\) and variance \(\sigma^2\), and the Xs are the predictor variables/covariates in this scenario. The predictors are multiplied by the coefficients (\(\beta\)) and summed, giving us the linear predictor, which in this case also directly provides us the estimated fitted values.

Since we’ll be using R and might as well get into that mindset, I’ll give an example of what the R code might look like.

mymod = lm(y ~ x1 + x2, data=mydata)

One of the issues with this model is that, in its basic form, it can be very limiting in its assumptions about the data generating process for the variable we wish to study. Also, while the above is one variant of the usual manner in which it is presented in introductory texts, it very typically will not capture what is going on in a great many data situations.

Generalized Linear Model

In that light, we may consider the generalized linear model. Generalized linear models incorporate other types of distributions (of the exponential family), and include a link function \(g(.)\) relating the mean \(\mu\), or stated differently, the expected values \(E(y)\), to the linear predictor \(X\beta\), often denoted \(\eta\). The general form is thus \(g(\mu) = X\beta\).

\[g(\mu) = \eta = X\beta \\ E(y) = \mu = g^{-1}(\eta)\]

Consider again the typical linear regression. We assume a Gaussian (i.e. Normal) distribution for the response, we assume equal variance for all observations, and that there is a direct link of the linear predictor and the expected value \(\mu\), i.e. \(X\beta = \mu\). In fact the typical linear regression model is a generalized linear model with a Gaussian distribution and ‘identity’ link function.

To further illustrate the generalization, we consider a distribution other than the Gaussian. In the example to the right, we examine a Poisson distribution for some vector of count data.
\[y \sim \mathcal{P}(\mu) \\ g(\mu) = b_0+b_1X_1+b_2X_2...+b_pX_p\] There is only one parameter to be considered, \(\mu\), since for the Poisson the mean and variance are equal. For the Poisson, the (canonical) link function \(g(.)\), is the natural log, and so relates the log of \(\mu\) to the linear predictor. As such we could also write it in terms of exponentiating the right hand side. \[\mu = e^{b_0+b_1X_1+b_2X_2...+b_pX_p}\]
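In R, such a model might be fit along the lines of the following (a minimal sketch with hypothetical count data and predictor names, mirroring the earlier lm example).

mymod_glm = glm(counts ~ x1 + x2, data=mydata, family=poisson)  # log link is the default for poisson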

While there is a great deal to further explore with regard to generalized linear models, the point here is simply to note them as a generalization of the typical linear model with which all would be familiar after an introductory course in statistics. As we eventually move to generalized additive models, we can see them as a subsequent step in the generalization.

Generalized Additive Model

Now let us make another generalization to incorporate nonlinear forms of the predictors. The form \[ y \sim ExpoFam(\mu, etc.) \\ \mu = E(y) \\ g(\mu) = b_0 + f(x_1) + f(x_2) +...+f(x_p)\] at the right gives the new setup relating our new, now nonlinear predictor to the expected value, with whatever link function may be appropriate.

So what’s the difference? In short, we are using smooth functions of our predictor variables, which can take on a great many forms, with more detail in the following section. For now, it is enough to note the observed values \(y\) are assumed to be of some exponential family distribution, and \(\mu\) is still related to the model predictors via a link function. The key difference is that the linear predictor now incorporates smooth functions of at least some (possibly all) covariates, represented as \(f(x)\).

Beyond the General Linear Model II

Fitting the Standard Linear Model

In many data situations the relationship between some covariate(s) \(X\) and response \(y\) is not so straightforward. Consider the following plot. Attempting to fit a standard OLS regression results in the blue line, which doesn’t capture the relationship very well.

Polynomial Regression

\[ y \sim \mathcal{N}(\mu,\sigma^2)\\ \mu = b_0 + b_{1}X_1+b_{2}X^2 \] One common approach we could undertake is to add a transformation of the predictor \(X\), and in this case we might consider a quadratic term such that our model looks something like that to the right.

We haven’t really moved from the general linear model in this case, but we have a much better fit to the data as evidenced by the graph.

Scatterplot Smoothing

There are still other possible routes we could take. Many are probably familiar with loess (or lowess) regression, if only in the sense that often statistical packages may, either by default or with relative ease, add a nonparametric fit line to a scatterplot (see the scatterplot functions in the car package, for example). By default this is typically lowess, or locally weighted scatterplot smoothing.

Take a look at the following figure. For every (sorted) \(x_0\) value, we choose a neighborhood around it and, for example, fit a simple regression. As an example, let’s look at \(x_0\) = 3.0, and choose a neighborhood of 100 X values below and 100 values above.

Now, for just that range we fit a linear regression model and note the fitted value where \(x_0=3.0\).

If we now do this for every \(x_0\) value, we’ll have fitted values based on a rolling neighborhood that is relative to each value being considered. Other options we could throw into this process, and usually do, would be to fit a polynomial regression for each neighborhood, and to weight observations by how close they are to the value being considered, with less weight given to more distant observations.

The above plot shows the result from such a fitting process. For comparison the regular regression fit is also provided. Even without using a lowess approach, we could have fit a model assuming a quadratic relationship, \(y = x + x^2\), and it would result in a far better fit than the simpler model (\(y\) was in fact constructed as \(x + x^2 +\) noise). While polynomial regression might suffice in some cases, the issue is that nonlinear relationships are generally not specified so easily, as we’ll see next (Bolker 2008 notes an example of fitting a polynomial to 1970s U.S. Census data that would have predicted a population crash to zero by 2015).
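As a minimal sketch of the above ideas on simulated data (constructed here as \(x + x^2 +\) noise, in keeping with the note above; all values are made up for illustration), one can compare a standard linear fit to a lowess fit:

set.seed(123)
x = sort(runif(500, -2, 2))
y = x + x^2 + rnorm(500, sd=.5)

plot(x, y, col='gray80', pch=19, cex=.5)
abline(lm(y ~ x), col='blue', lwd=2)         # standard linear fit
lines(lowess(x, y), col='darkred', lwd=2)    # locally weighted scatterplot smoother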

Generalized Additive Models

The next figure regards a data set giving a series of measurements of head acceleration in a simulated motorcycle accident (this is a version of that found in Venables and Ripley (2002); time is in milliseconds, acceleration in g). Here we have data that are probably not going to be captured with simple transformations of the predictors. We can compare various approaches, and the first is a straightforward cubic polynomial regression, which unfortunately doesn’t help us much. We could try higher orders, which would help, but in the end we will need something more flexible, and generalized additive models can help us in such cases.

Summary

Let’s summarize our efforts up to this point.

  • Standard Linear Model

\[y\sim \mathcal{N}(\mu, \sigma^{2})\] \[\mu = b_{0}+b_{1}X_{1}\]

  • Polynomial Regression

\[y\sim \mathcal{N}(\mu, \sigma^{2})\] \[\mu = b_{0}+b_{1}X_{1}+b_{2}X^2\]

  • GLM formulation

\[y\sim \mathcal{N}(\mu, \sigma^{2})\] \[g(\mu) = b_{0}+b_{1}X_{1}+b_{2}X^2\]

  • GAM formulation

\[y\sim \mathcal{N}(\mu, \sigma^{2})\] \[g(\mu) = f(X)\]

Now that some of the mystery will hopefully have been removed, let’s take a look at GAMs in action.

Application Using R

Initial Examination

The data set has been constructed using average Science scores by country from the Programme for International Student Assessment (PISA) 2006, along with GNI per capita (Purchasing Power Parity, 2005 dollars), Educational Index, Health Index, and Human Development Index from UN data. The key variables are as follows:

  • Overall Science Score (average score for 15 year olds)
  • Interest in science
  • Support for scientific inquiry
  • Income Index
  • Health Index
  • Education Index
  • Human Development Index (composed of the Income index, Health Index, and Education Index)

The education component is measured by mean of years of schooling for adults aged 25 years and expected years of schooling for children of school entering age, the health index by life expectancy at birth, and the wealth component is based on the gross national income per capita. The HDI sets a minimum and a maximum for each dimension, and values indicate where each country stands in relation to these endpoints, expressed as a value between 0 and 1. More information on the HDI measures can be found here.

But even before we get too far, it would be good to know our options in the world of GAMs. Near the end of this document is a list of some packages to be aware of, and there is quite a bit of GAM functionality available within R, even for just plotting (ggplot2 has basic gam functionality for scatterplot smoothing; examples are further in the text). For our purposes we will use mgcv.

The first thing to do is get the data in and do some initial inspections.

pisa = read.csv('data/pisasci2006.csv')
psych::describe(pisa)[-1,1:9]  #univariate
         vars  n   mean    sd median trimmed   mad    min    max
Overall     2 57 473.14 54.58 489.00  477.45 48.93 322.00 563.00
Issues      3 57 469.91 53.93 489.00  474.28 41.51 321.00 555.00
Explain     4 57 475.02 54.02 490.00  478.85 47.44 334.00 566.00
Evidence    5 57 469.81 61.74 489.00  475.11 54.86 288.00 567.00
Interest    6 57 528.16 49.84 522.00  525.72 53.37 448.00 644.00
Support     7 57 512.18 26.08 512.00  512.15 25.20 447.00 569.00
Income      8 61   0.74  0.11   0.76    0.75  0.11   0.41   0.94
Health      9 61   0.89  0.07   0.89    0.89  0.08   0.72   0.99
Edu        10 59   0.80  0.11   0.81    0.80  0.12   0.54   0.99
HDI        11 59   0.81  0.09   0.82    0.81  0.09   0.58   0.94

The scatterplot matrix has quite a bit of information to spend some time with- univariate and bivariate density, as well as loess curves. For our purposes we will ignore the issue regarding the haves vs. have-nots on the science scale and save that for another day. Note the surprising negative correlation between interest and overall score for science, which might remind us of Simpson’s paradox, namely, that what occurs for the individual may not be the same for the group. One will also note that while there is a positive correlation between Income and Overall science scores, it flattens out after an initial rise. Linear fits might be difficult to swallow for human development sub-scales in general, but let’s take a closer look.

We can see again that linear fits aren’t going to do so well for some of these, though they might be a close approximation for interest in science and support for science. Now let’s run the smooth. By default ggplot2 will use a loess smoother for small data sets (i.e. < 1000 observations), but can use the gam function from mgcv as a smoother.
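As a brief sketch of the latter (assuming ggplot2 and mgcv are installed; the shrinkage basis used in the smooth formula here is just one reasonable choice):

library(ggplot2)
ggplot(pisa, aes(x=Income, y=Overall)) +
  geom_point() +
  geom_smooth(method='lm', se=FALSE, color='blue') +          # linear fit for comparison
  geom_smooth(method='gam', formula=y ~ s(x, bs='cs'))        # gam-based smooth via mgcv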

Often in these situations people will perform some standard transformation, such as a log, but it doesn’t help nearly as often as it is used. For example, in this case one can log the overall score, Income, or both, and a linear relation will still not be seen.

Single Predictor

Linear Fit

We will start with the simple situation of a single predictor. Let’s begin by using a typical linear regression to predict science scores by the Income index. We could use the standard R lm function, but I’ll leave that as an exercise for comparison. We can still do straightforward linear models with the gam function, and again it is important to note that the standard linear model can be seen as a special case of a GAM.

library(mgcv)
mod_lm <- gam(Overall ~ Income, data=pisa)
summary(mod_lm)

Family: gaussian 
Link function: identity 

Formula:
Overall ~ Income

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   204.32      35.37   5.777 4.32e-07 ***
Income        355.85      46.79   7.606 5.36e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


R-sq.(adj) =  0.518   Deviance explained = 52.7%
GCV = 1504.5  Scale est. = 1448.8    n = 54

What are we getting here? The same thing you get from a regular linear model, because you just ran one. However, there are a couple things to look at. The coefficient is statistically significant, but serves as a reminder that it is usually a good idea to scale predictor variables so that the effect is more meaningful. Here, moving one unit on Income is akin to going from broke to being the richest country. But in a more meaningful sense, if we moved from, say, .7 to .8, we’d expect an increase of about 35 points on the science score. We also see the deviance explained (for those more familiar with generalized linear models, this is calculated as (Dev\(_{Null}\)-Dev\(_{Residual}\))/Dev\(_{Null}\); one can verify this by running the same model via glm and using the corresponding values from the summary of the model object), which serves as a generalization of R-squared, and in this case, it actually is equivalent to the unadjusted R-squared. Likewise there is the familiar adjusted version of it to account for small sample size and model complexity. The scale estimate is the estimated residual variance, i.e. the residual sums of squares divided by the residual degrees of freedom. The GCV score we will save for when we run a GAM.
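As a quick sketch of that verification via glm (the calculation below should reproduce the deviance explained reported in the summary, up to rounding):

mod_glm = glm(Overall ~ Income, data=pisa)
# deviance explained = (null deviance - residual deviance)/null deviance
(mod_glm$null.deviance - mod_glm$deviance) / mod_glm$null.deviance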

GAM

Let’s now try some nonlinear approaches, keeping in mind that \(\mu=f(x)\). As a point of comparison, we can start by trying a standard polynomial regression, and it might do well enough (this example is pretty much straight from Wood (2006) with little modification). To begin we must consider a basis to use, a space that \(f\) is an element of. Doing so leads to choosing a set of basis functions \(b_j\), with parameters \(\beta_j\) that will be combined to produce \(f(x)\):

\[f(x)=\displaystyle\sum\limits_{j=1}^q b_{j}(x)\beta_{j}\]

To better explain by example, if we use a cubic polynomial, the basis is: \(b_1(x) = 1\), \(b_2(x)=x\), \(b_3(x)=x^2\), \(b_4(x)=x^3\), which leads to the following:

\[f(x) = \beta_1 + \beta_2x + \beta_3x^2 + \beta_4x^3\]

The following visualization allows us to see the effects in action. It is based on the results extracted from running such a model (see poly for how to fit a polynomial in R) and obtaining the coefficients. The first plot represents the intercept of 470.44, the second plot our \(b_2\) coefficient of 289.5 multiplied by Income, and so forth. The bottom plot shows the final fit \(f(x)\), i.e. the linear combination of the basis functions.
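A sketch of such a model using poly follows (note that poly does not accept missing values; the figure may have been constructed somewhat differently, so treat the exact coefficient values as illustrative):

d_poly   = na.omit(pisa[, c('Overall', 'Income')])    # poly() does not allow NAs
mod_poly = lm(Overall ~ poly(Income, 3, raw=TRUE), data=d_poly)
coef(mod_poly)   # intercept plus coefficients on Income, Income^2, Income^3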

At this point we have done nothing we couldn’t do in our regular regression approach, as polynomial regression has a long history of use in modeling. However, the take home message is that as we move to GAMs we are going about things in much the same fashion; we are simply changing the nature of the basis, and have a great deal more flexibility in choosing the form.

In the next figure I show the fit using a ‘by-hand’ cubic spline basis (see the Appendix and p. 126-127 in Wood (2006)).

A cubic spline is essentially a connection of multiple cubic polynomial regressions. We choose points of the predictor variable at which to create sections, and these points are referred to as knots. Separate cubic polynomials are fit at each section, and then joined at the knots to create a continuous curve. The above graph represents a cubic spline with 8 knots (ten including the endpoints) between the first and third quartiles. The red line uses the GAM functionality within ggplot2’s geom_smooth as a point of comparison.

Fitting the model

Let’s now fit an actual generalized additive model using the same cubic spline as our basis. We again use the gam function as before for basic model fitting, but now we are using the function s within the formula to denote the smooth terms. Within that function we also specify the type of smooth, though a default is available. I chose bs = "cr", denoting cubic regression splines, to keep consistent with our previous example.

mod_gam1 <- gam(Overall ~ s(Income, bs="cr"), data=pisa)
summary(mod_gam1)

Family: gaussian 
Link function: identity 

Formula:
Overall ~ s(Income, bs = "cr")

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  470.444      4.082   115.3   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
            edf Ref.df     F  p-value    
s(Income) 6.895  7.741 16.67 1.59e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =    0.7   Deviance explained = 73.9%
GCV = 1053.7  Scale est. = 899.67    n = 54

The first thing to note is that, aside from the smooth part, our model code is similar to what we’re used to with core R functions such as lm and glm. In the summary, we first see the distribution assumed, as well as the link function used, in this case normal and identity, respectively, which, to reiterate, had we had no smoothing, would result in a SLiM. After that we see that the output is separated into parametric and smooth, or nonparametric, parts (as an aside, the term nonparametric has at least two general uses in the statistical world; Wikipedia has a nice delineation). In this case, the only parametric component is the intercept, but it’s good to remember that you are not bound to smooth every effect of interest, and indeed, as we will discuss in more detail later, part of the process may involve refitting the model with terms that were found to be linear for the most part anyway. The smooth component of our model regarding a country’s income and its relationship with overall science score suggests it is statistically significant, but there are a couple of things in the model summary that would be unfamiliar.

We’ll start with the effective degrees of freedom, or edf (this part regarding edf should ring a bell for those who use the lme4 package for mixed models; all of this applies there too, and may provide more insight as to why they don’t even provide p-values as a default; see the Other Approaches section). In typical OLS regression, the model degrees of freedom is equivalent to the number of predictors/terms in the model. This is not so straightforward with a GAM due to the smoothing process and the penalized regression estimation procedure, something that will be discussed more later (in this example there are actually 9 terms associated with this smooth, but they are each ‘penalized’ to some extent and thus the edf does not equal 9). In this situation, we are still trying to minimize the residual sums of squares, but we also have a built-in penalty for ‘wiggliness’ of the fit, where in general we try to strike a balance between an undersmoothed fit and an oversmoothed fit. The default p-value for the test is based on the effective degrees of freedom and the rank \(r\) of the covariance matrix for the coefficients for a particular smooth, so here, conceptually, it is the p-value associated with the \(F(r, n-edf)\). However, there are still other issues to be concerned about, and ?summary.gam will provide your first step down that particular rabbit hole. For hypothesis testing an alternate edf is actually used, which is the other one provided there in the summary result (here it is noted Ref.df, but if, for example, the argument p.type = 5 is used, it will be labeled Est.Rank; also, there are four p-value types one can choose from; the full story of edf, p-values, and related is scattered throughout Wood’s text; see also ?anova.gam). At this point you might be thinking these p-values are a bit fuzzy, and you’d be right. The gist is, they aren’t to be used for harsh cutoffs, say, at an arbitrary .05 level (but then, standard p-values shouldn’t be used that way either), but if they are pretty low you can feel comfortable claiming statistical significance, which of course is the end all, be all, of the scientific endeavor- right?

The GCV, or generalized cross validation score, can be taken as an estimate of the mean square prediction error based on a leave-one-out cross validation estimation process. We estimate the model for all observations except \(i\), then note the squared residual predicting observation \(i\) from the model. Then we do this for all observations. However, the GCV score is an efficient measure of this concept that doesn’t actually require fitting all those models and overcomes other issues. (In this initial model the GCV can be found as: \[GCV = \frac{n*scaled\, est.}{(n-edf-{[n\,of\, parametric\, terms]})^{2}}\]) It is this score that is minimized by default when determining the specific nature of the smooth. On its own it doesn’t tell us much, but we can use it similar to AIC as a comparative measure to choose among different models, with lower being better.
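As a rough check of the idea (my own calculation, using the residual deviance and the total effective degrees of freedom, which for a Gaussian model such as this should come close to the GCV reported in the summary):

n   = length(mod_gam1$y)
rss = sum(residuals(mod_gam1)^2)            # residual deviance for a Gaussian GAM
n*rss / (n - sum(mod_gam1$edf))^2           # compare to the reported GCV
mod_gam1$gcv.ubre                           # the GCV score stored on the fitted object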

Graphical Display

One can get sense of the form of the fit by simply plotting the model object as follows:

## plot(mod_gam1)

Note that this will only plot the smooth terms in the model (a better alternative is the visreg package, which will plot all effects, and by default be nicer looking and easier to customize; you can simply feed the model to the visreg function, and it also has 2d and 3d capabilities like the corresponding mgcv functions). The intervals are Bayesian credible intervals based on the posterior predictive distribution. In this single predictor case one can also revisit the previous graph.
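As a minimal sketch of the visreg alternative mentioned in the note (assuming the package is installed):

library(visreg)
visreg(mod_gam1)   # plots all model effects, not just the smooth terms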

Model Comparison

Let us now compare our regular regression fit to the GAM model fit. The following shows how one can extract various measures of performance, and the subsequent table shows them gathered together.

AIC(mod_lm)
[1] 550.2449
summary(mod_lm)$sp.criterion
  GCV.Cp 
1504.496 
summary(mod_lm)$r.sq  #adjusted R squared
[1] 0.5175346

Do the same to extract those elements from the GAM; a sketch of the calls follows the table, and the display below makes for easy comparison.

         AIC      GCV    R2
LM    550.24  1504.50  0.52
GAM   529.81  1053.73  0.70
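The GAM values in the display can be obtained with the same accessors used above, applied to mod_gam1:

AIC(mod_gam1)
summary(mod_gam1)$sp.criterion   # GCV
summary(mod_gam1)$r.sq           # adjusted R-squared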

Comparing these various measures, it’s safe to conclude that the GAM fits better. We can also perform the familiar statistical test via the anova function we apply to other R model objects. As with the previous p-value issue, we can’t get too carried away, and technically one could have a model with even more terms but lower edf, which would be difficult to interpret (see ?anova.gam for more information). As it would be best to be conservative, we’ll proceed cautiously.

anova(mod_lm, mod_gam1, test="Chisq")
Analysis of Deviance Table

Model 1: Overall ~ Income
Model 2: Overall ~ s(Income, bs = "cr")
  Resid. Df Resid. Dev     Df Deviance  Pr(>Chi)    
1    52.000      75336                              
2    45.259      41479 6.7411    33857 2.778e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It would appear the anova results tell us what we have probably come to believe already, that incorporating nonlinear effects has improved the model considerably.

Multiple Predictors

Let’s now see what we can do with a more realistic case where we have added model complexity.

Linear Fit

We’ll start with the SLiM approach again, this time adding the Health and Education indices.

mod_lm2 <- gam(Overall ~ Income + Edu + Health, data=pisa)
summary(mod_lm2)

Family: gaussian 
Link function: identity 

Formula:
Overall ~ Income + Edu + Health

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   121.18      78.97   1.535   0.1314    
Income        182.32      85.27   2.138   0.0376 *  
Edu           234.11      54.78   4.274 9.06e-05 ***
Health         27.01     134.90   0.200   0.8421    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


R-sq.(adj) =  0.616   Deviance explained = 63.9%
GCV = 1212.3  Scale est. = 1119      n = 52

It appears we have statistical effects for Income and Education, but not for Health, and the adjusted R-squared suggests a notable amount of the variance is accounted for (note that the difference in sample sizes does not make this directly comparable to the first model). Let’s see about nonlinear effects.

GAM

As far as the generalized additive model goes, we can approach things in a similar manner as before. We will ignore the results of the linear model for now and look for nonlinear effects for each covariate.

The default smoother for s() is bs='tp', a thin plate regression spline.

mod_gam2 <- gam(Overall ~ s(Income) + s(Edu) + s(Health), data=pisa)
summary(mod_gam2)

Family: gaussian 
Link function: identity 

Formula:
Overall ~ s(Income) + s(Edu) + s(Health)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  471.154      2.772     170   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
            edf Ref.df     F  p-value    
s(Income) 7.593  8.415 8.826 1.33e-07 ***
s(Edu)    6.204  7.178 3.308  0.00733 ** 
s(Health) 1.000  1.000 2.736  0.10661    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.863   Deviance explained = 90.3%
GCV = 573.83  Scale est. = 399.5     n = 52

There are again a couple things to take note of. First, statistically speaking, we come to the same conclusion as the linear model regarding the individual effects. One should take particular note of the effect of Health index. The effective degrees of freedom with value 1 suggests that it has essentially been reduced to a simple linear effect. The following will update the model to explicitly model the effect as such, but as one can see, the results are identical.

mod_gam2B = update(mod_gam2, .~.-s(Health) + Health)
summary(mod_gam2B)

Family: gaussian 
Link function: identity 

Formula:
Overall ~ s(Income) + s(Edu) + Health

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    640.3      102.3   6.260 3.06e-07 ***
Health        -189.5      114.6  -1.654    0.107    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
            edf Ref.df     F  p-value    
s(Income) 7.593  8.415 8.826 1.33e-07 ***
s(Edu)    6.204  7.178 3.308  0.00733 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.863   Deviance explained = 90.3%
GCV = 573.83  Scale est. = 399.5     n = 52

We can also note that this model accounts for much of the variance in Overall science scores, with an adjusted R-squared of .86. In short, it looks like the living standards and educational resources of a country are associated with overall science scores, even if we don’t really need the Health index in the model.

Graphical Display

Now we examine the effects of interest visually via component plots (see e.g. John Fox’s texts and chapters regarding regression diagnostics, and his crPlots function in the car package).

Here we can see the effects of interest (one will notice that for the default gam plot in mgcv, the y-axis is on the scale of the linear predictor, but due to identifiability constraints, the smooths must sum to zero, and thus are presented in a mean-centered fashion), and again one might note the penalized-to-linear effect of Health. As before, we see the tapering off of Income’s effect at its highest level, and in addition, a kind of sweet spot for a positive effect of Education in the mid-range values, with a slight positive effect overall. Health, as noted, has been reduced to a linear, surprisingly negative effect, but again this is not statistically significant.

The following code demonstrates how to create data necessary for the plot, specifically for incomes, with the other predictors held at their mean.

library(dplyr)    # provides mutate and the pipe used below
library(ggplot2)

# Note that mod_gam2$model is the data that was used in the modeling process, 
# so it will have NAs removed.
testdata = data.frame(Income = seq(.4, 1, length = 100),
                      Edu = mean(mod_gam2$model$Edu),
                      Health = mean(mod_gam2$model$Health))
fits = predict(mod_gam2, newdata=testdata, type='response', se=T)
predicts = data.frame(testdata, fits) %>% 
  mutate(lower = fit - 1.96*se.fit,
         upper = fit + 1.96*se.fit)

plot_mod_gam2_response = ggplot(aes(x=Income, y=fit), data=predicts) +
  geom_ribbon(aes(ymin = lower, ymax=upper), fill='gray90') +
  geom_line(color='#1e90ff') +
  theme_trueMinimal()   # not a ggplot2 theme; theme_minimal() is a similar substitute

2d Smooths

This gives us a sense for one predictor, but let’s take a gander at Income and Education at the same time. Previously, we saw how to use the plot method for a GAM class object. There is another plotting function, vis.gam, that will give us a bit more to play with, and specifically to display 2d smooths. It has a corresponding version from visreg which will look a bit better by default, visreg2d. The following code will produce a contour plot with Income on the x axis, Education on the y, with values on the response scale given by the contours, and lighter color indicating higher values. The actual plot provided instead depicts a heatmap (I’m actually using the data from such plots and plotly to produce a more web-oriented version; for example, contour lines are unnecessary when you can get the values by hovering).

vis.gam(mod_gam2, type='response', plot.type='contour')
visreg2d(mod_gam2, xvar='Income', yvar='Edu', scale='response')

First and foremost the figure reflects the individual plots, and we can see that middling to high on Education and high on Income generally produces the highest scores. Conversely, being low on both the Education and Income indices are associated with poor Overall science scores. However, while interesting, these respective smooths were created separately of one another, and there is another way we might examine how these two effects work together in predicting the response.

So let’s take a look at another approach, continuing the focus on visual display. It may not be obvious at all, but one can utilize smooths of more than one variable, in effect, a smooth of the smooths of the variables that go into it. This is akin to an interaction in typical models (see the smooth ti for an ANOVA decomposition of such smooths into main effects and interaction). Let’s create a new model to play around with this feature. After fitting the model, I provide a visualization for comparison to the previous, as well as a 3D view one can rotate to their liking.

mod_gam3 <- gam(Overall ~ te(Income, Edu), data=pisa)
summary(mod_gam3)

Family: gaussian 
Link function: identity 

Formula:
Overall ~ te(Income, Edu)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  471.154      3.349   140.7   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                edf Ref.df     F p-value    
te(Income,Edu) 10.1  12.19 16.93  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =    0.8   Deviance explained =   84%
GCV = 741.42  Scale est. = 583.17    n = 52
## 
## vis.gam(mod_gam3, type='response', plot.type='persp',
##         phi=30, theta=30,n.grid=500, border=NA)
## visreg2d(mod_gam3, xvar='Income', yvar='Edu', scale='response')

In the above we are using a type of smooth called a tensor product smooth, and by smoothing the marginal smooths of Income and Education, we see a bit clearer story. As we might suspect, wealthy countries with more of an apparent educational infrastructure are going to score higher on the Overall science score. However, wealth alone does not necessarily guarantee higher science scores (note the dark bottom right corner on the contour plot, probably due to Qatar; refer again to the previous figure), though without at least moderate wealth hopes are fairly dim for a decent score.

One can also, for example, examine interactions between a smooth and a linear term \(f(x)z\), and in a similar vein of thought look at smooths at different levels of a grouping factor.
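A sketch of the corresponding mgcv syntax follows, using simulated data (x, z, and grp are made up for illustration); the by argument to s handles both the smooth-by-continuous (‘varying coefficient’) case and separate smooths per factor level:

set.seed(123)
d_by = data.frame(x   = runif(200),
                  z   = rnorm(200),
                  grp = factor(sample(c('a','b'), 200, replace=TRUE)))
d_by$y = sin(2*pi*d_by$x)*d_by$z + as.numeric(d_by$grp) + rnorm(200, sd=.3)

mod_vc = gam(y ~ s(x, by=z), data=d_by)          # effect of z varies smoothly with x, i.e. f(x)z
mod_fs = gam(y ~ grp + s(x, by=grp), data=d_by)  # a separate smooth of x for each level of grp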

Model Comparison

As before, we can examine indices such as GCV or perhaps adjusted R-squared, which both suggest our GAM performs considerably better. Statistically we can compare the two models with the anova function as before.

anova(mod_lm2, mod_gam2, test="Chisq")
Analysis of Deviance Table

Model 1: Overall ~ Income + Edu + Health
Model 2: Overall ~ s(Income) + s(Edu) + s(Health)
  Resid. Df Resid. Dev     Df Deviance  Pr(>Chi)    
1    48.000      53713                              
2    34.408      14463 13.592    39250 6.754e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Not that we couldn’t have assumed as such already, but now we have additional statistical evidence to suggest that incorporating nonlinear effects improves the model.

Other Issues

Estimation

As noted previously, estimation of GAMs in the mgcv package is conducted via a penalized likelihood approach. Conceptually this amounts to fitting the following model:

\[g(\mu) = X\beta + f(x_1) + f(x_2) ... f(x_p)\]

But note that each smooth has its own model matrix made up of the basis functions. So for each smooth covariate \(j\) we have:

\[f_j = \tilde{X}_j \tilde{\beta}_j\]

Given penalty matrices \(S_j\) of known coefficients, we can more formally note a penalized likelihood function:

\[l_p(\beta)=\displaystyle l(\beta) - \frac{1}{2}\sum_j\lambda_j \beta^\mathrm{T} S_j\beta\]

where \(l(\beta)\) is the usual GLM likelihood function, and \(\lambda_j\) are the smoothing parameters. The part of the function including \(\lambda\) penalizes curvature in the function, where \(\lambda\) establishes a trade-off between the goodness of fit and the smoothness, and such an approach will allow for less overfitting (\(\lambda\rightarrow\infty\) would result in a straight line estimate for \(f_j\), while \(\lambda = 0\) would allow any \(f\) that interpolates the data; see p. 128, Wood (2006)). Technically we could specify the smoothing parameters explicitly, and the Appendix has some ‘by-hand’ code taken directly from Wood (2006) with only slight modifications, where the smoothing parameters are chosen and compared.
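As a small sketch of what specifying a smoothing parameter explicitly can look like with gam itself (the value 100 is arbitrary, chosen only to illustrate heavier smoothing):

mod_sp_fixed = gam(Overall ~ s(Income, bs='cr'), data=pisa, sp=100)   # smoothing parameter fixed
mod_sp_auto  = gam(Overall ~ s(Income, bs='cr'), data=pisa)           # smoothing parameter estimated
c(fixed=sum(mod_sp_fixed$edf), auto=sum(mod_sp_auto$edf))             # compare total edf (wiggliness)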

Smoothing parameters, however, are in fact estimated rather than arbitrarily set, and this brings us back to the cross-validation procedure mentioned before. Smoothing parameters are selected which minimize the GCV score by default, though one has other options. Note that there are other approaches to estimation, such as backfitting, generalized smoothing splines, and Bayesian estimation.

Shrinkage & Variable Selection

Some smooths are such that no matter the smoothing parameter, there will always be non-zero coefficients for the basis functions. An extra penalty may be added such that if the smoothing parameter is large enough, the coefficients will shrink to zero, and some smoothing bases will have such alternative approaches available (see also the select argument to the gam function; in addition, in version 1.7-19, summary.gam and anova.gam had changes made to improve the p-value computation in penalized situations). In this manner one can assess whether a predictor is adding anything to the model, i.e. if its effective degrees of freedom is near zero, and perhaps use the approach as a variable selection technique.
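A brief sketch of the two routes just mentioned, a shrinkage basis and the select argument (either can shrink a term’s effective degrees of freedom toward zero):

mod_shrink = gam(Overall ~ s(Income, bs='ts') + s(Edu, bs='ts') + s(Health, bs='ts'), data=pisa)
mod_select = gam(Overall ~ s(Income) + s(Edu) + s(Health), data=pisa, select=TRUE)
summary(mod_select)   # terms penalized out entirely will show edf near zero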

Effective degrees of freedom again

If we define a matrix \(F\) that maps the unpenalized estimates of \(\beta\) to the penalized estimates such that

\[F = (X^T X + S)^{-1} X^T X\]

and note

\[\tilde{\beta} = (X^T X)^{-1} X^T y\] \[\hat{\beta} = F\tilde{\beta}\]

the diagonal elements of \(F\) are where the effective degrees of freedom for each covariate come from.

Choice of Smoothing Function

A number of smooths are available with the mgcv package, and one can learn more via the help file for smooth.terms (link). In our models we have used cubic regression splines and thin plate regression splines (TPRS), the latter being the default for a GAM in this package. As a brief summary, TPRS work well in general in terms of performance, otherwise have some particular advantages, and have a shrinkage alternative available. One should still feel free to play around, particularly when dealing with multiple smooths, where tensor product smooths would be better for covariates of different scales.

Diagnostics

We have some built-in abilities to examine whether there are any particular issues, and we can try it with our second GAM model.

## gam.check(mod_gam2, k.rep=1000)


Method: GCV   Optimizer: magic
Smoothing parameter selection converged after 21 iterations.
The RMS GCV score gradient at convergence was 2.499332e-05 .
The Hessian was positive definite.
Model rank =  28 / 28 

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

             k'   edf k-index p-value
s(Income) 9.000 7.593   1.258    0.96
s(Edu)    9.000 6.204   1.008    0.43
s(Health) 9.000 1.000   0.899    0.19

The plots are of the sort we’re used to from a typical regression setting, though it’s perhaps a bit difficult to make any grand conclusion based on such a small data set. One can inspect the quantile-quantile plot directly with qq.gam. The printed output, on the other hand, contains unfamiliar information, but is largely concerned with over-smoothing, and so has tests of whether the basis dimension for a smooth is too low. The p-values are based on simulation, so I bumped up the number with the additional argument. Guidelines are given in the output itself, and at least in this case it does not look like we have an issue. However, if there were a potential problem, it is suggested to double \(k\) (\(k\) can be set as an argument, e.g. s(var1, k=?)) and refit, and if the effective degrees of freedom increases quite a bit you would probably want to go with the updated model (I actually did this for Health, just for demonstration, and there was no change at all; it still reduced to a linear effect). Given the penalization process, the exact choice of \(k\) isn’t too big of a deal, but the defaults are arbitrary. You want to set it large enough to get at the true effect as best as possible, but in some cases computational efficiency will also be of concern. For example, in fairly complex models with many predictors, interactions, etc., it might be worthwhile to reduce \(k\) at the outset. The help on the function choose.k provides another approach to examining \(k\) based on the residuals from the model under consideration, and provides other useful information.

Concurvity

Concurvity refers to the generalization of collinearity to the GAM setting. In this case it refers to the situation where a smooth term can be approximated by some combination of the others. It largely results in the same problem, i.e. unstable estimates.

Wood provides three indices related to concurvity via the concurvity function, all range from 0 to 1 with 0 suggesting no problem, and 1 indicating that the function lies entirely in the space of one or more of the other smooth terms. See ?concurvity for details.

concurvity(mod_gam2)
                 para s(Income)    s(Edu) s(Health)
worst    2.698463e-24 0.9839101 0.9651345 0.9745969
observed 2.698463e-24 0.7967932 0.6115974 0.8682284
estimate 2.698463e-24 0.7610774 0.6470586 0.7971159

It should probably come as little surprise that we may have an issue here given the nature of the covariates. We can certainly make assumptions about wealthier nations’ education and health status, for example. What can we do? Collinearity does not lead to biased estimates, only less stable ones, and the inflated variance can potentially be overcome with more data. We certainly can’t do that here as we are dealing with country level data. That may also provide the solution, since there is nothing to generalize to, as we have the population of interest (all countries with PISA scores). In other data settings however, we may need to think hard about what to include in the model to avoid such redundancy, take dimension reduction steps beforehand, or use some other selection technique. For example, if it is the result of having several time- or spatially-covarying predictors, one might be able to retain only those that best capture that effect. However, the mgcv estimation procedures have been developed with such issues in mind, and one can feel fairly confident in the results even in the presence of concurvity. See Wood (2008).

Prediction

A previous example used the predict function on the data used to fit the model to obtain fitted values on the response scale. Typically we’d use this on new data. I do not cover it, because the functionality is the same as the predict.glm function in base R, and one can just refer to that. It is worth noting again that there is an option, type='lpmatrix', which will return the actual model matrix by which the coefficients must be pre-multiplied to get the values of the linear predictor at the supplied covariate values. This can be particularly useful towards opening the black box as one learns the technique.
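A small sketch of the lpmatrix idea follows: the returned matrix, post-multiplied by the coefficient vector, reproduces the linear predictor at the supplied covariate values (the particular Income values below are arbitrary).

newd = data.frame(Income = c(.6, .8),
                  Edu    = mean(mod_gam2$model$Edu),
                  Health = mean(mod_gam2$model$Health))
Xp = predict(mod_gam2, newdata=newd, type='lpmatrix')
drop(Xp %*% coef(mod_gam2))        # linear predictor 'by hand'
predict(mod_gam2, newdata=newd)    # same values from predict directly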

Model Comparison Revisited

We have talked about automated smoothing parameter and term selection, and in general potential models are selected based on estimation of the smoothing parameter. Using an extra penalty to allow coefficients to tend toward zero with the argument select=TRUE is an automatic way to go about it, where some terms could effectively drop out. Otherwise we could compare models’ GCV/AIC scores (GCV scores are not useful for comparing fits of different families; AIC is still okay though), and in general either of these would be viable approaches. Consider the following comparison:

mod_1d = gam(Overall ~ s(Income) + s(Edu), data=pisa)
mod_2d = gam(Overall ~ te(Income,Edu, bs="tp"), data=pisa)
AIC(mod_1d, mod_2d)
             df      AIC
mod_1d 15.59154 476.0631
mod_2d 13.24670 489.7046

In some cases we might prefer to be explicit in comparing models with and without particular terms, and we can go about comparing models as we would with a typical GLM analysis of deviance. We have demonstrated this previously using the anova.gam function, where we compared linear fits to a model with an additional smooth function. While we could construct a scenario that is identical to the GLM situation for a statistical comparison, it should be noted that in the usual situation the test is actually an approximation, though it should be close enough when it is appropriate in the first place. The following provides an example that would nest the main effects of Income and Education within the product smooth, i.e. sets their basis dimension and smoothing function to the defaults employed by the tensor product smooth.

mod_A = gam(Overall ~ s(Income, bs="cr", k=5) + s(Edu, bs="cr", k=5), data=pisa)
mod_B = gam(Overall ~ ti(Income, bs="cr", k=5) + ti(Edu, bs="cr", k=5) + ti(Income, Edu, bs='cr'), data=pisa)

anova(mod_A,mod_B, test="Chisq")
Analysis of Deviance Table

Model 1: Overall ~ s(Income, bs = "cr", k = 5) + s(Edu, bs = "cr", k = 5)
Model 2: Overall ~ ti(Income, bs = "cr", k = 5) + ti(Edu, bs = "cr", k = 5) + 
    ti(Income, Edu, bs = "cr")
  Resid. Df Resid. Dev     Df Deviance Pr(>Chi)   
1    46.059      36644                            
2    40.030      24998 6.0291    11645 0.003322 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Again though, we could have just used the summary output from the second model.

Instances where such a statistical test does not appear to be appropriate within the context of the package are when terms are able to be penalized to zero; in such a case p-values will be much too low. In addition, when comparing GAMs, sometimes the nesting of models would not be so clear when there are multiple smooths involved, and additional steps may need to be taken to make sure they are nested to use the statistical test. We must make sure that each smooth term in the null model has no more effective degrees of freedom than the same term in the alternative, otherwise it’s possible that the model with more terms can have lower effective degrees of freedom but better fit, rendering the test nonsensical. Wood (2006) suggests that if such model comparison is the ultimate goal, an unpenalized approach would be best to have much confidence in the p-values (an unpenalized fit can be achieved with the argument s(..., fx=T), although now one has to worry more about the k value used, as very high k will lead to low power; one might also examine the gss package for an ANOVA based approach to generalized smoothing splines).

Other Approaches

This section will discuss some ways to relate the generalized models above to other forms of nonlinear modeling approaches, some familiar and others perhaps less so. In addition, I will note some extensions to GAMs to consider.

Other Nonlinear Modeling Approaches

Known Form

It should be noted that one can place generalized additive models under a general heading of nonlinear models (a general form for linear and nonlinear models: \(y = f(X,\beta)+\epsilon\)) whose focus may be on transformations of the outcome (as with generalized linear models), the predictor variables (polynomial regression and GAMs), or both (GAMs), in addition to those whose effects are nonlinear in the parameters (for example, various theoretically motivated models in economics and ecology). A primary difference between GAMs and those models is that we don’t specify the functional form beforehand in GAMs.

In cases where the functional form may be known (a common model example is the logistic growth curve), one can use an approach such as nonlinear least squares, and there is inherent functionality within a standard R installation, such as the nls function. As is the usual case, such functionality is readily extendable to a great many other analytic situations, e.g. the gnm package for generalized nonlinear models or nlme for nonlinear mixed effects models.
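A minimal sketch of a known-form model via nls, fitting a logistic growth curve to simulated data (all values here are made up for illustration):

set.seed(1234)
x = seq(0, 10, length.out=100)
y = 5/(1 + exp(-(x - 5))) + rnorm(100, sd=.2)

nls(y ~ Asym/(1 + exp(-(x - xmid)/scal)),       # 3-parameter logistic curve
    start = list(Asym=4, xmid=4, scal=1))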

Response Transformation

It is common practice, perhaps too common, to manually transform the response and go about things with a typical linear model. While there might be specific reasons for doing so, the primary reason applied researchers seem to do so is to make the distribution ‘more normal’ so that regular regression methods can be applied, which stems from a misunderstanding of the assumptions of standard regression. As an example, a typical transformation is to take the log, particularly to tame ‘outliers’ or deal with heteroscedasticity.

While it was a convenience ‘back in the day’ because we didn’t have software or computing power to deal with a lot of data situations aptly, this is definitely not the case now. In many situations it would be better to, for example, conduct a generalized linear model with a log link, or perhaps assume a different distribution for the response directly (e.g. log- or skew-normal), and many tools allow researchers to do this with ease (a lot of ‘outliers’ tend to magically go away with an appropriate choice of distribution for the data generating process).

There are still cases where one might focus on response transformation, just not so one can overcome some particular nuisance in trying to fit a linear regression. An example might be in some forms of functional data analysis, where we are concerned with some function of the response that has been measured on many occasions over time.

The Black Box

Venables and Ripley (2002, Section 11.5) make an interesting delineation of nonlinear models into those that are less flexible but under full user control (fully parametric), and those that are black box techniques that are highly flexible and fully automatic: stuff goes in, stuff comes out, but we’re not privy to the specifics (one could probably make the case that most modeling is ‘black box’ for a great many researchers; for an excellent discussion of these different approaches to understanding data see Breiman (2001) and associated commentary; for some general packages outside of R that incorporate a more algorithmic approach to modeling, you might check out the scikit-learn module for Python).

Two examples of the latter that they provide are projection pursuit and neural net models, though a great many would fall into such a heading. Projection pursuit models are well suited to high dimensional data where dimension reduction is a concern. One may think of an example where one uses a technique such as principal components analysis on the predictor set and then examines smooth functions of \(M\) principal components.

In the case of neural net models (see the simple example to the right, ‘A Neural Net Model’), which have found a resurgence of interest of late, one can imagine a model where the input units (predictor variables) are weighted and summed to create hidden layer units, which are then transformed and put through the same process to create outputs. One can see projection pursuit models as an example where a smooth function is taken of the components which make up the hidden layer. Neural networks are highly flexible in that there can be any number of inputs, hidden layers, and outputs. However, such models are very explicit in the black box approach.

Such models are usually found among data mining/machine learning techniques, any number of which might be utilized in a number of disciplines. Other more algorithmic/black box approaches include k-nearest-neighbors, networks, random forests, support vector machines, and various tweaks or variations thereof including boosting, bagging, bragging and other alliterative shenanigans (see Hastie, Tibshirani, and Friedman (2009) for an overview of such approaches; a more recent and very applied version of that text is also available, and I have an R oriented intro here). As Venables and Ripley note, generalized additive models might be thought of as falling somewhere in between the fully parametric and interpretable models of linear regression and black box techniques. Indeed, there are algorithmic approaches which utilize GAMs as part of their approach.

Extensions

Other GAMs

Categorical variables

Note that just as generalized additive models are an extension of the generalized linear model, there are generalizations of the basic GAM beyond the settings described. In particular, random effects can be dealt with in this context as they can with linear and generalized linear models, and there is an interesting connection between smooths and random effects in general (Wood (2006) has a whole chapter devoted to the subject, though Fahrmeir et al. (2013) provides an even fuller treatment; I also have a document on mixed models that goes into it some; in addition, Wood provides a complementary package, gamm4, for adding random effects to GAMs via lme4). The allowance for categorical variables, i.e. factors, also works to allow separate smooths for each level of the factor. This amounts to an interaction of the sort we demonstrated with two continuous variables.
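As a small sketch of the random effects connection (simulated data; g is a made-up grouping factor), a random intercept can be included via the 're' smooth basis alongside an ordinary smooth:

set.seed(42)
d_re = data.frame(x = runif(300),
                  g = factor(sample(letters[1:10], 300, replace=TRUE)))
d_re$y = sin(2*pi*d_re$x) + rnorm(10, sd=.5)[d_re$g] + rnorm(300, sd=.3)

mod_re = gam(y ~ s(x) + s(g, bs='re'), data=d_re)   # s(g, bs='re') acts as a random intercept
summary(mod_re)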

Spatial Modeling

Additive models also provide a framework for dealing with spatially correlated data. As an example, a Markov Random Field smooth can be implemented for discrete spatial structure, such as countries or states (incidentally, this same approach would potentially be applicable to network data as well, e.g. social networks). For the continuous spatial domain, one can use the 2d smooth as was demonstrated previously, e.g. with latitude and longitude. See also the gaussian process paragraph.

Structured Additive Regression Models

The combination of random effects, spatial effects, etc. into the additive modeling framework is sometimes given a name of its own: structured additive regression models, or STARs. It is the penalized regression approach that makes this possible, where we have a design matrix that might include basis functions or an indicator matrix for groups, and an appropriate penalty matrix. With those two components we can specify the models in almost identical fashion, and combine such effects within a single model. This results in a very powerful regression modeling strategy. Furthermore, the penalized regression described has a connection to Bayesian regression with a normal, zero-mean prior for the coefficients, providing a path toward even more flexible modeling (the brms package serves as an easy to use starting point in R, and has functionality for using the mgcv package’s syntax style).

GAMLSS

Generalized additive models for location, scale, and shape (GAMLSS) allow for distributions beyond the exponential family (see Rigby and Stasinopoulos (2005) and the gamlss package, though mgcv has several options in this regard as well).

Other

In addition there are boosted, ensemble, and other machine learning approaches that apply GAMs as well (see the GAMens package for example; also, boosted models are GAMs). In short, there’s plenty to continue to explore once one gets the hang of generalized additive models.

Reproducing Kernel Hilbert Space

Generalized smoothing splines are built on the theory of reproducing kernel Hilbert spaces. I won’t pretend to be able to get into it here, but the idea is that some forms of additive models can be represented in the inner product form used in RKHS approaches (you might note that the function used in the spline example in the appendix is called rk). This connection lends itself to a tie between GAMs and, e.g., support vector machines and similar methods. For the interested, I have an example of RKHS regression here.

Gaussian Processes

We can also approach modeling by using generalizations of the Gaussian distribution. Where the Gaussian distribution is over vectors and defined by a mean vector and covariance matrix, a Gaussian Process is a distribution over functions. A function \(f\) is distributed as a Gaussian Process defined by a mean function \(m\) and covariance function \(k\). They have a close tie to RKHS methods, and generalize commonly used models in spatial modeling.


Gaussian Process \(y = \sin(x) + \text{noise}\). The left graph shows functions from the prior distribution, the right shows the posterior mean function, 95% confidence interval shaded, as well as specific draws from the posterior predictive mean distribution.

\[f\sim \mathcal{GP}(m,k)\]

In the Bayesian context we can define a prior distribution over functions and make draws from a posterior predictive distribution of \(f\) once we have observed data. The reader is encouraged to consult Rasmussen and Williams (2006) for the necessary detail. The text is free for download, and Rasmussen also provides a nice and brief intro. I also have some R code for demonstration based on his Matlab code, as well as Bayesian examples in Stan.

Suffice it to say in this context, it turns out that generalized additive models with a tensor product or cubic spline smooth are maximum a posteriori (MAP) estimates of gaussian processes with specific covariance functions and a zero mean function. In that sense one might segue nicely to gaussian processes if familiar with additive models. The mgcv package also allows one to use a spline form of gaussian process.
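A one-line sketch of that option (the gp smooth basis available in recent versions of mgcv, with its default covariance settings assumed):

mod_gp = gam(Overall ~ s(Income, bs='gp'), data=pisa)
summary(mod_gp)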

Conclusion

Generalized additive models are a conceptually straightforward tool that allows one to incorporate nonlinear predictor effects into their otherwise linear models. In addition, they allow one to keep within the linear and generalized linear modeling frameworks with which one is already familiar, while providing new avenues of model exploration and possibly improved results. As was demonstrated, it is easy enough with just a modicum of familiarity to pull them off within the R environment, and as such, it is hoped that this document provides the means to do so.

Appendix

R packages

The following is a non-exhaustive list of R packages which contain GAM functionality. Each is linked to the CRAN page for the package. Note also that several build upon the mgcv package used for this document.

brms: Allows for Bayesian GAMs via the Stan modeling language (very new implementation).

CausalGAM: This package implements various estimators for average treatment effects.

COZIGAM: Constrained and Unconstrained Zero-Inflated Generalized Additive Models.

CoxBoost: This package provides routines for fitting Cox models. See also cph in the rms package for nonlinear approaches in the survival context.

gam: Functions for fitting and working with generalized additive models.

GAMBoost: This package provides routines for fitting generalized linear and generalized additive models by likelihood-based boosting.

gamboostLSS: Boosting methods for fitting generalized additive models for location, scale, and shape (GAMLSS models).

GAMens: This package implements the GAMbag, GAMrsm and GAMens ensemble classifiers for binary classification.

gamlss: Generalized additive models for location, scale, and shape.

gamm4: Fit generalized additive mixed models via a version of mgcv’s gamm function.

gammSlice: Bayesian fitting and inference for generalized additive mixed models.

GMMBoost: Likelihood-based boosting for generalized mixed models.

gss: A comprehensive package for structural multivariate function estimation using smoothing splines.

mboost: Model-Based Boosting.

mgcv: Routines for GAMs and other generalized ridge regression with multiple smoothing parameter selection by GCV, REML or UBRE/AIC. Also GAMMs.

VGAM: Vector generalized linear and additive models, and associated models.

Penalized Estimation Example

Initial data set up and functions.

############################
### Wood by-hand example ###
############################

size = c(1.42,1.58,1.78,1.99,1.99,1.99,2.13,2.13,2.13,
         2.32,2.32,2.32,2.32,2.32,2.43,2.43,2.78,2.98,2.98)
wear = c(4.0,4.2,2.5,2.6,2.8,2.4,3.2,2.4,2.6,4.8,2.9,
         3.8,3.0,2.7,3.1,3.3,3.0,2.8,1.7)
x = size - min(size); x = x/max(x)   # rescale predictor to [0, 1]
d = data.frame(wear, x)

#cubic spline function
rk <- function(x,z) {
  ((z-0.5)^2 - 1/12)*((x-0.5)^2 - 1/12)/4 -
    ((abs(x-z)-0.5)^4-(abs(x-z)-0.5)^2/2 + 7/240) / 24
}

spl.X <- function(x,knots){
  q <- length(knots) + 2                # number of parameters
  n <- length(x)                        # number of observations
  X <- matrix(1, n, q)                  # initialized model matrix
  X[,2] <- x                            # set second column to x
  X[,3:q] <- outer(x, knots, FUN = rk)  # remaining to cubic spline
  X
}

spl.S <- function(knots) {
  q = length(knots) + 2
  S = matrix(0, q, q)                           # initialize matrix
  S[3:q, 3:q] = outer(knots, knots, FUN=rk)     # fill in non-zero part
  S
}

#matrix square root function
mat.sqrt <- function(S){
  d = eigen(S, symmetric=T)
  rS = d$vectors %*% diag(d$values^.5) %*% t(d$vectors)
  rS
}

#the fitting function
prs.fit <- function(y, x, knots, lambda) {
  q = length(knots) + 2       # dimension of basis
  n = length(x)               # number of observations
  Xa = rbind(spl.X(x, knots), mat.sqrt(spl.S(knots))*sqrt(lambda))  # augmented model matrix
  y[(n+1):(n+q)] = 0          # augment the data vector
  lm(y ~ Xa - 1)              # fit and return penalized regression spline
}

Example 1.

knots = 1:4/5
X = spl.X(x, knots)           # generate model matrix
mod.1 = lm(wear ~ X - 1)      # fit model
xp <- 0:100/100               # x values for prediction
Xp <- spl.X(xp, knots)        # prediction matrix
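If one wants to visualize this unpenalized fit, a quick base-R sketch is below (axis labels are just illustrative).

plot(x, wear, pch = 19, xlab = 'size (rescaled)', ylab = 'wear')
lines(xp, Xp %*% coef(mod.1))   # fitted curve over the prediction grid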

Example 2.

knots = 1:7/8

pisa2 = data.frame(x=xp)

for (i in c(.1, .01, .001, .0001, .00001, .000001)) {
  mod.2 = prs.fit(y=wear, x=x, knots=knots, lambda=i)  # fit penalized regression spline for this lambda
  Xp = spl.X(xp, knots)                                # matrix to map parameters to fitted values at xp
  pisa2[, paste('lambda = ', i, sep="")] = Xp %*% coef(mod.2)
}
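To compare the fits across the penalty values just stored, one could use something like the following base-R sketch.

matplot(pisa2$x, as.matrix(pisa2[, -1]), type = 'l', lty = 1,
        xlab = 'x', ylab = 'fitted wear')
legend('topright', legend = names(pisa2)[-1], col = 1:6, lty = 1, cex = .7)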

Basis functions for a cubic spline

We can take the same approach as with the polynomial example earlier in the document, and take a closer look at the basis functions. While we’re using the same by-hand approach here as above, you can inspect this in one of two ways with the mgcv package. One is to use the fit=FALSE argument, which will return the model matrix instead of a fitted gam object (doing so actually returns a list of many elements; the one named X is the model matrix). Secondly, after a model is run, you can use the predict function on any data with the argument type='lpmatrix' to return the matrix.
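A brief sketch of both routes follows, using the PISA example's variable names (Overall, Income, pisa) as assumed from earlier in the document.

library(mgcv)

# Route 1: fit = FALSE returns the pre-fit object; its X element is the model matrix
G = gam(Overall ~ s(Income, bs = 'cr'), data = pisa, fit = FALSE)
head(G$X)

# Route 2: fit the model, then request the linear predictor matrix
mod = gam(Overall ~ s(Income, bs = 'cr'), data = pisa)
head(predict(mod, type = 'lpmatrix'))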

The following is arranged in order of the Income values.

   Income    X1         X2         X3    X4 X5 X6 X7 X8 X9
1   0.407 0.007 0.00000000 0.00000000 0.000  0  0  0  0  0
2   0.484 0.084 0.01733333 0.00000000 0.000  0  0  0  0  0
3   0.552 0.152 0.08533333 0.01866667 0.000  0  0  0  0  0
4   0.566 0.166 0.09933333 0.03266667 0.000  0  0  0  0  0
5   0.597 0.197 0.13033333 0.06366667 0.000  0  0  0  0  0
6   0.603 0.203 0.13633333 0.06966667 0.003  0  0  0  0  0
7   0.616 0.216 0.14933333 0.08266667 0.016  0  0  0  0  0
8   0.637 0.237 0.17033333 0.10366667 0.037  0  0  0  0  0
9   0.642 0.242 0.17533333 0.10866667 0.042  0  0  0  0  0
10  0.647 0.247 0.18033333 0.11366667 0.047  0  0  0  0  0

Note that we have many more columns than we would normally fit with a polynomial regression; this is dealt with through the penalized approach we’ve discussed. The first basis column represents the linear form of Income, though on a different scale (the correlation between the two is perfect). Note also that as you move along Income and past the knot values, some basis functions will have zero values, as they don’t apply to that range of Income.

If we plot these against Income, it doesn’t look like much, but we can tell where the knots are.

However, if we plot Income against the basis multiplied by the estimated coefficients of the model, the data story begins to reveal itself.
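A hedged sketch of constructing such a plot from the lpmatrix, using the same assumed model object as in the earlier sketch:

B   = predict(mod, type = 'lpmatrix')          # basis functions evaluated at the data
Bc  = sweep(B, 2, coef(mod), FUN = "*")        # scale each column by its coefficient
ord = order(pisa$Income)
matplot(pisa$Income[ord], Bc[ord, ], type = 'l', lty = 1,
        xlab = 'Income', ylab = 'Basis functions x coefficients')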

Each trend line represents the fit at the knot points. For example, the intercept is the starting point, and from there the function is initially increasing. After the second knot point that initial increase starts to slow down. If it helps, think of a quadratic function: this is similar to a scenario where the coefficient for \(x\) is positive while the coefficient for \(x^2\) is negative and smaller in magnitude. We then see the trend turn strongly positive again, and so on. Here is the final plot. Note that since we used lm, this is an unpenalized fit, and so it overfits the data a bit at the end, where the values are few and extreme. This also serves as an insight into the benefits of the penalization process; using the gam function, our fit would not drop so much for Qatar.

Having seen the innards of the process of building additive models, hopefully this has instilled a little more insight and confidence in using the technique.

References

Breiman, Leo. 2001. “Statistical Modeling: The Two Cultures (with Comments and a Rejoinder by the Author).” Statistical Science 16 (3): 199–231. doi:10.1214/ss/1009213726.

Bybee, Rodger W., and Barry McCrae. 2009. Pisa Science 2006: Implications for Science Teachers and Teaching. NSTA Press.

Fahrmeir, Ludwig, Thomas Kneib, Stefan Lang, and Brian Marx. 2013. Regression: Models, Methods and Applications. Springer Science & Business Media.

Fox, John. 2000a. Nonparametric Simple Regression: Smoothing Scatterplots. SAGE.

———. 2000b. Multiple and Generalized Nonparametric Regression. SAGE.

Friedman, Jerome, Trevor Hastie, Robert Tibshirani, and others. 2000. “Additive Logistic Regression: A Statistical View of Boosting (with Discussion and a Rejoinder by the Authors).” The Annals of Statistics 28 (2). Institute of Mathematical Statistics: 337–407.

Hardin, James W., and Joseph M. Hilbe. 2012. Generalized Linear Models and Extensions, Third Edition. Stata Press.

Hastie, T.J., and R.J. Tibshirani. 1990. Generalized Additive Models. CRC Press.

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. 2nd ed. 2009. Corr. 3rd printing 5th Printing. Springer.

Rasmussen, Carl Edward, and Christopher K. I. Williams. 2006. Gaussian Processes for Machine Learning. Cambridge, Mass.: MIT Press.

Rigby, R. A., and D. M. Stasinopoulos. 2005. “Generalized Additive Models for Location, Scale and Shape.” Journal of the Royal Statistical Society: Series C (Applied Statistics) 54 (3). doi:10.1111/j.1467-9876.2005.00510.x.

Ruppert, David, Matt P. Wand, and Raymond J. Carroll. 2003. Semiparametric Regression. Cambridge University Press.

Venables, William N., and Brian D. Ripley. 2002. Modern Applied Statistics with S. 4th ed. New York: Springer.

Wasserman, Larry. 2006. All of Nonparametric Statistics. Springer.

Wood, S. N. 2006. Generalized Additive Models: An Introduction with R. Vol. 66. CRC Press.

Wood, Simon N. 2008. “Fast Stable Direct Fitting and Smoothness Selection for Generalized Additive Models.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70 (3). Wiley Online Library: 495–518.