02.3 Regression and Cross-Validation Workshop

Here we will consider two datasets that exhibit different structural features and therefore different implications for Cross-Validation.

0. Requirements

Its good practice to get the requirements right at the top. The following solution checks for the requirements, and then installs them if they are not present.

if(!require("ggplot2")) install.packages("ggplot2")
## Loading required package: ggplot2
if(!require("caret")) install.packages("caret")
## Loading required package: caret
## Loading required package: lattice

1. Data

1.1 Diamonds Dataset

The Diamonds Dataset is a classic dataset for over 50k diamonds, describing their price and features. You can read more about it in many places, for example Kaggle (though it did not originate there). Its built into the ggplot2 package in R and can be accessed directly:

library("ggplot2") # Ironically we're using this for the data, and not for the plotting!
head(diamonds)
## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

The ordinal features are “cut”, “color” and “clarity”, with “x” “y” and “z” describing physical dimensions and “depth” and “table” being further shape characteristics relating those physical dimensions (expressed as percentages).

There are some data issues which we do not deal with fully, but can partially address by removing any data with impossible spatial measurements:

d2=data.frame(diamonds)
numericcols=c(1,5:10)
catcols=2:4
## Perform a minimal data sanitization step
d2=d2[-which((d2$x==0)|(d2$y==0)|(d2$z==0)|(d2$y>15)|(d2$z>15)),]
plot(d2[,numericcols],pch=19,col="#00000033",cex=0.7)

The Diamonds dataset is feature rich and may be expected to have independence between observations.

1.2 Berkeley Earth Temperature Record

The Earth Temperature record shows the famous “hockey stick” of global warming, containing only the high-resolution data from 1850 onwards. It contains multiple timescales of both land and ocean global temperatures, and contains uncertainty.

Reference:

Rohde, R. A. and Hausfather, Z.: The Berkeley Earth Land/Ocean Temperature Record, Earth Syst. Sci. Data, 12, 3469-3479, https://doi.org/10.5194/essd-12-3469-2020, 2020.

Because this dataset is small, it can be accessed directly from the internet, with no need to download locally. The format is slightly confusing with separate blocks of content for land and ocean, which can be worked around by just reading the correct lines for the block we are interested in. These are the land temperatures measured with respect to an 1850 baseline:

#                 Monthly          Annual          Five-year        Ten-year        Twenty-year
# Year, Month,  Anomaly, Unc.,   Anomaly, Unc.,   Anomaly, Unc.,   Anomaly, Unc.,   Anomaly, Unc.ß
temperature=read.table("https://berkeley-earth-temperature.s3.us-west-1.amazonaws.com/Global/Land_and_Ocean_complete.txt",skip = 86,nrows = 1997)
colnames(temperature)=c("Year","Month","MA","MACI","AA","AACI","A5","A5CI","A10","A10CI","A20","A20CI")
temperature$Time=temperature$Year+(temperature$Month-1)/12
plot(temperature$Time,temperature$MA,xlab="Date",ylab="Monthly Anomaly")

Although there is considerable heterogeneity in the confidence in the monthly estimates, due to reduced data back in time and structural variation in the signal, we will not deal with that in this workshop. Here are the confidence interval sizes:

plot(temperature$Time,temperature$MACI,xlab="Date",ylab="Monthly Anomaly Confidence",ylim=c(0,0.55))

The organisation separated out “Year” and “Month” so we create a continuous time variable called “Time” combining them. Each average (Month, Year, 5 Year, 10 Year and 20 Year) comes with a corresponding confidence 95% interval.

In contrast to the Diamonds data, there are no covariates and no expectation of independence between the observations.

2. Exploratory analyses

As an initial exploration we will run some linear models using all of the data, to establish whether there is power and what broad signals are present.

2.1 Diamonds

We will try to predict price, because of course…!

We start with a regression using the continuous variables.

dlm2=lm(price~carat + depth + table +x+y+z,data=d2)
summary(dlm2)
## 
## Call:
## lm(formula = price ~ carat + depth + table + x + y + z, data = d2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23662.7   -617.6    -62.5    353.0  12746.3 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11376.602    817.326   13.92  < 2e-16 ***
## carat       10975.754     66.893  164.08  < 2e-16 ***
## depth         -52.232     12.408   -4.21 2.56e-05 ***
## table         -92.943      3.092  -30.06  < 2e-16 ***
## x           -2940.646    126.800  -23.19  < 2e-16 ***
## y            3101.430    128.251   24.18  < 2e-16 ***
## z           -2419.217    193.289  -12.52  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1486 on 53910 degrees of freedom
## Multiple R-squared:  0.8611, Adjusted R-squared:  0.8611 
## F-statistic: 5.572e+04 on 6 and 53910 DF,  p-value: < 2.2e-16

R reports the regression estimates and the p-value for each parameter marginally, as well as the F-statistic which quantifies whether the combined covariates have predicted power at all.

We now examine the complete regression of price against all other covariates.

dlm=lm(price~.,data=d2)
summary(dlm)
## 
## Call:
## lm(formula = price ~ ., data = d2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21477.3   -587.9   -183.6    371.2  10754.9 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -1543.593    649.998   -2.375  0.01756 *  
## carat       11547.957     51.682  223.445  < 2e-16 ***
## cut.L         549.420     22.439   24.485  < 2e-16 ***
## cut.Q        -249.730     18.197  -13.724  < 2e-16 ***
## cut.C          87.567     16.033    5.462 4.74e-08 ***
## cut^4         -40.815     12.494   -3.267  0.00109 ** 
## color.L     -1961.198     17.254 -113.664  < 2e-16 ***
## color.Q      -682.479     15.699  -43.472  < 2e-16 ***
## color.C      -165.063     14.648  -11.269  < 2e-16 ***
## color^4        44.055     13.457    3.274  0.00106 ** 
## color^5       -93.815     12.712   -7.380 1.60e-13 ***
## color^6       -46.697     11.555   -4.041 5.32e-05 ***
## clarity.L    4054.359     30.226  134.136  < 2e-16 ***
## clarity.Q   -1917.209     28.149  -68.109  < 2e-16 ***
## clarity.C     974.411     24.079   40.467  < 2e-16 ***
## clarity^4    -359.722     19.202  -18.733  < 2e-16 ***
## clarity^5     231.052     15.679   14.737  < 2e-16 ***
## clarity^6       3.976     13.642    0.291  0.77070    
## clarity^7      90.274     12.040    7.498 6.57e-14 ***
## depth          60.464      9.551    6.331 2.46e-10 ***
## table         -24.766      2.900   -8.539  < 2e-16 ***
## x           -1510.527    103.072  -14.655  < 2e-16 ***
## y            1702.262    105.423   16.147  < 2e-16 ***
## z           -2175.634    146.594  -14.841  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1124 on 53893 degrees of freedom
## Multiple R-squared:  0.9206, Adjusted R-squared:  0.9205 
## F-statistic: 2.716e+04 on 23 and 53893 DF,  p-value: < 2.2e-16

To interpret this, it is essential to understand how categories are treated.

  • R implicitly one-hot encodes categorical variables, i.e. constructs a column for each category, containing value 1 if the observation took that category and 0 otherwise.
  • However, these are ordered factors (check using class(d2$clarity)), which get special treatment (see https://library.virginia.edu/data/articles/understanding-ordered-factors-in-a-linear-model), using a linear, quadratic, cubic, etc model on the order. There are always n-1 powers included for n categories.

Here is the order for the cut variable:

table(d2$cut)
## 
##      Fair      Good Very Good   Premium     Ideal 
##      1609      4902     12080     13779     21547

See Question B2W1 in Block 2 Portfolio

about the interpretation of these results.

2.2 Temperatures

We first check whether there is a significant change over time using a standard regression:

templm=lm(MA~Time,data=temperature)
summary(templm)
## 
## Call:
## lm(formula = MA ~ Time, data = temperature)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.65771 -0.15419 -0.00897  0.14259  0.96768 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12.342548   0.197309  -62.55   <2e-16 ***
## Time          0.006358   0.000102   62.31   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.219 on 1995 degrees of freedom
## Multiple R-squared:  0.6606, Adjusted R-squared:  0.6604 
## F-statistic:  3883 on 1 and 1995 DF,  p-value: < 2.2e-16

Of course the answer is yes. However, the data were clearly non-linear. Lets now fit a polynomial to the data:

head(temperature)
##   Year Month     MA  MACI     AA  AACI  A5 A5CI A10 A10CI A20 A20CI     Time
## 1 1850     1 -0.790 0.434    NaN   NaN NaN  NaN NaN   NaN NaN   NaN 1850.000
## 2 1850     2 -0.236 0.479    NaN   NaN NaN  NaN NaN   NaN NaN   NaN 1850.083
## 3 1850     3 -0.396 0.397    NaN   NaN NaN  NaN NaN   NaN NaN   NaN 1850.167
## 4 1850     4 -0.629 0.318    NaN   NaN NaN  NaN NaN   NaN NaN   NaN 1850.250
## 5 1850     5 -0.662 0.268    NaN   NaN NaN  NaN NaN   NaN NaN   NaN 1850.333
## 6 1850     6 -0.387 0.289 -0.475 0.175 NaN  NaN NaN   NaN NaN   NaN 1850.417
temppoly=lm(MA~poly(Time,5),data=temperature)
summary(temppoly)
## 
## Call:
## lm(formula = MA ~ poly(Time, 5), data = temperature)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.70421 -0.11372  0.00214  0.11066  0.88718 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -0.050973   0.003855 -13.224  < 2e-16 ***
## poly(Time, 5)1 13.650023   0.172250  79.246  < 2e-16 ***
## poly(Time, 5)2  5.737827   0.172250  33.311  < 2e-16 ***
## poly(Time, 5)3  1.454252   0.172250   8.443  < 2e-16 ***
## poly(Time, 5)4  0.457944   0.172250   2.659  0.00791 ** 
## poly(Time, 5)5  1.185041   0.172250   6.880    8e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1722 on 1991 degrees of freedom
## Multiple R-squared:  0.7906, Adjusted R-squared:   0.79 
## F-statistic:  1503 on 5 and 1991 DF,  p-value: < 2.2e-16

Perhaps unsurprisingly, all terms are significant, though the symmetric 4-th degree term has lower significance.

templmpred <- predict(templm,temperature)
temppolypred <- predict(temppoly,temperature)
plot(temperature$Time,temperature$MA,xlab="Date",ylab="Monthly Anomaly",pch=19,cex=0.5,col="grey")
lines(temperature$Time,templmpred,col="red",lwd=2)
lines(temperature$Time,temppolypred,col="blue",lwd=2)
legend("topleft",legend=c("Data","Predicted (linear)","Predicted (polynomial)"),lty=c(NA,1,1),pch=c(19,NA,NA),col=c("grey","red","blue"),text.col=c("grey","red","blue"))

See Question B2W2 in Block 2 Portfolio

about the interpretation of these results.

3. Out of sample prediction

We will use the library “caret” for cross-validation training and control. This is not strictly necessary, but provides some helpful features in the computation of AIC and performing stepwise regression to search model space.

library(caret)

3.1 Diamonds

We will perform the following cross-validation tests on a very small subset of the diamonds dataset. You are encouraged to repeat these analyses on the full dataset. Before you do, think about what you predict will change when we have 50k samples instead of 2k?

First construct a test/train split:

set.seed(2)
## Downsample the data for computational convenience
mysamples=sample(dim(d2)[1],2000)
smalld2=d2[mysamples,]
## Make a test/train split
s=createDataPartition(1:dim(smalld2)[1],p=0.8,list=FALSE)
d2train=smalld2[s,]
d2test=smalld2[-s,]

Now we perform training. The Caret package distinguishes between three classes of data:

  • Training
  • Validation
  • Testing

To do this efficiently, it uses K-fold CV to train on K-1 folds and validate on 1 fold of the training data. It can loop over all choices of validation fold, increasing the compute K-times but reducing sampling variability. The Testing data is then completely unpolluted by the training step.

Validation data is only needed if there is some hyper parameter to be fit, but it also can be of use in regression for model (i.e. variable) selection.

Here is how we can train a linear and a quadratic model this way:

## Learn a model on the training data, and use it to predict the test data
## Definition of a linear model we will refer to as **lin**:
modelcv <- train(price ~ ., data = d2train, method = "lm",trControl=trainControl(method = "cv",number=5))
## Definition of an interaction model we will refer to as **int**:
modelcvint <- train(price ~ .^2, data = d2train, method = "lm",trControl=trainControl(method = "cv",number=5))
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases

## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases

## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases

## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases

## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases

We can then compute the \(R^2\) between the observed and predicted values for both the test and train data:

comparison=c(testlin=R2(pred = predict(modelcv,d2test),obs = d2test$price),
             trainlin=R2(pred = predict(modelcv,d2train),obs = d2train$price),
             testint=R2(pred = predict(modelcvint,d2test),obs = d2test$price),
             trainint=R2(pred = predict(modelcvint,d2train),obs = d2train$price)
      )
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
print(comparison)
##   testlin  trainlin   testint  trainint 
## 0.9201718 0.9253244 0.4455863 0.9805768

We can now go on to perform step-wise regression to minimise AIC (setting trace=FALSE to suppress the details of the steps):

modelcvstep <- train(price~.,
                     d2train,
                     method="lmStepAIC", # Step wise AIC
                     direction="both", # Forward-backward stepwise selection
                     trControl=trainControl(method = "cv",number=5),trace=FALSE)
summary(modelcvstep)
## 
## Call:
## lm(formula = .outcome ~ carat + cut.L + cut.Q + cut.C + color.L + 
##     color.Q + color.C + `color^5` + clarity.L + clarity.Q + clarity.C + 
##     `clarity^5` + `clarity^7` + depth + x + y + z, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6245.7  -542.0  -174.9   336.4  9645.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4643.31    3186.40  -1.457 0.145252    
## carat       12618.18     309.18  40.812  < 2e-16 ***
## cut.L         650.30     123.07   5.284 1.44e-07 ***
## cut.Q        -259.05     102.80  -2.520 0.011837 *  
## cut.C         235.37      85.59   2.750 0.006027 ** 
## color.L     -2112.73      95.55 -22.110  < 2e-16 ***
## color.Q      -776.86      84.93  -9.147  < 2e-16 ***
## color.C      -304.72      78.72  -3.871 0.000113 ***
## `color^5`    -157.95      68.34  -2.311 0.020950 *  
## clarity.L    3821.46     161.19  23.708  < 2e-16 ***
## clarity.Q   -1636.05     126.68 -12.915  < 2e-16 ***
## clarity.C     895.36     134.29   6.667 3.59e-11 ***
## `clarity^5`   307.10      85.34   3.599 0.000330 ***
## `clarity^7`   111.18      63.89   1.740 0.082021 .  
## depth         110.98      52.43   2.117 0.034446 *  
## x           -1536.53     620.60  -2.476 0.013394 *  
## y            2113.26     637.64   3.314 0.000940 ***
## z           -3447.03     851.02  -4.050 5.36e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1066 on 1582 degrees of freedom
## Multiple R-squared:  0.9252, Adjusted R-squared:  0.9244 
## F-statistic:  1151 on 17 and 1582 DF,  p-value: < 2.2e-16

Running step-wise regression for all quadratic terms is a lot slower. Here we restrict interactions to only quantitative variables which reduces the search space enough to make the inference tolerable:

## Definition of a model we will refer to as **intstep**:
modelcvintstep <- train(price~color + clarity + (depth +table+x+y+z)^2,
                     d2train,
                     method="lmStepAIC", # Step wise AIC
                     direction="both", # Forward-backward stepwise selection
                     trControl=trainControl(method = "cv",number=5),trace=FALSE)
summary(modelcvintstep)
## 
## Call:
## lm(formula = .outcome ~ color.L + color.Q + color.C + `color^5` + 
##     clarity.L + clarity.Q + clarity.C + `clarity^5` + `clarity^7` + 
##     depth + table + x + `depth:table` + `depth:x` + `depth:z` + 
##     `table:y` + `table:z` + `x:y` + `x:z` + `y:z`, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4622.4  -527.0  -163.5   336.2  9323.6 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -85777.123  27516.509  -3.117 0.001858 ** 
## color.L        -2086.457     92.211 -22.627  < 2e-16 ***
## color.Q         -790.498     82.180  -9.619  < 2e-16 ***
## color.C         -298.028     75.959  -3.924 9.10e-05 ***
## `color^5`       -195.818     65.926  -2.970 0.003020 ** 
## clarity.L       3799.973    155.106  24.499  < 2e-16 ***
## clarity.Q      -1521.154    123.194 -12.348  < 2e-16 ***
## clarity.C        884.626    130.106   6.799 1.49e-11 ***
## `clarity^5`      305.176     82.395   3.704 0.000220 ***
## `clarity^7`      118.552     61.667   1.922 0.054730 .  
## depth           1396.958    434.402   3.216 0.001327 ** 
## table           2971.518    511.833   5.806 7.74e-09 ***
## x             -12835.131   5034.996  -2.549 0.010891 *  
## `depth:table`    -43.991      8.319  -5.288 1.41e-07 ***
## `depth:x`        310.184     70.298   4.412 1.09e-05 ***
## `depth:z`       -300.541     79.442  -3.783 0.000161 ***
## `table:y`       -581.260     67.881  -8.563  < 2e-16 ***
## `table:z`        869.942    112.814   7.711 2.19e-14 ***
## `x:y`           3256.282    799.134   4.075 4.83e-05 ***
## `x:z`          -7655.040   1023.143  -7.482 1.21e-13 ***
## `y:z`           3926.503    755.665   5.196 2.30e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1029 on 1579 degrees of freedom
## Multiple R-squared:  0.9304, Adjusted R-squared:  0.9295 
## F-statistic:  1056 on 20 and 1579 DF,  p-value: < 2.2e-16

We can now compare all of our predictions:

comparisonall= c(comparison,
  teststep=R2(pred = predict(modelcvstep,d2test),obs = d2test$price),
  trainstep=R2(pred = predict(modelcvstep,d2train),obs = d2train$price),  
  testintstep=R2(pred = predict(modelcvintstep,d2test),obs = d2test$price),
  trainintstep=R2(pred = predict(modelcvintstep,d2train),obs = d2train$price)    
)
print(comparisonall)
##      testlin     trainlin      testint     trainint     teststep    trainstep 
##    0.9201718    0.9253244    0.4455863    0.9805768    0.9201460    0.9251857 
##  testintstep trainintstep 
##    0.9247944    0.9304239

See Question B2W3 in Block 2 Portfolio

about the interpretation of these results.

3.2 Temperatures Cross Validation

We now repeat the analysis with the temperatures data. Temperature already has 1997 rows so there is no need to downsample to have a comparable sized dataset.

set.seed(1)
## Downsample the data for computational convenience
## Make a test/train split
temperature2=na.omit(temperature[,c("Time","MA")])
temps=createDataPartition(1:dim(temperature2)[1],p=0.8,list=FALSE)
temptrain=temperature2[temps,]
temptest=temperature2[-temps,]

Now we can train models exactly as above:

## Learn a model on the training data, and use it to predict the test data
tempmodelcv <- train(MA ~ Time, data = temptrain, method = "lm",trControl=trainControl(method = "cv",number=5))
tempmodelcvint <- train(MA ~ poly(Time,5), data = temptrain, method = "lm",trControl=trainControl(method = "cv",number=5))

And compute out-of-sample \(R^2\):

tempcompare=c(testlin=R2(pred = predict(tempmodelcv,temptest),obs = temptest$MA),
             trainlin=R2(pred = predict(tempmodelcv,temptrain),obs = temptrain$MA),
             testint=R2(pred = predict(tempmodelcvint,temptest),obs = temptest$MA),
             trainint=R2(pred = predict(tempmodelcvint,temptrain),obs = temptrain$MA)
      )
tempcompare
##   testlin  trainlin   testint  trainint 
## 0.7138730 0.6467809 0.8054908 0.7865215

And of course an advantage of a 1d problem is that we can visualise it. This plot is for the test data:

temptestlmpred <- predict(tempmodelcv,temptest)
temptestpolypred <- predict(tempmodelcvint,temptest)
plot(temptest$Time,temptest$MA,xlab="Date",ylab="Monthly Anomaly",pch=".")
lines(temptest$Time,temptestlmpred,col="red",lwd=2)
lines(temptest$Time,temptestpolypred,col="blue",lwd=2)
legend("topleft",legend=c("Data","Predicted (linear)","Predicted (polynomial)"),lty=c(NA,1,1),pch=c(".",NA,NA),col=c("black","red","blue"),text.col=c("black","red","blue"))

See Question B2W4 in Block 2 Portfolio

about the interpretation of these results.