Here we will consider two datasets that exhibit different structural features and therefore different implications for Cross-Validation.
Its good practice to get the requirements right at the top. The following solution checks for the requirements, and then installs them if they are not present.
if(!require("ggplot2")) install.packages("ggplot2")
## Loading required package: ggplot2
if(!require("caret")) install.packages("caret")
## Loading required package: caret
## Loading required package: lattice
Q: How important are version numbers for requirements? How would you investigate solutions to force R to use a particular version of each package?
The Diamonds Dataset is a classic dataset for over 50k diamonds,
describing their price and features. You can read more about it in many
places, for example Kaggle
(though it did not originate there). Its built into the
ggplot2
package in R and can be accessed directly:
library("ggplot2") # Ironically we're using this for the data, and not for the plotting!
head(diamonds)
## # A tibble: 6 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
The ordinal features are “cut”, “color” and “clarity”, with “x” “y” and “z” describing physical dimensions and “depth” and “table” being further shape characteristics relating those physical dimensions (expressed as percentages).
There are some data issues which we do not deal with fully, but can partially address by removing any data with impossible spatial measurements:
d2=data.frame(diamonds)
numericcols=c(1,5:10)
catcols=2:4
## Perform a minimal data sanitization step
d2=d2[-which((d2$x==0)|(d2$y==0)|(d2$z==0)|(d2$y>15)|(d2$z>15)),]
plot(d2[,numericcols],pch=19,col="#00000033",cex=0.7)
The Diamonds dataset is feature rich and may be expected to have independence between observations.
Q: How do we come by this information? How would you know what
data massaging
to do on another dataset?
The Earth Temperature record shows the famous “hockey stick” of global warming, containing only the high-resolution data from 1850 onwards. It contains multiple timescales of both land and ocean global temperatures, and contains uncertainty.
Reference:
Rohde, R. A. and Hausfather, Z.: The Berkeley Earth Land/Ocean Temperature Record, Earth Syst. Sci. Data, 12, 3469-3479, https://doi.org/10.5194/essd-12-3469-2020, 2020.
Because this dataset is small, it can be accessed directly from the internet, with no need to download locally. The format is slightly confusing with separate blocks of content for land and ocean, which can be worked around by just reading the correct lines for the block we are interested in. These are the land temperatures measured with respect to an 1850 baseline:
# Monthly Annual Five-year Ten-year Twenty-year
# Year, Month, Anomaly, Unc., Anomaly, Unc., Anomaly, Unc., Anomaly, Unc., Anomaly, Unc.ß
temperature=read.table("https://berkeley-earth-temperature.s3.us-west-1.amazonaws.com/Global/Land_and_Ocean_complete.txt",skip = 86,nrows = 1997)
colnames(temperature)=c("Year","Month","MA","MACI","AA","AACI","A5","A5CI","A10","A10CI","A20","A20CI")
temperature$Time=temperature$Year+(temperature$Month-1)/12
plot(temperature$Time,temperature$MA,xlab="Date",ylab="Monthly Anomaly")
Although there is considerable heterogeneity in the confidence in the monthly estimates, due to reduced data back in time and structural variation in the signal, we will not deal with that in this workshop. Here are the confidence interval sizes:
plot(temperature$Time,temperature$MACI,xlab="Date",ylab="Monthly Anomaly Confidence",ylim=c(0,0.55))
The organisation separated out “Year” and “Month” so we create a continuous time variable called “Time” combining them. Each average (Month, Year, 5 Year, 10 Year and 20 Year) comes with a corresponding confidence 95% interval.
In contrast to the Diamonds data, there are no covariates and no expectation of independence between the observations.
As an initial exploration we will run some linear models using all of the data, to establish whether there is power and what broad signals are present.
We will try to predict price, because of course…!
We start with a regression using the continuous variables.
dlm2=lm(price~carat + depth + table +x+y+z,data=d2)
summary(dlm2)
##
## Call:
## lm(formula = price ~ carat + depth + table + x + y + z, data = d2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23662.7 -617.6 -62.5 353.0 12746.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11376.602 817.326 13.92 < 2e-16 ***
## carat 10975.754 66.893 164.08 < 2e-16 ***
## depth -52.232 12.408 -4.21 2.56e-05 ***
## table -92.943 3.092 -30.06 < 2e-16 ***
## x -2940.646 126.800 -23.19 < 2e-16 ***
## y 3101.430 128.251 24.18 < 2e-16 ***
## z -2419.217 193.289 -12.52 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1486 on 53910 degrees of freedom
## Multiple R-squared: 0.8611, Adjusted R-squared: 0.8611
## F-statistic: 5.572e+04 on 6 and 53910 DF, p-value: < 2.2e-16
R reports the regression estimates and the p-value for each parameter marginally, as well as the F-statistic which quantifies whether the combined covariates have predicted power at all.
Q: What can we conclude from such an analysis?
We now examine the complete regression of price against all other covariates.
dlm=lm(price~.,data=d2)
summary(dlm)
##
## Call:
## lm(formula = price ~ ., data = d2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21477.3 -587.9 -183.6 371.2 10754.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1543.593 649.998 -2.375 0.01756 *
## carat 11547.957 51.682 223.445 < 2e-16 ***
## cut.L 549.420 22.439 24.485 < 2e-16 ***
## cut.Q -249.730 18.197 -13.724 < 2e-16 ***
## cut.C 87.567 16.033 5.462 4.74e-08 ***
## cut^4 -40.815 12.494 -3.267 0.00109 **
## color.L -1961.198 17.254 -113.664 < 2e-16 ***
## color.Q -682.479 15.699 -43.472 < 2e-16 ***
## color.C -165.063 14.648 -11.269 < 2e-16 ***
## color^4 44.055 13.457 3.274 0.00106 **
## color^5 -93.815 12.712 -7.380 1.60e-13 ***
## color^6 -46.697 11.555 -4.041 5.32e-05 ***
## clarity.L 4054.359 30.226 134.136 < 2e-16 ***
## clarity.Q -1917.209 28.149 -68.109 < 2e-16 ***
## clarity.C 974.411 24.079 40.467 < 2e-16 ***
## clarity^4 -359.722 19.202 -18.733 < 2e-16 ***
## clarity^5 231.052 15.679 14.737 < 2e-16 ***
## clarity^6 3.976 13.642 0.291 0.77070
## clarity^7 90.274 12.040 7.498 6.57e-14 ***
## depth 60.464 9.551 6.331 2.46e-10 ***
## table -24.766 2.900 -8.539 < 2e-16 ***
## x -1510.527 103.072 -14.655 < 2e-16 ***
## y 1702.262 105.423 16.147 < 2e-16 ***
## z -2175.634 146.594 -14.841 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1124 on 53893 degrees of freedom
## Multiple R-squared: 0.9206, Adjusted R-squared: 0.9205
## F-statistic: 2.716e+04 on 23 and 53893 DF, p-value: < 2.2e-16
To interpret this, it is essential to understand how categories are treated.
class(d2$clarity)
), which get special treatment (see https://library.virginia.edu/data/articles/understanding-ordered-factors-in-a-linear-model),
using a linear, quadratic, cubic, etc model
on the order. There are always n-1 powers included for
n categories.Here is the order for the cut variable:
table(d2$cut)
##
## Fair Good Very Good Premium Ideal
## 1609 4902 12080 13779 21547
Q: How would we force cut
to be treated as an ordered
factor? What problems might this create?
about the interpretation of these results.
We first check whether there is a significant change over time using a standard regression:
templm=lm(MA~Time,data=temperature)
summary(templm)
##
## Call:
## lm(formula = MA ~ Time, data = temperature)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6564 -0.1533 -0.0098 0.1429 0.9569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.214e+01 1.976e-01 -61.42 <2e-16 ***
## Time 6.254e-03 1.022e-04 61.21 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2194 on 1995 degrees of freedom
## Multiple R-squared: 0.6525, Adjusted R-squared: 0.6523
## F-statistic: 3746 on 1 and 1995 DF, p-value: < 2.2e-16
Q: Is the association statistically significant? Is it practically significant?
The data were clearly non-linear. Lets now fit a polynomial to the data:
head(temperature)
## Year Month MA MACI AA AACI A5 A5CI A10 A10CI A20 A20CI Time
## 1 1850 1 -0.788 0.390 NaN NaN NaN NaN NaN NaN NaN NaN 1850.000
## 2 1850 2 -0.240 0.492 NaN NaN NaN NaN NaN NaN NaN NaN 1850.083
## 3 1850 3 -0.400 0.361 NaN NaN NaN NaN NaN NaN NaN NaN 1850.167
## 4 1850 4 -0.629 0.296 NaN NaN NaN NaN NaN NaN NaN NaN 1850.250
## 5 1850 5 -0.660 0.331 NaN NaN NaN NaN NaN NaN NaN NaN 1850.333
## 6 1850 6 -0.386 0.286 -0.473 0.173 NaN NaN NaN NaN NaN NaN 1850.417
temppoly=lm(MA~poly(Time,5),data=temperature)
summary(temppoly)
##
## Call:
## lm(formula = MA ~ poly(Time, 5), data = temperature)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.71972 -0.11430 0.00165 0.11381 0.87303
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.044925 0.003879 -11.583 < 2e-16 ***
## poly(Time, 5)1 13.427010 0.173330 77.465 < 2e-16 ***
## poly(Time, 5)2 5.657368 0.173330 32.639 < 2e-16 ***
## poly(Time, 5)3 1.571716 0.173330 9.068 < 2e-16 ***
## poly(Time, 5)4 0.423770 0.173330 2.445 0.0146 *
## poly(Time, 5)5 1.237918 0.173330 7.142 1.28e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1733 on 1991 degrees of freedom
## Multiple R-squared: 0.7835, Adjusted R-squared: 0.783
## F-statistic: 1441 on 5 and 1991 DF, p-value: < 2.2e-16
Q: The symmetric 4-th degree term has lower significance. Does this tell us anything of value?
templmpred <- predict(templm,temperature)
temppolypred <- predict(temppoly,temperature)
plot(temperature$Time,temperature$MA,xlab="Date",ylab="Monthly Anomaly",pch=19,cex=0.5,col="grey")
lines(temperature$Time,templmpred,col="red",lwd=2)
lines(temperature$Time,temppolypred,col="blue",lwd=2)
legend("topleft",legend=c("Data","Predicted (linear)","Predicted (polynomial)"),lty=c(NA,1,1),pch=c(19,NA,NA),col=c("grey","red","blue"),text.col=c("grey","red","blue"))
about the interpretation of these results.
We will use the library “caret” for cross-validation training and control. This is not strictly necessary, but provides some helpful features in the computation of AIC and performing stepwise regression to search model space.
library(caret)
We will perform the following cross-validation tests on a very small subset of the diamonds dataset. You are encouraged to repeat these analyses on the full dataset. Before you do, think about what you predict will change when we have 50k samples instead of 2k?
First construct a test/train split:
set.seed(2)
## Downsample the data for computational convenience
mysamples=sample(dim(d2)[1],2000)
smalld2=d2[mysamples,]
## Make a test/train split
s=createDataPartition(1:dim(smalld2)[1],p=0.8,list=FALSE)
d2train=smalld2[s,]
d2test=smalld2[-s,]
Now we perform training. The Caret package distinguishes between three classes of data:
To do this efficiently, it uses K-fold CV to train on K-1 folds and validate on 1 fold of the training data. It can loop over all choices of validation fold, increasing the compute K-times but reducing sampling variability. The Testing data is then completely unpolluted by the training step.
Validation data is only needed if there is some hyper parameter to be fit, but it also can be of use in regression for model (i.e. variable) selection.
Q: How do we find out all of this about the implementation in the
caret
package? Are there other packages for Cross
Validation? When might they be appropriate?
Here is how we can train a linear and a quadratic model this way:
## Learn a model on the training data, and use it to predict the test data
## Definition of a linear model we will refer to as **lin**:
modelcv <- train(price ~ ., data = d2train, method = "lm",trControl=trainControl(method = "cv",number=5))
## Definition of an interaction model we will refer to as **int**:
modelcvint <- train(price ~ .^2, data = d2train, method = "lm",trControl=trainControl(method = "cv",number=5))
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
We can then compute the \(R^2\) between the observed and predicted values for both the test and train data:
comparison=c(testlin=R2(pred = predict(modelcv,d2test),obs = d2test$price),
trainlin=R2(pred = predict(modelcv,d2train),obs = d2train$price),
testint=R2(pred = predict(modelcvint,d2test),obs = d2test$price),
trainint=R2(pred = predict(modelcvint,d2train),obs = d2train$price)
)
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
print(comparison)
## testlin trainlin testint trainint
## 0.9201718 0.9253244 0.4455863 0.9805768
Q: What is going on here with the results? Q: What is going on with the column headers? How might we make it prettier without bloating the code?
We can now go on to perform step-wise regression to minimise AIC (setting trace=FALSE to suppress the details of the steps):
modelcvstep <- train(price~.,
d2train,
method="lmStepAIC", # Step wise AIC
direction="both", # Forward-backward stepwise selection
trControl=trainControl(method = "cv",number=5),trace=FALSE)
summary(modelcvstep)
##
## Call:
## lm(formula = .outcome ~ carat + cut.L + cut.Q + cut.C + color.L +
## color.Q + color.C + `color^5` + clarity.L + clarity.Q + clarity.C +
## `clarity^5` + `clarity^7` + depth + x + y + z, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6245.7 -542.0 -174.9 336.4 9645.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4643.31 3186.40 -1.457 0.145252
## carat 12618.18 309.18 40.812 < 2e-16 ***
## cut.L 650.30 123.07 5.284 1.44e-07 ***
## cut.Q -259.05 102.80 -2.520 0.011837 *
## cut.C 235.37 85.59 2.750 0.006027 **
## color.L -2112.73 95.55 -22.110 < 2e-16 ***
## color.Q -776.86 84.93 -9.147 < 2e-16 ***
## color.C -304.72 78.72 -3.871 0.000113 ***
## `color^5` -157.95 68.34 -2.311 0.020950 *
## clarity.L 3821.46 161.19 23.708 < 2e-16 ***
## clarity.Q -1636.05 126.68 -12.915 < 2e-16 ***
## clarity.C 895.36 134.29 6.667 3.59e-11 ***
## `clarity^5` 307.10 85.34 3.599 0.000330 ***
## `clarity^7` 111.18 63.89 1.740 0.082021 .
## depth 110.98 52.43 2.117 0.034446 *
## x -1536.53 620.60 -2.476 0.013394 *
## y 2113.26 637.64 3.314 0.000940 ***
## z -3447.03 851.02 -4.050 5.36e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1066 on 1582 degrees of freedom
## Multiple R-squared: 0.9252, Adjusted R-squared: 0.9244
## F-statistic: 1151 on 17 and 1582 DF, p-value: < 2.2e-16
Running step-wise regression for all quadratic terms is a lot slower. Here we restrict interactions to only quantitative variables which reduces the search space enough to make the inference tolerable:
## Definition of a model we will refer to as **intstep**:
modelcvintstep <- train(price~color + clarity + (depth +table+x+y+z)^2,
d2train,
method="lmStepAIC", # Step wise AIC
direction="both", # Forward-backward stepwise selection
trControl=trainControl(method = "cv",number=5),trace=FALSE)
summary(modelcvintstep)
##
## Call:
## lm(formula = .outcome ~ color.L + color.Q + color.C + `color^5` +
## clarity.L + clarity.Q + clarity.C + `clarity^5` + `clarity^7` +
## depth + table + x + `depth:table` + `depth:x` + `depth:z` +
## `table:y` + `table:z` + `x:y` + `x:z` + `y:z`, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4622.4 -527.0 -163.5 336.2 9323.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -85777.123 27516.509 -3.117 0.001858 **
## color.L -2086.457 92.211 -22.627 < 2e-16 ***
## color.Q -790.498 82.180 -9.619 < 2e-16 ***
## color.C -298.028 75.959 -3.924 9.10e-05 ***
## `color^5` -195.818 65.926 -2.970 0.003020 **
## clarity.L 3799.973 155.106 24.499 < 2e-16 ***
## clarity.Q -1521.154 123.194 -12.348 < 2e-16 ***
## clarity.C 884.626 130.106 6.799 1.49e-11 ***
## `clarity^5` 305.176 82.395 3.704 0.000220 ***
## `clarity^7` 118.552 61.667 1.922 0.054730 .
## depth 1396.958 434.402 3.216 0.001327 **
## table 2971.518 511.833 5.806 7.74e-09 ***
## x -12835.131 5034.996 -2.549 0.010891 *
## `depth:table` -43.991 8.319 -5.288 1.41e-07 ***
## `depth:x` 310.184 70.298 4.412 1.09e-05 ***
## `depth:z` -300.541 79.442 -3.783 0.000161 ***
## `table:y` -581.260 67.881 -8.563 < 2e-16 ***
## `table:z` 869.942 112.814 7.711 2.19e-14 ***
## `x:y` 3256.282 799.134 4.075 4.83e-05 ***
## `x:z` -7655.040 1023.143 -7.482 1.21e-13 ***
## `y:z` 3926.503 755.665 5.196 2.30e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1029 on 1579 degrees of freedom
## Multiple R-squared: 0.9304, Adjusted R-squared: 0.9295
## F-statistic: 1056 on 20 and 1579 DF, p-value: < 2.2e-16
We can now compare all of our predictions:
comparisonall= c(comparison,
teststep=R2(pred = predict(modelcvstep,d2test),obs = d2test$price),
trainstep=R2(pred = predict(modelcvstep,d2train),obs = d2train$price),
testintstep=R2(pred = predict(modelcvintstep,d2test),obs = d2test$price),
trainintstep=R2(pred = predict(modelcvintstep,d2train),obs = d2train$price)
)
print(comparisonall)
## testlin trainlin testint trainint teststep trainstep
## 0.9201718 0.9253244 0.4455863 0.9805768 0.9201460 0.9251857
## testintstep trainintstep
## 0.9247944 0.9304239
about the interpretation of these results.
We now repeat the analysis with the temperatures data. Temperature already has 1997 rows so there is no need to downsample to have a comparable sized dataset.
set.seed(1)
## Downsample the data for computational convenience
## Make a test/train split
temperature2=na.omit(temperature[,c("Time","MA")])
temps=createDataPartition(1:dim(temperature2)[1],p=0.8,list=FALSE)
temptrain=temperature2[temps,]
temptest=temperature2[-temps,]
Now we can train models exactly as above:
## Learn a model on the training data, and use it to predict the test data
tempmodelcv <- train(MA ~ Time, data = temptrain, method = "lm",trControl=trainControl(method = "cv",number=5))
tempmodelcvint <- train(MA ~ poly(Time,5), data = temptrain, method = "lm",trControl=trainControl(method = "cv",number=5))
And compute out-of-sample \(R^2\):
tempcompare=c(testlin=R2(pred = predict(tempmodelcv,temptest),obs = temptest$MA),
trainlin=R2(pred = predict(tempmodelcv,temptrain),obs = temptrain$MA),
testint=R2(pred = predict(tempmodelcvint,temptest),obs = temptest$MA),
trainint=R2(pred = predict(tempmodelcvint,temptrain),obs = temptrain$MA)
)
tempcompare
## testlin trainlin testint trainint
## 0.7051755 0.6388462 0.7977764 0.7796500
And of course an advantage of a 1d problem is that we can visualise it. This plot is for the test data:
temptestlmpred <- predict(tempmodelcv,temptest)
temptestpolypred <- predict(tempmodelcvint,temptest)
plot(temptest$Time,temptest$MA,xlab="Date",ylab="Monthly Anomaly",pch=".")
lines(temptest$Time,temptestlmpred,col="red",lwd=2)
lines(temptest$Time,temptestpolypred,col="blue",lwd=2)
legend("topleft",legend=c("Data","Predicted (linear)","Predicted (polynomial)"),lty=c(NA,1,1),pch=c(".",NA,NA),col=c("black","red","blue"),text.col=c("black","red","blue"))
about the interpretation of these results.