# 02.3 Regression and Cross-Validation Workshop Here we will consider two datasets that exhibit different structural features and therefore different implications for Cross-Validation. ## 0. Requirements Its good practice to get the requirements right at the top. The following solution checks for the requirements, and then installs them if they are not present. ```{r} if(!require("ggplot2")) install.packages("ggplot2") if(!require("caret")) install.packages("caret") ``` ## 1. Data ### 1.1 Diamonds Dataset The Diamonds Dataset is a classic dataset for over 50k diamonds, describing their price and features. You can read more about it in many places, for example [Kaggle](https://www.kaggle.com/datasets/shivam2503/diamonds) (though it did not originate there). Its built into the `ggplot2` package in R and can be accessed directly: ```{r} library("ggplot2") # Ironically we're using this for the data, and not for the plotting! head(diamonds) ``` The ordinal features are "cut", "color" and "clarity", with "x" "y" and "z" describing physical dimensions and "depth" and "table" being further shape characteristics relating those physical dimensions (expressed as percentages). There are some data issues which we do not deal with fully, but can partially address by removing any data with impossible spatial measurements: ```{r} d2=data.frame(diamonds) numericcols=c(1,5:10) catcols=2:4 ## Perform a minimal data sanitization step d2=d2[-which((d2$x==0)|(d2$y==0)|(d2$z==0)|(d2$y>15)|(d2$z>15)),] plot(d2[,numericcols],pch=19,col="#00000033",cex=0.7) ``` The Diamonds dataset is feature rich and may be expected to have independence between observations. ### 1.2 Berkeley Earth Temperature Record The Earth Temperature record shows the famous "hockey stick" of global warming, containing only the high-resolution data from 1850 onwards. It contains multiple timescales of both land and ocean global temperatures, and contains uncertainty. Reference: Rohde, R. A. and Hausfather, Z.: The Berkeley Earth Land/Ocean Temperature Record, Earth Syst. Sci. Data, 12, 3469-3479, https://doi.org/10.5194/essd-12-3469-2020, 2020. Because this dataset is small, it can be accessed directly from the internet, with no need to download locally. The format is slightly confusing with separate blocks of content for land and ocean, which can be worked around by just reading the correct lines for the block we are interested in. These are the land temperatures measured with respect to an 1850 baseline: ```{r} # Monthly Annual Five-year Ten-year Twenty-year # Year, Month, Anomaly, Unc., Anomaly, Unc., Anomaly, Unc., Anomaly, Unc., Anomaly, Unc.ß temperature=read.table("https://berkeley-earth-temperature.s3.us-west-1.amazonaws.com/Global/Land_and_Ocean_complete.txt",skip = 86,nrows = 1997) colnames(temperature)=c("Year","Month","MA","MACI","AA","AACI","A5","A5CI","A10","A10CI","A20","A20CI") temperature$Time=temperature$Year+(temperature$Month-1)/12 plot(temperature$Time,temperature$MA,xlab="Date",ylab="Monthly Anomaly") ``` Although there is considerable heterogeneity in the confidence in the monthly estimates, due to reduced data back in time and structural variation in the signal, we will not deal with that in this workshop. Here are the confidence interval sizes: ```{r} plot(temperature$Time,temperature$MACI,xlab="Date",ylab="Monthly Anomaly Confidence",ylim=c(0,0.55)) ``` The organisation separated out "Year" and "Month" so we create a continuous time variable called "Time" combining them. Each average (Month, Year, 5 Year, 10 Year and 20 Year) comes with a corresponding confidence 95% interval. In contrast to the Diamonds data, there are no covariates and no expectation of independence between the observations. ## 2. Exploratory analyses As an initial exploration we will run some linear models using all of the data, to establish whether there is power and what broad signals are present. ### 2.1 Diamonds We will try to predict price, because of course...! We start with a regression using the continuous variables. ```{r} dlm2=lm(price~carat + depth + table +x+y+z,data=d2) summary(dlm2) ``` R reports the regression estimates and the p-value for each parameter marginally, as well as the F-statistic which quantifies whether the combined covariates have predicted power at all. We now examine the complete regression of price against all other covariates. ```{r} dlm=lm(price~.,data=d2) summary(dlm) ``` To interpret this, it is essential to understand how categories are treated. * R implicitly **one-hot encodes** categorical variables, i.e. constructs a column for each category, containing value 1 if the observation took that category and 0 otherwise. * However, these are *ordered* factors (check using `class(d2$clarity)`), which get special treatment (see https://library.virginia.edu/data/articles/understanding-ordered-factors-in-a-linear-model), using a *linear*, *quadratic*, *cubic*, etc model on the *order*. There are always *n-1* powers included for *n* categories. Here is the order for the *cut* variable: ```{r} table(d2$cut) ``` #### See **Question B2W1** in Block 2 Portfolio about the interpretation of these results. ### 2.2 Temperatures We first check whether there is a significant change over time using a standard regression: ```{r} templm=lm(MA~Time,data=temperature) summary(templm) ``` Of course the answer is yes. However, the data were clearly non-linear. Lets now fit a polynomial to the data: ```{r} head(temperature) temppoly=lm(MA~poly(Time,5),data=temperature) summary(temppoly) ``` Perhaps unsurprisingly, all terms are significant, though the symmetric 4-th degree term has lower significance. ```{r} templmpred <- predict(templm,temperature) temppolypred <- predict(temppoly,temperature) plot(temperature$Time,temperature$MA,xlab="Date",ylab="Monthly Anomaly",pch=19,cex=0.5,col="grey") lines(temperature$Time,templmpred,col="red",lwd=2) lines(temperature$Time,temppolypred,col="blue",lwd=2) legend("topleft",legend=c("Data","Predicted (linear)","Predicted (polynomial)"),lty=c(NA,1,1),pch=c(19,NA,NA),col=c("grey","red","blue"),text.col=c("grey","red","blue")) ``` #### See **Question B2W2** in Block 2 Portfolio about the interpretation of these results. ## 3. Out of sample prediction We will use the library "caret" for cross-validation training and control. This is not strictly necessary, but provides some helpful features in the computation of AIC and performing *stepwise* regression to search model space. ```{r} library(caret) ``` ### 3.1 Diamonds We will perform the following cross-validation tests on a very small subset of the diamonds dataset. You are encouraged to repeat these analyses on the full dataset. Before you do, think about what you predict will change when we have 50k samples instead of 2k? First construct a test/train split: ```{r} set.seed(2) ## Downsample the data for computational convenience mysamples=sample(dim(d2)[1],2000) smalld2=d2[mysamples,] ## Make a test/train split s=createDataPartition(1:dim(smalld2)[1],p=0.8,list=FALSE) d2train=smalld2[s,] d2test=smalld2[-s,] ``` Now we perform training. The Caret package distinguishes between three classes of data: * Training * Validation * Testing To do this efficiently, it uses K-fold CV to train on K-1 folds and validate on 1 fold of the training data. It can loop over all choices of validation fold, increasing the compute K-times but reducing sampling variability. The Testing data is then completely unpolluted by the training step. Validation data is only needed if there is some hyper parameter to be fit, but it also can be of use in regression for model (i.e. variable) selection. Here is how we can train a linear and a quadratic model this way: ```{r} ## Learn a model on the training data, and use it to predict the test data ## Definition of a linear model we will refer to as **lin**: modelcv <- train(price ~ ., data = d2train, method = "lm",trControl=trainControl(method = "cv",number=5)) ## Definition of an interaction model we will refer to as **int**: modelcvint <- train(price ~ .^2, data = d2train, method = "lm",trControl=trainControl(method = "cv",number=5)) ``` We can then compute the $R^2$ between the observed and predicted values for both the test and train data: ```{r} comparison=c(testlin=R2(pred = predict(modelcv,d2test),obs = d2test$price), trainlin=R2(pred = predict(modelcv,d2train),obs = d2train$price), testint=R2(pred = predict(modelcvint,d2test),obs = d2test$price), trainint=R2(pred = predict(modelcvint,d2train),obs = d2train$price) ) print(comparison) ``` We can now go on to perform step-wise regression to minimise AIC (setting trace=FALSE to suppress the details of the steps): ```{r} modelcvstep <- train(price~., d2train, method="lmStepAIC", # Step wise AIC direction="both", # Forward-backward stepwise selection trControl=trainControl(method = "cv",number=5),trace=FALSE) summary(modelcvstep) ``` Running step-wise regression for all quadratic terms is a **lot** slower. Here we restrict interactions to only quantitative variables which reduces the search space enough to make the inference tolerable: ```{r} ## Definition of a model we will refer to as **intstep**: modelcvintstep <- train(price~color + clarity + (depth +table+x+y+z)^2, d2train, method="lmStepAIC", # Step wise AIC direction="both", # Forward-backward stepwise selection trControl=trainControl(method = "cv",number=5),trace=FALSE) summary(modelcvintstep) ``` We can now compare all of our predictions: ```{r} comparisonall= c(comparison, teststep=R2(pred = predict(modelcvstep,d2test),obs = d2test$price), trainstep=R2(pred = predict(modelcvstep,d2train),obs = d2train$price), testintstep=R2(pred = predict(modelcvintstep,d2test),obs = d2test$price), trainintstep=R2(pred = predict(modelcvintstep,d2train),obs = d2train$price) ) print(comparisonall) ``` #### See **Question B2W3** in Block 2 Portfolio about the interpretation of these results. ### 3.2 Temperatures Cross Validation We now repeat the analysis with the temperatures data. Temperature already has 1997 rows so there is no need to downsample to have a comparable sized dataset. ```{r} set.seed(1) ## Downsample the data for computational convenience ## Make a test/train split temperature2=na.omit(temperature[,c("Time","MA")]) temps=createDataPartition(1:dim(temperature2)[1],p=0.8,list=FALSE) temptrain=temperature2[temps,] temptest=temperature2[-temps,] ``` Now we can train models exactly as above: ```{r} ## Learn a model on the training data, and use it to predict the test data tempmodelcv <- train(MA ~ Time, data = temptrain, method = "lm",trControl=trainControl(method = "cv",number=5)) tempmodelcvint <- train(MA ~ poly(Time,5), data = temptrain, method = "lm",trControl=trainControl(method = "cv",number=5)) ``` And compute out-of-sample $R^2$: ```{r} tempcompare=c(testlin=R2(pred = predict(tempmodelcv,temptest),obs = temptest$MA), trainlin=R2(pred = predict(tempmodelcv,temptrain),obs = temptrain$MA), testint=R2(pred = predict(tempmodelcvint,temptest),obs = temptest$MA), trainint=R2(pred = predict(tempmodelcvint,temptrain),obs = temptrain$MA) ) tempcompare ``` And of course an advantage of a 1d problem is that we can visualise it. This plot is for the test data: ```{r} temptestlmpred <- predict(tempmodelcv,temptest) temptestpolypred <- predict(tempmodelcvint,temptest) plot(temptest$Time,temptest$MA,xlab="Date",ylab="Monthly Anomaly",pch=".") lines(temptest$Time,temptestlmpred,col="red",lwd=2) lines(temptest$Time,temptestpolypred,col="blue",lwd=2) legend("topleft",legend=c("Data","Predicted (linear)","Predicted (polynomial)"),lty=c(NA,1,1),pch=c(".",NA,NA),col=c("black","red","blue"),text.col=c("black","red","blue")) ``` #### See **Question B2W4** in Block 2 Portfolio about the interpretation of these results.