0. Requirements

This workshop will be updated in future years.

if(!require("finalfit")) install.packages("finalfit")
## Loading required package: finalfit
library("finalfit")

1. Overview and data

This workshop examines different approaches for coping with missing data. We’ll start with some basic QC and a check on missingness, use an SVD to explore the data, and then build and evaluate imputation models.

Contrast the analysis here with that provided with the package at https://cran.r-project.org/web/packages/finalfit/vignettes/missing.html.

1.1 Data input

The data come from the finalfit package and are available after the library() call:

summary(colon_s)
##        id            rx           sex             age           obstruct     
##  Min.   :  1   Obs    :315   Min.   :0.000   Min.   :18.00   Min.   :0.0000  
##  1st Qu.:233   Lev    :310   1st Qu.:0.000   1st Qu.:53.00   1st Qu.:0.0000  
##  Median :465   Lev+5FU:304   Median :1.000   Median :61.00   Median :0.0000  
##  Mean   :465                 Mean   :0.521   Mean   :59.75   Mean   :0.1938  
##  3rd Qu.:697                 3rd Qu.:1.000   3rd Qu.:69.00   3rd Qu.:0.0000  
##  Max.   :929                 Max.   :1.000   Max.   :85.00   Max.   :1.0000  
##                                                              NA's   :21      
##      perfor            adhere           nodes           status      
##  Min.   :0.00000   Min.   :0.0000   Min.   : 0.00   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.: 1.00   1st Qu.:0.0000  
##  Median :0.00000   Median :0.0000   Median : 2.00   Median :0.0000  
##  Mean   :0.02906   Mean   :0.1453   Mean   : 3.66   Mean   :0.4865  
##  3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.: 5.00   3rd Qu.:1.0000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :33.00   Max.   :1.0000  
##                                     NA's   :18                      
##      differ          extent           surg            node4       
##  Min.   :1.000   Min.   :1.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:2.000   1st Qu.:3.000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :2.000   Median :3.000   Median :0.0000   Median :0.0000  
##  Mean   :2.063   Mean   :2.887   Mean   :0.2675   Mean   :0.2745  
##  3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :3.000   Max.   :4.000   Max.   :1.0000   Max.   :1.0000  
##  NA's   :23                      NA's   :17                       
##       time       sex.factor    rx.factor   obstruct.factor perfor.factor
##  Min.   :  23   Female:445   Obs    :315   No  :732        No :902      
##  1st Qu.: 806   Male  :484   Lev    :310   Yes :176        Yes: 27      
##  Median :1976                Lev+5FU:304   NA's: 21                     
##  Mean   :1670                                                           
##  3rd Qu.:2364                                                           
##  Max.   :3329                                                           
##                                                                         
##  adhere.factor  differ.factor             extent.factor surg.factor
##  No :794       Well    : 93   Submucosa          : 21   Short:668  
##  Yes:135       Moderate:663   Muscle             :106   Long :244  
##                Poor    :150   Serosa             :759   NA's : 17  
##                NA's    : 23   Adjacent structures: 43              
##                                                                    
##                                                                    
##                                                                    
##  node4.factor status.factor       age.factor     loccomp       loccomp.factor
##  No :674      Alive:477     <40 years  : 70   Min.   :0.0000   No  :616      
##  Yes:255      Died :452     40-59 years:344   1st Qu.:0.0000   Yes :293      
##                             60+ years  :515   Median :0.0000   NA's: 20      
##                                               Mean   :0.3223                 
##                                               3rd Qu.:1.0000                 
##                                               Max.   :1.0000                 
##                                               NA's   :20                     
##    time.years       mort_5yr       age.10       mort_5yr.num         hospital  
##  Min.   :0.06301   Alive:511   Min.   :1.800   Min.   :1.000   hospital_1:186  
##  1st Qu.:2.20822   Died :404   1st Qu.:5.300   1st Qu.:1.000   hospital_2:186  
##  Median :5.41370   NA's : 14   Median :6.100   Median :1.000   hospital_3:185  
##  Mean   :4.57522               Mean   :5.975   Mean   :1.442   hospital_4:186  
##  3rd Qu.:6.47671               3rd Qu.:6.900   3rd Qu.:2.000   hospital_5:186  
##  Max.   :9.12055               Max.   :8.500   Max.   :2.000                   
##                                                NA's   :14

There is a lot of EDA work to do here; be sure to check ?colon for the variable definitions. We would like to predict five-year survival, here “mort_5yr”.
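
The finalfit package itself ships helpers for this kind of overview; as a quick starting point, ff_glimpse() summarises each variable (continuous and categorical separately) and missing_plot() shows where the NAs fall by row:

ff_glimpse(colon_s)
missing_plot(colon_s)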

1.2 QC

Most fields are duplicated by a factor version. Let’s remove these duplicates, drop the second encoding of the “mort” outcome we would like to predict, and do a basic missingness check:

colon = as.data.frame(colon_s)[,-1] # remove the "id" column
colon=colon[,-grep("factor",colnames(colon))] # remove the duplicate factor columns
colon=colon[,!colnames(colon)%in%"mort_5yr.num"] # keep mort_5yr
colon=colon[,c(1:12,14:16)] # drop "time" (we keep "time.years"), "age.10" and "hospital"
table(is.na(colon))
## 
## FALSE  TRUE 
## 13822   113

Missingness is below 1% in these data. Let’s check how it is distributed across rows:

hist(rowSums(is.na(colon)),breaks=seq(-0.5,6.5),xlab="number of missing values",ylab="count")
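
Another useful view, since finalfit is loaded, is the co-occurrence pattern of NAs across columns (missing_pattern() wraps mice::md.pattern):

missing_pattern(colon)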

1.3 Predicting missingness

It isn’t clear from this structure whether missingness is correlated across variables. Is it predictable? We will test this by fitting a glm for whether any value in a row is missing, augmenting the data with a new column “anymissing”.

allpresent=colnames(colon)[colSums(is.na(colon))==0] # columns with no missingness
allpresent=allpresent[allpresent!="anymissing"] # in case you run this twice!
colonmiss=is.na(colon) # a logical matrix of missingness (used again later)
colon$anymissing = as.numeric(rowSums(is.na(colon))>0)
colonmissingmodel=glm(paste("anymissing~",paste(allpresent,collapse="+"),collapse=""),
                      data=colon,family = binomial)
colonmissingpred=predict(colonmissingmodel,newdata=colon)
cor(colonmissingpred,colon$anymissing) # predictability of missingness
## [1] 0.1179642
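
Correlation is a crude score for a binary outcome. As a cross-check we could also compute the AUC of the same predictions; a sketch using the pROC package (loaded properly later in this workshop):

require("pROC")
missroc=roc(colon$anymissing,colonmissingpred)
auc(missroc) # an AUC near 0.5 would mean missingness is essentially unpredictable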

We find a correlation of about 0.12 between predicted and actual missingness, which implies that missingness is not strongly predictable from the fully observed variables (alone). Which variables are associated?

summary(colonmissingmodel)
## 
## Call:
## glm(formula = paste("anymissing~", paste(allpresent, collapse = "+"), 
##     collapse = ""), family = binomial, data = colon)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -2.245e+00  1.141e+00  -1.967   0.0491 *
## rxLev        2.095e-03  2.867e-01   0.007   0.9942  
## rxLev+5FU    3.436e-01  2.720e-01   1.263   0.2065  
## sex          2.461e-01  2.264e-01   1.087   0.2769  
## age          4.871e-05  9.457e-03   0.005   0.9959  
## perfor      -8.831e-01  1.034e+00  -0.854   0.3932  
## adhere      -3.817e-01  3.726e-01  -1.024   0.3057  
## status      -6.863e-01  4.363e-01  -1.573   0.1157  
## extent       1.182e-01  2.462e-01   0.480   0.6310  
## node4        5.579e-01  2.510e-01   2.223   0.0262 *
## time.years  -8.350e-02  9.083e-02  -0.919   0.3579  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 591.16  on 928  degrees of freedom
## Residual deviance: 577.58  on 918  degrees of freedom
## AIC: 599.58
## 
## Number of Fisher Scoring iterations: 5

Only the node4 variable shows a significant association; little else does.

We can therefore proceed under a (mostly) missing-at-random assumption and test the predictability of the imputed values.

We note that colnames(colon) include variables that we expect to be strongly correlated, such as nodes and node4. That is a problem for primary inference but is a benefit for imputation…!
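
For example, we can confirm the nodes/node4 relationship directly (node4 flags more than 4 positive nodes, so we expect a strong correlation):

cor(colon$nodes,colon$node4,use="complete.obs")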

See Q1 in Block 4 Portfolio about QC concerns we could look into further.

2. SVD

We will start by computing an SVD to explore the data. This will be a complete-case analysis for simplicity.

tcolon=na.omit(colon)[,c("sex","age","obstruct","perfor","adhere","nodes","status","differ","extent","surg","node4","loccomp","time.years")]
tlabels=as.numeric(na.omit(colon)[,"mort_5yr"])
tsvd=svd(scale(tcolon))
plot(tsvd$u[,1:2],pch=19,col=tlabels,xlab="PC1",ylab="PC2") 
legend("topleft",col=1:2,text.col=1:2,pch=19,legend=c("Survived","Died"))

This looks pretty interesting. It implies that the two classes can be distinguished quite well, but also that there is structure in the data that is not strongly informative about class membership.
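
One way to quantify how much structure each component carries is the variance explained, computable from the singular values:

plot(tsvd$d^2/sum(tsvd$d^2),type="b",xlab="Component",ylab="Proportion of variance explained")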

## Feature plot
plot(tsvd$v[,c(1,2)],type="n",xlab="PC1",ylab="PC2") 
text(tsvd$v[,c(1,2)],labels = colnames(tcolon),cex=0.5)

A few of these are essentially at the same location. These are:

tdist=as.matrix(dist(tsvd$v[,1:2]))
diag(tdist)=max(tdist)
thresh=0.05
for(x in 1:dim(tdist)[1]){
  if(any(tdist[x,]<thresh)) print(paste(c(colnames(tcolon)[x],colnames(tcolon)[tdist[x,]<thresh]),collapse=","))
}
## [1] "nodes,node4"
## [1] "node4,nodes"

This is interesting and a sign of duplication, but not a sign that we’ve done something bad in the choice of features to retain.

See Q2 in Block 4 Portfolio about interpreting and analysing the SVD.

3. Imputation modelling

3.1 Building an imputation model

Let’s build regression-based imputation models for each variable with missing values:

missingcols=colnames(colon)[colSums(is.na(colon))>0] # columns with missingness
train=sample(1:dim(colon)[1])[1:700] # 700 rows of training data
test=(1:dim(colon)[1])[!(1:dim(colon)[1])%in%train] # the remaining 229 rows, reserved for testing

## correlation scores
allscores=c() # correlation between observed and predicted values, per variable
allclasses=c() # whether each imputed variable is binary
colonimp=colon # a copy that will receive the imputed values
for(v in missingcols){
  ttest=test[!is.na(colon[test,v])] # the test rows where this variable is observed
  tmiss=which(is.na(colon[,v])) # A list of missing data in this column
  isbinary=length(table(colon[ttest,v]))==2 # whether this variable is a binary variable or not
  if(isbinary){
      tmodel=glm(paste(v,"~",paste(allpresent,collapse="+"),collapse=""),
                      data=colon[train,c(v,allpresent)],family = binomial)
  }else{
    tmodel=lm(paste(v,"~",paste(allpresent,collapse="+"),collapse=""),
                      data=colon[train,c(v,allpresent)]) 
  }
  ## glm will automatically omit missing data in the training data
  tpred=predict(tmodel,newdata=colon[ttest,],type="response") # For performance evaluation, predicting the test data
  tscore=cor(as.numeric(colon[ttest,v]),tpred) ## We will use correlation as a score
  names(isbinary)=names(tscore)=v # annoyingly R loses the names so we put the name on manually 
  print(paste("Correlation between observed and prediction for column",v,"=",tscore))
  allscores=c(allscores,tscore)
  allclasses=c(allclasses,isbinary)
  ## Now we have scores, we'll apply the prediction to the unseen data
  tpredimpute=predict(tmodel,newdata=colon[tmiss,,drop=FALSE],type="response") # For imputation
  if(tscore>0.6){ # If imputation is good enough, use it
    if(isbinary){
      if(is.factor(colonimp[,v])){ 
        colonimp[tmiss,v]=levels(colonimp[tmiss,v])[1+round(tpredimpute)]
      }else {
        colonimp[tmiss,v]=round(tpredimpute)
      }
    }else{
      colonimp[tmiss,v]=tpredimpute
    }
  }
}
## [1] "Correlation between observed and prediction for column obstruct = 0.107357362520901"
## [1] "Correlation between observed and prediction for column nodes = 0.739126388212164"
## [1] "Correlation between observed and prediction for column differ = 0.166604781790603"
## [1] "Correlation between observed and prediction for column surg = -0.0213258128151301"
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "Correlation between observed and prediction for column loccomp = 0.687417934554878"
## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "Correlation between observed and prediction for column mort_5yr = 0.990985376138791"

We get plenty of warnings (largely from glm fits where fitted probabilities hit 0 or 1), but several variables are relatively well predicted.
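
Since we kept the scores, we can list them and see which variables cleared the 0.6 threshold used above:

round(allscores,3)
names(allscores)[allscores>0.6] # the imputations we actually applied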

c("cols with missingness"=sum(colSums(apply(colon,2,is.na))>0),
  "after imputation"=sum(colSums(apply(colonimp,2,is.na))>0))
## cols with missingness      after imputation 
##                     6                     3

It’s worth noting that one of the variables we imputed was mort_5yr itself; this (a) bodes well for prediction, and (b) could be dangerous if we treat imputed values as ground truth when evaluating our findings!
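
If we wanted to guard against (b), one option (a sketch; colonimp_noy is a new name) is to blank the imputed outcome back out before any modelling, using the colonmiss matrix we stored earlier:

colonimp_noy=colonimp # a copy in which the outcome stays unimputed
colonimp_noy$mort_5yr[colonmiss[,"mort_5yr"]]=NA # undo the outcome imputation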

See Q3 in Block 4 Portfolio about imputation.

3.2 Imputation model before and after

We now make models predicting mortality:

modelunimputed=glm(mort_5yr~.,data=colon[train,],family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
modelimputed=glm(mort_5yr~.,data=colonimp[train,],family = binomial)
## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
predunimputed=predict(modelunimputed,newdata=colon[test,],type="response") 
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
predimputed=predict(modelimputed,newdata=colonimp[test,],type="response") 

This is how to examine the model structure:

summary(modelunimputed)
## 
## Call:
## glm(formula = mort_5yr ~ ., family = binomial, data = colon[train, 
##     ])
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)  2.379e+02  1.430e+05   0.002    0.999
## rxLev       -1.956e+01  1.981e+04  -0.001    0.999
## rxLev+5FU   -2.304e+01  4.670e+04   0.000    1.000
## sex         -9.090e+00  1.141e+04  -0.001    0.999
## age         -1.150e-01  1.092e+03   0.000    1.000
## obstruct    -5.411e+00  1.491e+05   0.000    1.000
## perfor      -2.615e+01  2.164e+06   0.000    1.000
## adhere       1.175e+01  1.605e+05   0.000    1.000
## nodes        2.184e+00  2.851e+03   0.001    0.999
## status       4.020e+01  1.879e+04   0.002    0.998
## differ      -1.402e+01  2.389e+04  -0.001    1.000
## extent       1.882e+01  1.605e+04   0.001    0.999
## surg        -1.842e+01  1.523e+04  -0.001    0.999
## node4        1.195e+01  3.325e+04   0.000    1.000
## loccomp     -3.546e+01  1.859e+05   0.000    1.000
## time.years  -5.610e+01  1.033e+04  -0.005    0.996
## anymissing          NA         NA      NA       NA
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8.6417e+02  on 629  degrees of freedom
## Residual deviance: 8.7165e-08  on 614  degrees of freedom
##   (70 observations deleted due to missingness)
## AIC: 32
## 
## Number of Fisher Scoring iterations: 25

Note that no p-values are significant, and one coefficient (anymissing) is not defined because of singularities: glm drops rows with missing values, so within the retained complete cases anymissing is identically zero and collinear with the intercept.
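
The near-zero residual deviance suggests complete separation: mort_5yr appears to be derived from status and time.years, so these predictors essentially encode the outcome. A quick check of that suspicion:

table(status=colon$status,mort_5yr=colon$mort_5yr,useNA="ifany")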

Finally we can evaluate the predictive performance. We can do this via correlation:

tres=na.omit(cbind(obs=as.numeric(colon[test,"mort_5yr"])-1,
           predunimputed=predunimputed,
           predimputed=predimputed))
cor(tres)
##                     obs predunimputed predimputed
## obs           1.0000000     0.9722598   0.9903956
## predunimputed 0.9722598     1.0000000   0.9631815
## predimputed   0.9903956     0.9631815   1.0000000

Or via “receiver operating characteristic” (ROC) curves, which we examine properly in the next two blocks:

require("pROC")
## Loading required package: pROC
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
unimputedroc=roc(obs~predunimputed,data=as.data.frame(tres))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
imputedroc=roc(obs~predimputed,data=as.data.frame(tres))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Plot both ROC curves, zooming in on the low false-positive region
plot(1-unimputedroc$specificities,unimputedroc$sensitivities,type="l",
         xlab="False Positive Rate",ylab="True Positive Rate",xlim=c(0,0.05),col=1)
lines(1-imputedroc$specificities,imputedroc$sensitivities,col=2)
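
We can also summarise each curve in a single number with pROC’s auc() helper:

auc(unimputedroc)
auc(imputedroc)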

In both measures the performance is very good. Note that, by the correlation measure at least, imputation actually made the prediction slightly better (0.990 vs 0.972 against the observed outcomes).

See Q4 in Block 4 Portfolio about overall results and interpretation.