This workshop will be updated in future years.
if(!require("finalfit")) install.packages("finalfit")
## Loading required package: finalfit
library("finalfit")
This workshop examines different approaches for coping with missing data. We’ll start with some basic QC, using SVD to explore the data, and then look into missingness.
Contrast the analysis here with that provided with the package at https://cran.r-project.org/web/packages/finalfit/vignettes/missing.html.
The data comes from trhe finalfit package and is available after the “library” call:
summary(colon_s)
## id rx sex age obstruct
## Min. : 1 Obs :315 Min. :0.000 Min. :18.00 Min. :0.0000
## 1st Qu.:233 Lev :310 1st Qu.:0.000 1st Qu.:53.00 1st Qu.:0.0000
## Median :465 Lev+5FU:304 Median :1.000 Median :61.00 Median :0.0000
## Mean :465 Mean :0.521 Mean :59.75 Mean :0.1938
## 3rd Qu.:697 3rd Qu.:1.000 3rd Qu.:69.00 3rd Qu.:0.0000
## Max. :929 Max. :1.000 Max. :85.00 Max. :1.0000
## NA's :21
## perfor adhere nodes status
## Min. :0.00000 Min. :0.0000 Min. : 0.00 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.: 1.00 1st Qu.:0.0000
## Median :0.00000 Median :0.0000 Median : 2.00 Median :0.0000
## Mean :0.02906 Mean :0.1453 Mean : 3.66 Mean :0.4865
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.: 5.00 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.0000 Max. :33.00 Max. :1.0000
## NA's :18
## differ extent surg node4
## Min. :1.000 Min. :1.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:2.000 1st Qu.:3.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :2.000 Median :3.000 Median :0.0000 Median :0.0000
## Mean :2.063 Mean :2.887 Mean :0.2675 Mean :0.2745
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :3.000 Max. :4.000 Max. :1.0000 Max. :1.0000
## NA's :23 NA's :17
## time sex.factor rx.factor obstruct.factor perfor.factor
## Min. : 23 Female:445 Obs :315 No :732 No :902
## 1st Qu.: 806 Male :484 Lev :310 Yes :176 Yes: 27
## Median :1976 Lev+5FU:304 NA's: 21
## Mean :1670
## 3rd Qu.:2364
## Max. :3329
##
## adhere.factor differ.factor extent.factor surg.factor
## No :794 Well : 93 Submucosa : 21 Short:668
## Yes:135 Moderate:663 Muscle :106 Long :244
## Poor :150 Serosa :759 NA's : 17
## NA's : 23 Adjacent structures: 43
##
##
##
## node4.factor status.factor age.factor loccomp loccomp.factor
## No :674 Alive:477 <40 years : 70 Min. :0.0000 No :616
## Yes:255 Died :452 40-59 years:344 1st Qu.:0.0000 Yes :293
## 60+ years :515 Median :0.0000 NA's: 20
## Mean :0.3223
## 3rd Qu.:1.0000
## Max. :1.0000
## NA's :20
## time.years mort_5yr age.10 mort_5yr.num hospital
## Min. :0.06301 Alive:511 Min. :1.800 Min. :1.000 hospital_1:186
## 1st Qu.:2.20822 Died :404 1st Qu.:5.300 1st Qu.:1.000 hospital_2:186
## Median :5.41370 NA's : 14 Median :6.100 Median :1.000 hospital_3:185
## Mean :4.57522 Mean :5.975 Mean :1.442 hospital_4:186
## 3rd Qu.:6.47671 3rd Qu.:6.900 3rd Qu.:2.000 hospital_5:186
## Max. :9.12055 Max. :8.500 Max. :2.000
## NA's :14
There is a lot of EDA work to do here. Be sure to check
?colon
. We’d like to predict survival, here “mort_5yr”.
Most fields are duplicated by a factor; lets get rid of this, including getting rid of the second “mort” variable we would like to predict, and do a basic missingness check:
colon = as.data.frame(colon_s)[,-1] # remove the "id" column
colon=colon[,-grep("factor",colnames(colon))]
colon=colon[,!colnames(colon)%in%"mort_5yr.num"] # keep mort_5yr
colon=colon[,c(1:12,14:16)]
table(is.na(colon))
##
## FALSE TRUE
## 13822 113
Missingness is at <1% in this data. Lets check how to manage it:
hist(rowSums(is.na(colon)),breaks=seq(-0.5,6.5),xlab="number of missing values",ylab="count")
It isn’t clear from this structure whether missingness is correlated. Is it predictable? We will test this by using a glm for whether any data were missing in a row, and augmenting the columns of the data with a new column “anymissing”.
allpresent=colnames(colon)[colSums(is.na(colon))==0] # columns with no missingness
allpresent=allpresent[allpresent!="anymissing"] # in case you run this twice!
colonmiss=is.na(colon) # a dataframe of missingness
colon$anymissing = as.numeric(rowSums(is.na(colon))>0)
colonmissingmodel=glm(paste("anymissing~",paste(allpresent,collapse="+"),collapse=""),
data=colon,family = binomial)
colonmissingpred=predict(colonmissingmodel,newdata=colon)
cor(colonmissingpred,colon$anymissing) # predictability of missingness
## [1] 0.1179642
We find a correlation between missingness and the complete data of 0.18 which implies that missingness is not strongly predictable from this (alone). Which variables are associated?
summary(colonmissingmodel)
##
## Call:
## glm(formula = paste("anymissing~", paste(allpresent, collapse = "+"),
## collapse = ""), family = binomial, data = colon)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.245e+00 1.141e+00 -1.967 0.0491 *
## rxLev 2.095e-03 2.867e-01 0.007 0.9942
## rxLev+5FU 3.436e-01 2.720e-01 1.263 0.2065
## sex 2.461e-01 2.264e-01 1.087 0.2769
## age 4.871e-05 9.457e-03 0.005 0.9959
## perfor -8.831e-01 1.034e+00 -0.854 0.3932
## adhere -3.817e-01 3.726e-01 -1.024 0.3057
## status -6.863e-01 4.363e-01 -1.573 0.1157
## extent 1.182e-01 2.462e-01 0.480 0.6310
## node4 5.579e-01 2.510e-01 2.223 0.0262 *
## time.years -8.350e-02 9.083e-02 -0.919 0.3579
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 591.16 on 928 degrees of freedom
## Residual deviance: 577.58 on 918 degrees of freedom
## AIC: 599.58
##
## Number of Fisher Scoring iterations: 5
The hospital
and node4
variables are
associated, but little else.
We can therefore assume a missing (mostly) at random design to test predictability of the inferred values.
We note that colnames(colon)
include variables that we
expect to be strongly correlated, such as nodes
and
node4
. That is a problem for primary inference but is a
benefit for imputation…!
about QC concerns we could look into further.
We will start by making an SVD to explore the data. This will be a complete case analysis for simplicity.
tcolon=na.omit(colon)[,c("sex","age","obstruct","perfor","adhere","nodes","status","differ","extent","surg","node4","loccomp","time.years")]
tlabels=as.numeric(na.omit(colon)[,"mort_5yr"])
tsvd=svd(scale(tcolon))
plot(tsvd$u[,1:2],pch=19,col=tlabels,xlab="PC1",ylab="PC2")
legend("topleft",col=1:2,text.col=1:2,pch=19,legend=c("Survived","Died"))
This looks pretty interesting. It implies that there is a strong ability to tell between the two classes, but there is structure in the data that is not strongly informative about that.
## Feature plot
plot(tsvd$v[,c(1,2)],type="n",xlab="PC1",ylab="PC2")
text(tsvd$v[,c(1,2)],labels = colnames(tcolon),cex=0.5)
A few of these are essentially at the same location. These are:
tdist=as.matrix(dist(tsvd$v[,1:2]))
diag(tdist)=max(tdist)
thresh=0.05
for(x in 1:dim(tdist)[1]){
if(any(tdist[x,]<thresh)) print(paste(c(colnames(tcolon)[x],colnames(tcolon)[tdist[x,]<thresh]),collapse=","))
}
## [1] "nodes,node4"
## [1] "node4,nodes"
This is interesting and a sign of duplication, but not a sign that we’ve done something bad in the choice of features to retain.
about interpreting and analysing the SVD.
Lets build regression-based imputation models for each missing variable:
missingcols=colnames(colon)[colSums(is.na(colon))>0] # columns with no missingness
train=sample(1:dim(colon)[1])[1:700] # training data
test=(1:dim(colon)[1])[!(1:dim(colon)[1])%in%train] # 229 reserved for training...
#sapply(missingcols,function(x)length(table(colon[ttest,x])))
## correlation scores
allscores=c() #
allclasses=c() # correlation classes
colonimp=colon
for(v in missingcols){
ttest=test[!is.na(colon[test,v])] # get the subset of available values
tmiss=which(is.na(colon[,v])) # A list of missing data in this column
isbinary=length(table(colon[ttest,v]))==2 # whether this variable is a binary variable or not
if(isbinary){
tmodel=glm(paste(v,"~",paste(allpresent,collapse="+"),collapse=""),
data=colon[train,c(v,allpresent)],family = binomial)
}else{
tmodel=lm(paste(v,"~",paste(allpresent,collapse="+"),collapse=""),
data=colon[train,c(v,allpresent)])
}
## glm will automatically omit missing data in the training data
tpred=predict(tmodel,newdata=colon[ttest,],type="response") # For performance evaluation, predicting the test data
tscore=cor(as.numeric(colon[ttest,v]),tpred) ## We will use correlation as a score
names(isbinary)=names(tscore)=v # annoyingly R loses the names so we put the name on manually
print(paste("Correlation between observed and prediction for column",v,"=",tscore))
allscores=c(allscores,tscore)
allclasses=c(allclasses,isbinary)
## Now we have scores, we'll apply the prediction to the unseen data
tpredimpute=predict(tmodel,newdata=colon[tmiss,,drop=FALSE],type="response") # For imputation
if(tscore>0.6){ # If imputation is good enough, use it
if(isbinary){
if(is.factor(colonimp[,v])){
colonimp[tmiss,v]=levels(colonimp[tmiss,v])[1+round(tpredimpute)]
}else {
colonimp[tmiss,v]=round(tpredimpute)
}
}else{
colonimp[tmiss,v]=tpredimpute
}
}
}
## [1] "Correlation between observed and prediction for column obstruct = 0.107357362520901"
## [1] "Correlation between observed and prediction for column nodes = 0.739126388212164"
## [1] "Correlation between observed and prediction for column differ = 0.166604781790603"
## [1] "Correlation between observed and prediction for column surg = -0.0213258128151301"
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "Correlation between observed and prediction for column loccomp = 0.687417934554878"
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "Correlation between observed and prediction for column mort_5yr = 0.990985376138791"
We see we get plenty of warnings, but several variables are relatively well predicted.
c("cols with missingness"=sum(colSums(apply(colon,2,is.na))>0),
"after imputation"=sum(colSums(apply(colonimp,2,is.na))>0))
## cols with missingness after imputation
## 6 3
Its worth noting that one of the variables we predicted was
mort_5yr
; this a) bodes well for prediction, and b) could
be dangerous if we use imputed values to truth our findings!
about imputation.
We now make models predicting mortality:
modelunimputed=glm(mort_5yr~.,data=colon[train,],family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
modelimputed=glm(mort_5yr~.,data=colonimp[train,],family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
predunimputed=predict(modelunimputed,newdata=colon[test,],type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
predimputed=predict(modelimputed,newdata=colonimp[test,],type="response")
This is how to examine the model structure:
summary(modelunimputed)
##
## Call:
## glm(formula = mort_5yr ~ ., family = binomial, data = colon[train,
## ])
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.379e+02 1.430e+05 0.002 0.999
## rxLev -1.956e+01 1.981e+04 -0.001 0.999
## rxLev+5FU -2.304e+01 4.670e+04 0.000 1.000
## sex -9.090e+00 1.141e+04 -0.001 0.999
## age -1.150e-01 1.092e+03 0.000 1.000
## obstruct -5.411e+00 1.491e+05 0.000 1.000
## perfor -2.615e+01 2.164e+06 0.000 1.000
## adhere 1.175e+01 1.605e+05 0.000 1.000
## nodes 2.184e+00 2.851e+03 0.001 0.999
## status 4.020e+01 1.879e+04 0.002 0.998
## differ -1.402e+01 2.389e+04 -0.001 1.000
## extent 1.882e+01 1.605e+04 0.001 0.999
## surg -1.842e+01 1.523e+04 -0.001 0.999
## node4 1.195e+01 3.325e+04 0.000 1.000
## loccomp -3.546e+01 1.859e+05 0.000 1.000
## time.years -5.610e+01 1.033e+04 -0.005 0.996
## anymissing NA NA NA NA
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8.6417e+02 on 629 degrees of freedom
## Residual deviance: 8.7165e-08 on 614 degrees of freedom
## (70 observations deleted due to missingness)
## AIC: 32
##
## Number of Fisher Scoring iterations: 25
Note that no p-values are significant, and 3 coefficients are not defined due to singularities.
Finally we can evaluate the predictive performance. We can do this via correlation:
tres=na.omit(cbind(obs=as.numeric(colon[test,"mort_5yr"])-1,
predunimputed=predunimputed,
predimputed=predimputed))
cor(tres)
## obs predunimputed predimputed
## obs 1.0000000 0.9722598 0.9903956
## predunimputed 0.9722598 1.0000000 0.9631815
## predimputed 0.9903956 0.9631815 1.0000000
Or via “receiver operator curves” which we examine properly in the next two blocks:
require("pROC")
## Loading required package: pROC
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
unimputedroc=roc(obs~predunimputed,data=as.data.frame(tres))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
imputedroc=roc(obs~predimputed,data=as.data.frame(tres))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
cor(na.omit(tres))
## obs predunimputed predimputed
## obs 1.0000000 0.9722598 0.9903956
## predunimputed 0.9722598 1.0000000 0.9631815
## predimputed 0.9903956 0.9631815 1.0000000
plot(1-unimputedroc$specificities,unimputedroc$sensitivities,type="l",
xlab="False Positive Rate",ylab="True Positive Rate",xlim=c(0,0.05),col=1)
lines(1-imputedroc$specificities,imputedroc$sensitivities,col=2)
In both measures, the performance is very good but imputation made the prediction worse.
about overall results and interpretation.