Missingness

Missingness is a fact of life in data analysis. It can arise for many reasons, including:

People not answering questions in surveys
Data corruption
Data not being recorded
Data not being available (e.g. due to privacy or commercial sensitivity)

The critical question for us is to establish how large the problem is, and whether it is likely to bias our results. How we deal with it depends on the nature of the missingness and the size of the dataset.

In this example we focus on multiple imputation, which is a principled way to deal with missingness, but only appropriate for small-to-medium sized datasets. The worksheet gets far enough to show you how to implement an analysis in the presence of missing data, but does not go into depth about how to evaluate the results. This is left as an exercise for Portfolio 4.

Background on Missing Data

Following Andrew Gelman’s notes:

Types of Missingness

The standard summary of the types of missingness that can occur in data, and how they impact analysis, are:

Missingness completely at random (MCAR). Missingness is independent of everything. For example, if each survey respondent decides whether to answer the “earnings” question by rolling a die and refusing to answer if a “6” shows up. Discarding cases with missing data in MCAR does not bias your inferences.
Missingness at random (MAR). The probability a variable is missing depends only on available information. Thus, if sex, race, education, and age are recorded for all the people in the survey, then “earnings” is missing at random if the probability of nonresponse to this question depends only on these fully recorded variables. In MAR data “nonresponse bias” can thus be avoided by modelling the missingness using the observed predictors.

Not Missing at Random breaks down into:

Missingness that depends on unobserved predictors. In the example above, if age affects earnings and was not recorded then data are not missing at random. It could in principle be modelled, if we were able to learn a suitable model to predict age but this either requires strong assumptions or further data.
Missingness that depends on the missing value itself. This is known as censorin. Examples might include people with high earnings being less likely to report them (for tax, embarrassment, etc reasons), or survival study (e.g. for HIV treatments) we cannot know about people who will go on to die of HIV in the future. Again, models or robust statistics are needed to make progress.

Methods

Complete-case analysis involves discarding records that contain missingness. It is accurate (but increases variance of estimators) for MCAR data, but can be biased for MAR data. For example cov(na.omit(X)) in R computes the covariance matrix using only complete cases, as does cov(X,use="complete.obs").
Available-case analysis involves using all available required data for each analysis. Because we don’t discard rows that contain missingness in variables we don’t need, this increases samples and therefore reduces variance compared to complete-case analysis. However, it means that we have different data going into each calculation. As it can be biased for MAR data, and different biases for each calculation, its not usually recommended. For example cov(X,use="everything") in R computes the covariance matrix using every pair of complete data.
Imputation involves filling in missing values.

3a. Simple imputation involves replacing missing data with a default value such as the mean.

3b. Regression imputation involves predicting missing values using a regression model based on other variables; this can be thought of as a conditional mean.

3c. Multiple imputation involves predicting missing values multiple times to create several different imputed datasets, and combining the results. Values are sampled from their conditional probability distribution (which can in principle be correlated). This is generally preferred as it accounts for the uncertainty in the imputed values, but is the most costly.

In this workshop we consider multiple imputation using the mice package in R, on a non-trivial dataset.

0. Prerequisites

Here we define the libraries we need.

library("mice")
library("dplyr")
library("mice")
library("foreign") # to import Stata DTA files
library("car")

1. Summary of this workshop

Multiple imputation is towards the “gold standard” end of imputation options. The basic idea is that we can use a good probablistic model to predict missing values conditional on observed values, which creates an ensemble of datasets which can each be treated as if it were complete.

In this workshop we explore a multiple imputation problem for a modest sized, but real and complex, dataset. The goal is to understand how to implement this in practice, as well as to question what problems might arise.

1.1 References

This analysis is adapted from Getting Started with Multiple Imputation in R from the University of Virginia. Additional content from R bloggers introduction to the mice package, both of which depend on the Multiple Imputation package original paper.

That paper includes this explanation about the structure of a “multiple imputation” object:

Multiple Imputation explained

The keywords are:

multivariate imputation by chained equations (MICE, mice()) produces a
- multiply imputed dataset (mids)
Your standard analysis pipeline is then replicated with() each dataset to produce a
- multiply imputed repeated analysis (mira)
which can then be combined by a pool() to give a
- multiple imputed pooled outcomes (mipo)

2 Data - American National Election Survey 2012 (ANES)

The data comes from the American National Election survey. This is a large, complex dataset that involves merging 3 different files. The goal is to predict sentiment towards Hillary Clinton in 2012 based on individual properties (occupation, party id, nationalism, views on China’s economic rise) as well as what is happening within that state (number of Chinese Mergers and Acquisitions (M&A) activity, 2000-2012).

The hypothesis is that working in manufacturing is linked to being impacted by mergers in manufacturing industries and therefore that these influence views.

2.1 Downloading data

This section downloads the data and puts it into the “../data” directory, in a cross-platform way.

## Dowload the ANE survey data
if(!file.exists(file.path("..","data","anesimputation.dta"))) download.file("http://static.lib.virginia.edu/statlab/materials/data/anesimputation.dta", destfile=file.path("..","data","anesimputation.dta"))

## Dowload the occupation data
if(!file.exists(file.path("..","data","anesocc.csv"))) download.file("http://static.lib.virginia.edu/statlab/materials/data/anesocc.csv", destfile=file.path("..","data","anesocc.csv"))

## Download the mergers and acquisitions data
if(!file.exists(file.path("..","data","ma.dta"))) download.file("http://static.lib.virginia.edu/statlab/materials/data/ma.dta", destfile=file.path("..","data","ma.dta"))

A basic description of the data is that we will investigate the following independent variables:

Occupation (taken from ANES supplementary files): Dichotomous variables, 1 if the respondent works in manufacturing 0 if not
Party ID: Continuous index that ranges from 0 (Strong Democrat) to 6 (Strong Republican)
Nationalism: Continuous index that ranges from 0 (Not at all Important) to 4 (Extremely Important)
Views on China’s economic rise: Dichotomous variable, 0 Good/No Effect 1 Bad
The number of Chinese Mergers and Acquisitions activity (taken from the MA file): 2000-2012, Continuous variable that ranges from 0 to 60

Optionally: also download the Rdata intermediate file (otherwise this is quite slow):

## Dowload the ANE survey data
if(!file.exists(file.path("..","data","imp2.RData"))) download.file("https://github.com/dsbristol/dst/blob/master/data/imp2.RData", destfile=file.path("..","data","imp2.RData"))

2.2 Pre-process data

Here we will do some basic data processing for later.

The following code is retained from the reference and is not very clean. It tidies up the variables in the dataframe as well as possible with the understanding that many columns will be used for imputation even if they are not in our model.

## This is the section where we require foreign and car libraries.
library(dplyr)
library(mice)
library(foreign) # to import Stata DTA files
library(car)     # for recode

set.seed(145)
# Import ANES dataset
anesimp <- read.dta(file.path("..","data","anesimputation.dta"), 
                    convert.factors = FALSE, missing.type = TRUE)

# Dataset contains values <0. Code all of them as missing 
for(i in 1:ncol(anesimp)){
  anesimp[,i] <- ifelse(anesimp[,i]<0, NA, anesimp[,i]) 
}

# Add occupation variable 
anesocc <- read.csv(file.path("..","data","anesocc.csv"),sep=";",na.strings=c("","NA"))
# Selecting occupation now and industry now variables
anesocc2 <- anesocc %>%
  dplyr::select(caseid, dem_occnow, dem_indnow)

# Coding any text that includes "manu" in it as respondent working in
# manufacturing, excluding manuver

# to address an 'invalid' string in row 3507
anesocc2$dem_occnow <- iconv(anesocc2$dem_occnow, from="UTF-8", to="")

anesocc2 <- anesocc2 %>% 
  mutate(manuf = case_when((grepl("manu",dem_occnow)&!grepl("manuver",dem_occnow)) ~ 1,
                           grepl("manu",anesocc2$dem_indnow) ~ 1,
                           is.na(dem_occnow) ~ NA_real_,
                           is.na(dem_indnow) ~ NA_real_,
                           !is.na(dem_occnow) ~ 0,
                           !is.na(dem_indnow) ~ 0)
  )

anesocc2 <- anesocc2 %>% 
  dplyr::select(manuf)

# combining by columns as they are sorted in the same order
anesimp <- cbind(anesimp,anesocc2)

# Merge M&A data 

maimp <- read.dta(file.path("..","data","ma.dta"))
anesimp <- merge(x=anesimp, y=maimp, by=c("sample_state"))

# Recode variables 
anesimp$patriot_amident <- recode(anesimp$patriot_amident, 
                                  "5=0; 4=1; 3=2; 2=3; 1=4")

anesimp$econ_ecnext_x <- recode(anesimp$econ_ecnext_x, 
                                "1=0; 2=1; 3=2; 4=3; 5=4")

anesimp$pid_x <- recode(anesimp$pid_x, 
                        "1=0; 2=1; 3=2; 4=3; 5=4; 6=5; 7=6")

anesimp$dem_edugroup_x <- recode(anesimp$dem_edugroup_x, 
                                 "1=0; 2=1; 3=2; 4=3; 5=4")

# Treat manuf as a factor 
anesimp$manuf <- as.factor(anesimp$manuf)

anesimp is now is a format that is suitable for processing. The first thing to do is examine missingness. First we will examine overall missingness per feature:

p_missing <- unlist(lapply(anesimp, function(x) sum(is.na(x))))/nrow(anesimp)
par(mar=c(10,4,4,1))
barplot(sort(p_missing[p_missing > 0], decreasing = TRUE),horiz = FALSE,las=2,cex.names = 0.85)

We can see that some features are extremely missing. Additionally, some are high-dimensional factors, which behave badly for imputation (why?). So we remove them. We will also need to convert the state into a factor.

# Select out variables that could cause problems in the imputation process
anesimp <- anesimp %>% 
  dplyr::select(-interest_whovote2008,-prevote_primvwho, -prevote_intpresst,
                -relig_ident_1st,-iwrobspre_skintone,-iwrobspre_levinfo,
                -iwrobspre_intell, -iwrobspre_interest,-gayrt_discrev_x,
                -Statename)

anesimp$sample_state <- as.factor(anesimp$sample_state)

2.3 Choice of numeric or Factor representation

This worksheet switches some columns between numeric and factors for different purposes.

For imputation, we need to treat variables as what they are, i.e. integers or factors, so that the imputed data remain like this. However, for regression it is convenient to force integer factors to be numeric for simplicity.

# Save the dataframe as another object so that we can use the original dataframe
# for multiple imputation
anesimpor <- anesimp 

# Transform variables for regression
# Treat nationalism as continuous
anesimpor$patriot_amident <- as.numeric(anesimpor$patriot_amident)
# Treat party id as continuous 
anesimpor$pid_x <- as.numeric(anesimpor$pid_x)
# Treat china_econ as dichotomous 
anesimpor$china_econ <- recode(anesimpor$china_econ, "1=0; 3=0; 2=1")
anesimpor$china_econ <- as.factor(anesimpor$china_econ)

# Take the log of Chinese M&A variables - add a small number as variable
# contains 0s
anesimpor$LogMANO <- log(anesimpor$MANo+1.01)
# Treat party id as continuous

Q: The original paper added the 1.01 offset. Were they justified to do so? What would you do?

We can now fit using lm.

# Define the variables of interest (including the variable to be predicted)
myvariables=c("ft_hclinton","manuf","pid_x","patriot_amident","china_econ","MANo","LogMANO")
# Estimate an OLS regression
## This is how they do it:
## fitols <- lm(ft_hclinton ~ manuf + pid_x + patriot_amident + china_econ + LogMANO, data=anesimpor)
## But I'm using the `myvariables` list of things we care about, to use later. This allows a simpler way to specify the problem:
fitols <- lm(ft_hclinton ~ . , data=anesimpor[,myvariables])
summary(fitols)

## 
## Call:
## lm(formula = ft_hclinton ~ ., data = anesimpor[, myvariables])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -87.666 -14.281   2.214  15.662  70.170 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     77.32265    1.54275  50.120  < 2e-16 ***
## manuf1          -0.29930    3.18884  -0.094   0.9252    
## pid_x           -8.68356    0.16381 -53.011  < 2e-16 ***
## patriot_amident  1.82351    0.37026   4.925 8.75e-07 ***
## china_econ1     -3.67245    0.69945  -5.250 1.59e-07 ***
## MANo            -0.06979    0.03516  -1.985   0.0472 *  
## LogMANO          1.47696    0.57907   2.551   0.0108 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.85 on 4442 degrees of freedom
##   (1465 observations deleted due to missingness)
## Multiple R-squared:  0.4013, Adjusted R-squared:  0.4005 
## F-statistic: 496.3 on 6 and 4442 DF,  p-value: < 2.2e-16

Q: What fraction and by what procedure are data omitted? What could this do to the estimates?

Q: Do you believe this analysis?

3. Further exploration of missingness

The structure of the missingness dramatically impacts whether missingness matters, and whether imputation will be successful. In the next plot we try to visualise the features with more than 100 missing entries.

Note that we return to anesimp although for this analysis there should be no difference.

library("gplots")

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

tmissing=as.matrix(is.na(anesimp)+0)
heatmap.2(t(tmissing[,colSums(tmissing)>100]),margins =c(9,15),trace="none",cexRow = 1)

Thats for all variables. What about for those in the regression?

tmissing_toplot=as.matrix(is.na(anesimpor[,myvariables])+0)
heatmap.2(t(tmissing_toplot),margins =c(9,15),trace="none",cexRow = 1)

Q: What conclusions would you draw from this?

Q: What other visualisations of the structure of missingness could you think of?

4. Run multiple Imputation

We are now ready to start the multiple imputation process. The original post messily goes back to the anesimp dataset, which we’ll allow, though that creates some replication of code and is therefore “poor practice”.

First we get the data into the required format:

# Use anesimp2 as the raw dataset 

anesimp2 <- anesimp

# Treat variables as factors
anesimp2$patriot_amident <- as.factor(anesimp2$patriot_amident)
anesimp2$china_econ <- as.factor(anesimp2$china_econ)
anesimp2$pid_x <- as.factor(anesimp2$pid_x)
anesimp2$sample_state <- as.factor(anesimp2$sample_state)

Now we remove columns that are problematic for imputation. Some of these have high missingness, whilst others are high-dimensional factors.

Q: How would you determine which columns to remove? And how would you check that you removed the correct ones?

We are now able to run the mice code. However, specifying the model that each variable uses directly is awkward, so the example extracts the defaults by running with 0 iterations of imputation. It then sets the models where the defaults are not suitable.

# We run the mice code with 0 iterations in order to access and modify the internals

imp <- mice(anesimp2, maxit=0)

# Extract predictorMatrix and methods of imputation 

predM <- imp$predictorMatrix
meth <- imp$method

# Setting values of variables I'd like to leave out to 0 in the predictor matrix
predM[, c("sample_state")] <- 0
predM[, c("Total_mil")] <- 0
predM[, c("PriOwn_mil")] <- 0
predM[, c("GovValue_mil")] <- 0
predM[, c("PriOwn")] <- 0
predM[, c("GovOwn")] <- 0
predM[, c("MANo")] <- 0
predM[, c("manuf")] <- 0

# If you like, view the first few rows of the predictor matrix
# head(predM)

# Specify a separate imputation model for variables of interest 

# Ordered categorical variables 
poly <- c("patriot_amident", "pid_x")

# Dichotomous variable
log <- c("manuf")

# Unordered categorical variable 
poly2 <- c("china_econ")

# Turn their methods matrix into the specified imputation models
meth[poly] <- "polr"
meth[log] <- "logreg"
meth[poly2] <- "polyreg"

meth

##              sample_state       gender_respondent_x        interest_attention 
##                        ""                        ""                     "pmm" 
##        interest_following        interest_voted2008           prmedia_wkinews 
##                     "pmm"                     "pmm"                     "pmm" 
##           prmedia_wktvnws         prmedia_wkpaprnws           prmedia_wkrdnws 
##                     "pmm"                     "pmm"                     "pmm" 
##             prevote_primv             prevote_voted           prevote_intpres 
##                     "pmm"                     "pmm"                     "pmm" 
##             prevote_inths             congapp_job_x             presapp_track 
##                     "pmm"                     "pmm"                     "pmm" 
##             presapp_job_x            presapp_econ_x         presapp_foreign_x 
##                     "pmm"                     "pmm"                     "pmm" 
##          presapp_health_x             presapp_war_x                    ft_dpc 
##                     "pmm"                     "pmm"                     "pmm" 
##                    ft_rpc                   ft_dvpc                   ft_rvpc 
##                     "pmm"                     "pmm"                     "pmm" 
##               ft_hclinton                    ft_gwb                    ft_dem 
##                     "pmm"                     "pmm"                     "pmm" 
##                    ft_rep            finance_finfam         finance_finpast_x 
##                     "pmm"                     "pmm"                     "pmm" 
##         finance_finnext_x            health_insured          health_2010hcr_x 
##                     "pmm"                     "pmm"                     "pmm" 
##               health_self              health_smoke            candaff_angdpc 
##                     "pmm"                     "pmm"                     "pmm" 
##             candaff_hpdpc            candaff_afrdpc            candaff_prddpc 
##                     "pmm"                     "pmm"                     "pmm" 
##            candaff_angrpc             candaff_hprpc            candaff_afrrpc 
##                     "pmm"                     "pmm"                     "pmm" 
##            candaff_prdrpc              libcpre_self           divgov_splitgov 
##                     "pmm"                     "pmm"                     "pmm" 
##            campfin_banads               ineq_incgap             ineq_incgap_x 
##                     "pmm"                     "pmm"                     "pmm" 
##                econ_ecnow             econ_ecpast_x             econ_ecnext_x 
##                     "pmm"                     "pmm"                     "pmm" 
##             econ_unpast_x      preswin_dutychoice_x          usworld_stronger 
##                     "pmm"                     "pmm"                     "pmm" 
##              usworld_stay                     pid_x               war_worthit 
##                     "pmm"                    "polr"                     "pmm" 
##                war_terror               gun_control            gun_importance 
##                     "pmm"                     "pmm"                     "pmm" 
##              immig_policy             immig_citizen              immig_checks 
##                     "pmm"                     "pmm"                     "pmm" 
##                    aa_uni                  aa_uni_x                 aa_work_x 
##                     "pmm"                     "pmm"                     "pmm" 
##           trustgov_corrpt              relig_import       relig_mastersummary 
##                     "pmm"                     "pmm"                     "pmm" 
##               dem_age_r_x               dem_marital            dem_edugroup_x 
##                     "pmm"                     "pmm"                     "pmm" 
##               dem_veteran dem_empstatus_1digitfin_x               dem_unionhh 
##                     "pmm"                     "pmm"                     "pmm" 
##             dem_raceeth_x              dem_nativity             dem2_numchild 
##                     "pmm"                     "pmm"                     "pmm" 
##             dem2_cellpers            dem3_yearscomm              dem3_ownhome 
##                     "pmm"                     "pmm"                     "pmm" 
##             dem3_passport             wealth_stocks            wealth_vachome 
##                     "pmm"                     "pmm"                     "pmm" 
##          wealth_ownrental          wealth_investbus          inc_incgroup_pre 
##                     "pmm"                     "pmm"                     "pmm" 
##             owngun_owngun              orientn_rgay         preknow_prestimes 
##                     "pmm"                     "pmm"                     "pmm" 
##           preknow_sizedef           preknow_senterm          preknow_medicare 
##                     "pmm"                     "pmm"                     "pmm" 
##           preknow_leastsp           happ_lifesatisf           patriot_amident 
##                     "pmm"                     "pmm"                    "polr" 
##                china_econ             sample_stfips             sample_region 
##                 "polyreg"                        ""                        "" 
##                     manuf                      MANo                    GovOwn 
##                  "logreg"                        ""                        "" 
##                    PriOwn              GovValue_mil                PriOwn_mil 
##                        ""                        ""                        "" 
##                 Total_mil 
##                        ""

Q: How do we know what these models are? How can you find out about the models available, and whether the choices are appropriate?

Now we run the multiple imputation procedure. We’re only doing it with 5 samples to keep it fast (ish).

Q: What alternatives might exist that would be more computationally convenient at scale?

# With this command, we tell mice to impute the anesimp2 data, create 5
# datasets, use predM as the predictor matrix and don't print the imputation
# process. If you would like to see the process, set print as TRUE

# Because this is very slow, we are only going to do it once, and save the output for future.

if(!file.exists(file.path("..","data","imp2.RData"))){ 
  imp2 <- mice(anesimp2, maxit = 5, 
             predictorMatrix = predM, 
             method = meth, print =  FALSE)
  saveRDS(imp2,file=file.path("..","data","imp2.RData"))
}else{
  imp2 <- readRDS(file.path("..","data","imp2.RData"))
}

We can now extract the imputed datasets simply, using the complete function:

anescomp <- mice::complete(imp2, 1)
head(anescomp)

##   sample_state gender_respondent_x interest_attention interest_following
## 1           AK                   2                  4                  3
## 2           AK                   1                  3                  3
## 3           AK                   2                  2                  2
## 4           AL                   2                  4                  1
## 5           AL                   1                  1                  1
## 6           AL                   2                  2                  1
##   interest_voted2008 prmedia_wkinews prmedia_wktvnws prmedia_wkpaprnws
## 1                  1               1               0                 3
## 2                  1               6               6                 0
## 3                  1               7               7                 7
## 4                  1               0               7                 2
## 5                  1               1               7                 0
## 6                  1               2               7                 3
##   prmedia_wkrdnws prevote_primv prevote_voted prevote_intpres prevote_inths
## 1               6             1             2               1             1
## 2               0             1             2               1             1
## 3               0             2             2               1             1
## 4               7             1             2               1             1
## 5               0             1             2               1             1
## 6               7             1             2               1             1
##   congapp_job_x presapp_track presapp_job_x presapp_econ_x presapp_foreign_x
## 1             4             2             5              5                 4
## 2             4             2             5              5                 5
## 3             5             2             4              5                 5
## 4             5             1             1              2                 1
## 5             5             1             1              1                 1
## 6             5             1             1              1                 1
##   presapp_health_x presapp_war_x ft_dpc ft_rpc ft_dvpc ft_rvpc ft_hclinton
## 1                5             4     15     70      15      60           0
## 2                5             5      0     60       0      50          15
## 3                2             1     70     60      60      70          70
## 4                1             1    100      0     100       0         100
## 5                1             1    100      0     100       0         100
## 6                1             1    100      0     100       5         100
##   ft_gwb ft_dem ft_rep finance_finfam finance_finpast_x finance_finnext_x
## 1     85     40     60              2                 2                 3
## 2     60     15     70              1                 4                 3
## 3     50     40     40              0                 2                 3
## 4      0    100      0              1                 2                 3
## 5      0    100      0              0                 2                 1
## 6     15    100      0              0                 2                 1
##   health_insured health_2010hcr_x health_self health_smoke candaff_angdpc
## 1              1                6           2            2              1
## 2              1                6           3            1              1
## 3              1                4           3            1              2
## 4              1                1           2            1              2
## 5              1                1           3            2              2
## 6              1                1           2            2              2
##   candaff_hpdpc candaff_afrdpc candaff_prddpc candaff_angrpc candaff_hprpc
## 1             1              1              2              2             1
## 2             2              1              2              2             1
## 3             2              2              2              2             2
## 4             1              2              1              1             2
## 5             1              2              1              1             2
## 6             1              2              1              1             2
##   candaff_afrrpc candaff_prdrpc libcpre_self divgov_splitgov campfin_banads
## 1              2              1            5               2              1
## 2              2              2            6               3              1
## 3              2              2            4               2              3
## 4              1              2            1               3              2
## 5              1              2            1               1              2
## 6              1              2            4               1              2
##   ineq_incgap ineq_incgap_x econ_ecnow econ_ecpast_x econ_ecnext_x
## 1           3             3          3             3             1
## 2           1             1          5             4             2
## 3           1             2          4             2             3
## 4           1             2          2             2             0
## 5           1             1          3             2             0
## 6           1             1          2             1             0
##   econ_unpast_x preswin_dutychoice_x usworld_stronger usworld_stay pid_x
## 1             3                    4                1            1     6
## 2             4                    7                1            1     4
## 3             3                    1                1            1     2
## 4             2                    7                1            1     0
## 5             3                    7                3            1     0
## 6             1                    2                3            2     0
##   war_worthit war_terror gun_control gun_importance immig_policy immig_citizen
## 1           1          3           3              3            1             1
## 2           2          1           3              4            1             2
## 3           2          1           3              2            3             1
## 4           2          3           1              2            3             1
## 5           2          3           1              1            4             1
## 6           1          2           3              2            3             1
##   immig_checks aa_uni aa_uni_x aa_work_x trustgov_corrpt relig_import
## 1            1      2        6         6               2            2
## 2            1      3        4         4               2            2
## 3            1      3        4         4               4            1
## 4            1      3        4         4               3            1
## 5            3      1        1         1               2            1
## 6            1      2        7         2               4            1
##   relig_mastersummary dem_age_r_x dem_marital dem_edugroup_x dem_veteran
## 1                 149          47           1              1           2
## 2                 880          77           1              0           2
## 3                 199          72           3              2           2
## 4                 149          86           6              0           2
## 5                 229          79           4              0           1
## 6                 400          72           3              2           2
##   dem_empstatus_1digitfin_x dem_unionhh dem_raceeth_x dem_nativity
## 1                         1           2             1            1
## 2                         5           2             1            1
## 3                         5           2             1            1
## 4                         5           2             2            1
## 5                         5           2             2            1
## 6                         1           2             2            1
##   dem2_numchild dem2_cellpers dem3_yearscomm dem3_ownhome dem3_passport
## 1             0             1             37            1             1
## 2             0             2             29            1             1
## 3             0             1             33            1             2
## 4             0             2             13            1             2
## 5             0             1             19            1             2
## 6             0             1             40            1             2
##   wealth_stocks wealth_vachome wealth_ownrental wealth_investbus
## 1             1              2                2                2
## 2             1              2                1                2
## 3             1              2                2                2
## 4             2              2                2                2
## 5             2              2                2                2
## 6             2              2                1                2
##   inc_incgroup_pre owngun_owngun orientn_rgay preknow_prestimes preknow_sizedef
## 1               24             1            1                 2               1
## 2               15             1            1                 2               1
## 3               14             1            1                 2               1
## 4                2             2            1                 2               2
## 5                1             1            1                 6               1
## 6                3             1            1                 2               1
##   preknow_senterm preknow_medicare preknow_leastsp happ_lifesatisf
## 1               4                1               2               2
## 2               6                1               3               3
## 3               6                1               4               3
## 4               4                1               3               1
## 5               3                1               1               1
## 6               8                1               3               2
##   patriot_amident china_econ sample_stfips sample_region manuf MANo GovOwn
## 1               2          0             2             4     0    1      1
## 2               3          1             2             4     0    1      1
## 3               4          1             2             4     0    1      1
## 4               2          1             1             3     0    1      1
## 5               3          0             1             3     0    1      1
## 6               4          0             1             3     0    1      1
##   PriOwn GovValue_mil PriOwn_mil Total_mil   LogMANO
## 1      0            5          0         5 0.6981347
## 2      0            5          0         5 0.6981347
## 3      0            5          0         5 0.6981347
## 4      0          186          0       186 0.6981347
## 5      0          186          0       186 0.6981347
## 6      0          186          0       186 0.6981347

The whole dataset is massive and unwieldy, but we can look at just the columns of interest when we reconstruct them (below).

Another important thing to check is the structure of the imputed data. The next way to access the imputed values extracts just imputed values for each of the 5 replicates.

# Look at some imputed values for china_econ variable 
head(imp2$imp$china_econ)

##    1 2 3 4 5
## 4  1 1 1 0 1
## 6  0 1 0 0 1
## 7  1 1 1 0 1
## 17 1 1 1 1 0
## 26 1 0 0 0 1
## 36 1 0 1 0 0

The row labels tell us which row the imputed values belong to. How could you investigate the rest of the data structure? Is it a list or a data.frame?

Q: What does the variability we observe above tell you about the imputation process? Is it what we expect?

Now we have 5 copies of our data stored in mice “mids” format. We’d like to repeat our data transformations from before, so we convert back to a data frame:

# First, turn the datasets into long format
anesimp_long <- mice::complete(imp2, action="long", include = TRUE)
print(paste("Length of data is", dim(imp2$data)[1]))

## [1] "Length of data is 5914"

print(paste("Length of long format is",dim(anesimp_long)[1]))

## [1] "Length of long format is 35484"

# Convert two variables into numeric
anesimp_long$patriot_amident <- with(anesimp_long, 
                                     as.integer(anesimp_long$patriot_amident))
anesimp_long$pid_x <- with(anesimp_long, 
                           as.integer(anesimp_long$pid_x))
anesimp_long$prevote_inths <- with(anesimp_long, 
                           as.integer(anesimp_long$prevote_inths-1))


# Take log of M&A variable 
anesimp_long$LogMANO<-log(anesimp_long$MANo+1.01)

# Convert back to mids type - mice can work with this type
anesimp_long_mids<-as.mids(anesimp_long)

We can now access the complete dataset. We’ll examine one of the sample datasets for the variables of interest, to check what it looks like:

myvariables_imp2=c("ft_hclinton","manuf","pid_x","patriot_amident","china_econ","MANo")
anescomp_long_sample <- mice::complete(imp2, 1)
myvariables_imp2 %in% colnames(anescomp_long_sample)

## [1] TRUE TRUE TRUE TRUE TRUE TRUE

head(anescomp_long_sample[,myvariables_imp2])

##   ft_hclinton manuf pid_x patriot_amident china_econ MANo
## 1           0     0     6               2          0    1
## 2          15     0     4               3          1    1
## 3          70     0     2               4          1    1
## 4         100     0     0               2          1    1
## 5         100     0     0               3          0    1
## 6         100     0     0               4          0    1

Q: How do you interpret this? Is the LogMANO variable as expected?

We can now, finally, undertake the intended regression analysis. The command below fits one regression with each of the 5 datasets, and pools the answer:

# Regression 
fitimp <- with(anesimp_long_mids,
               lm(ft_hclinton ~ manuf + pid_x +
                    patriot_amident + china_econ + LogMANO))

summary(pool(fitimp))

##              term   estimate std.error   statistic        df       p.value
## 1     (Intercept) 83.4680590 1.6242869  51.3875101  899.5402 7.429171e-270
## 2          manuf1 -1.9645291 2.7415178  -0.7165845  223.1789  4.743797e-01
## 3           pid_x -8.3909245 0.1446329 -58.0153367 5750.0970  0.000000e+00
## 4 patriot_amident  1.9520311 0.3336326   5.8508407  946.9684  6.733117e-09
## 5     china_econ1 -3.5691841 0.6264011  -5.6979209 1280.0548  1.503390e-08
## 6         LogMANO  0.5750746 0.2577293   2.2313124 5877.0254  2.569801e-02

Now we will compare the estimates:

cbind(raw_estimate=summary(fitols)$coefficients[,1],
      imputed_estimate=summary(pool(fitimp))$estimate,
      raw_pval=summary(fitols)$coefficients[,4],
      imputed_pval=summary(pool(fitimp))$p.value)

## Warning in base::cbind(...): number of rows of result is not a multiple of
## vector length (arg 2)

##                 raw_estimate imputed_estimate     raw_pval  imputed_pval
## (Intercept)      77.32265107       83.4680590 0.000000e+00 7.429171e-270
## manuf1           -0.29929899       -1.9645291 9.252260e-01  4.743797e-01
## pid_x            -8.68356320       -8.3909245 0.000000e+00  0.000000e+00
## patriot_amident   1.82351009        1.9520311 8.747126e-07  6.733117e-09
## china_econ1      -3.67244855       -3.5691841 1.587939e-07  1.503390e-08
## MANo             -0.06978578        0.5750746 4.724644e-02  2.569801e-02
## LogMANO           1.47696022       83.4680590 1.078746e-02 7.429171e-270

And.. we’re done!

5. Evaluation

… except, how do we know if it worked?

Q: What conclusions can we draw from this analysis about the effect of the covariates on the outcome?

Q: How do we know that imputation has done a good job?

Q: If the imputation went badly, how wrong could our estimates be?