if(!require("mice")) install.packages("mice")
## Loading required package: mice
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
if(!require("dplyr")) install.packages("dplyr")
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
if(!require("foreign")) install.packages("foreign")
## Loading required package: foreign
if(!require("car")) install.packages("car")
## Loading required package: car
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library("mice")
library("dplyr")
library("mice")
library("foreign") # to import Stata DTA files
library("car")
Multiple imputation is towards the “gold standard” end of imputation options. The basic idea is that we can use a good probabilistic model to predict missing values conditional on observed values, creating an ensemble of datasets, each of which can be treated as if it were complete.
This analysis is adapted from Getting Started with Multiple Imputation in R from the University of Virginia, with additional content from the R-bloggers introduction to the mice package; both draw on the original paper for the mice multiple imputation package.
That paper includes this explanation about the structure of a “multiple imputation” object:
The keywords are:
* multivariate imputation by chained equations (MICE, mice()) produces a multiply imputed dataset (mids);
* your standard analysis pipeline is then replicated with() each dataset to produce a multiply imputed repeated analysis (mira);
* which can then be combined by pool() to give multiple imputed pooled outcomes (mipo).
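To make these objects concrete, here is a minimal sketch of the whole pipeline using the tiny nhanes dataset that ships with mice (the toy variables bmi, age and chl come from that example data, not from our analysis):
# A minimal sketch of the mice workflow on the toy nhanes data shipped with mice
toy_imp  <- mice(nhanes, m = 5, print = FALSE)   # mids: 5 imputed copies of the data
toy_fit  <- with(toy_imp, lm(bmi ~ age + chl))   # mira: the same lm fitted to each copy
toy_pool <- pool(toy_fit)                        # mipo: estimates pooled by Rubin's rules
summary(toy_pool)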
The data come from the American National Election Studies (ANES) survey. This is a large, complex dataset that involves merging 3 different files. The goal is to predict sentiment towards Hillary Clinton in 2012 based on individual properties (occupation, party id, nationalism, views on China’s economic rise) as well as what is happening within the respondent’s state (the amount of Chinese Mergers and Acquisitions (M&A) activity, 2000-2012).
The hypothesis is that working in manufacturing is linked to being impacted by mergers in manufacturing industries and therefore that these influence views.
This section downloads the data and puts it into the “../data” directory, in a cross-platform way.
if(!file.exists(file.path("..","data","anesimputation.dta"))) download.file("http://static.lib.virginia.edu/statlab/materials/data/anesimputation.dta", destfile=file.path("..","data","anesimputation.dta"))
if(!file.exists(file.path("..","data","anesocc.csv"))) download.file("http://static.lib.virginia.edu/statlab/materials/data/anesocc.csv", destfile=file.path("..","data","anesocc.csv"))
if(!file.exists(file.path("..","data","ma.dta"))) download.file("http://static.lib.virginia.edu/statlab/materials/data/ma.dta", destfile=file.path("..","data","ma.dta"))
A basic description of the data: we will investigate the following independent variables: working in manufacturing (manuf), party identification (pid_x), nationalism (patriot_amident), views on China’s economic rise (china_econ), and the logged amount of Chinese M&A activity in the respondent’s state (LogMANO). The outcome is sentiment towards Hillary Clinton (ft_hclinton).
Here we do some basic data processing that we will need later.
The following code is retained from the reference and is not very clean. It tidies up the variables in the dataframe as well as possible, with the understanding that many columns will be used for imputation even if they are not in our model.
## This is the section where we require foreign and car libraries.
library(dplyr)
library(mice)
library(foreign) # to import Stata DTA files
library(car) # for recode
set.seed(145)
# Import ANES dataset
anesimp <- read.dta(file.path("..","data","anesimputation.dta"),
convert.factors = FALSE, missing.type = TRUE)
# Dataset contains values <0. Code all of them as missing
for(i in 1:ncol(anesimp)){
  anesimp[,i] <- ifelse(anesimp[,i]<0, NA, anesimp[,i])
}
# Add occupation variable
anesocc <- read.csv(file.path("..","data","anesocc.csv"),sep=";",na.strings=c("","NA"))
# Selecting occupation now and industry now variables
anesocc2 <- anesocc %>%
dplyr::select(caseid, dem_occnow, dem_indnow)
# Coding any text that includes "manu" in it as respondent working in
# manufacturing, excluding manuver
# to address an 'invalid' string in row 3507
anesocc2$dem_occnow <- iconv(anesocc2$dem_occnow, from="UTF-8", to="")
anesocc2 <- anesocc2 %>%
  mutate(manuf = case_when((grepl("manu",dem_occnow)&!grepl("manuver",dem_occnow)) ~ 1,
                           grepl("manu",anesocc2$dem_indnow) ~ 1,
                           is.na(dem_occnow) ~ NA_real_,
                           is.na(dem_indnow) ~ NA_real_,
                           !is.na(dem_occnow) ~ 0,
                           !is.na(dem_indnow) ~ 0)
  )
anesocc2 <- anesocc2 %>%
dplyr::select(manuf)
# combining by columns as they are sorted in the same order
anesimp <- cbind(anesimp,anesocc2)
# Merge M&A data
maimp <- read.dta(file.path("..","data","ma.dta"))
anesimp <- merge(x=anesimp, y=maimp, by=c("sample_state"))
# Recode variables
anesimp$patriot_amident <- recode(anesimp$patriot_amident,
"5=0; 4=1; 3=2; 2=3; 1=4")
anesimp$econ_ecnext_x <- recode(anesimp$econ_ecnext_x,
"1=0; 2=1; 3=2; 4=3; 5=4")
anesimp$pid_x <- recode(anesimp$pid_x,
"1=0; 2=1; 3=2; 4=3; 5=4; 6=5; 7=6")
anesimp$dem_edugroup_x <- recode(anesimp$dem_edugroup_x,
"1=0; 2=1; 3=2; 4=3; 5=4")
# Treat manuf as a factor
anesimp$manuf <- as.factor(anesimp$manuf)
anesimp is now in a format that is suitable for processing. The first thing to do is examine missingness; we start with the overall missingness per feature:
p_missing <- unlist(lapply(anesimp, function(x) sum(is.na(x))))/nrow(anesimp)
par(mar=c(10,4,4,1))
barplot(sort(p_missing[p_missing > 0], decreasing = TRUE),horiz = FALSE,las=2,cex.names = 0.85)
We can see that some features have extremely high missingness. Additionally, some are high-dimensional factors, which behave badly in imputation, so we remove them. We will also need to convert the state into a factor.
# Select out variables that could cause problems in the imputation process
anesimp <- anesimp %>%
  dplyr::select(-interest_whovote2008, -prevote_primvwho, -prevote_intpresst,
                -relig_ident_1st, -iwrobspre_skintone, -iwrobspre_levinfo,
                -iwrobspre_intell, -iwrobspre_interest, -gayrt_discrev_x,
                -Statename)
anesimp$sample_state <- as.factor(anesimp$sample_state)
This worksheet switches some columns between numeric and factor representations for different purposes.
For imputation, we need to treat variables as what they are, i.e. integers or factors, so that the imputed values retain those types. However, for regression it is convenient to force integer-valued factors to be numeric for simplicity.
# Save the dataframe as another object so that we can use the original dataframe
# for multiple imputation
anesimpor <- anesimp
# Transform variables for regression
# Treat nationalism as continuous
anesimpor$patriot_amident <- as.numeric(anesimpor$patriot_amident)
# Treat party id as continuous
anesimpor$pid_x <- as.numeric(anesimpor$pid_x)
# Treat china_econ as dichotomous
anesimpor$china_econ <- recode(anesimpor$china_econ, "1=0; 3=0; 2=1")
anesimpor$china_econ <- as.factor(anesimpor$china_econ)
# Take the log of Chinese M&A variables - add a small number as variable
# contains 0s
anesimpor$LogMANO <- log(anesimpor$MANo+1.01)
Q: The original paper added the 1.01 offset. Were they justified to do so? What would you do?
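As a quick, hedged check (not part of the original analysis): log1p() gives the more conventional log(x + 1), and comparing it with the 1.01 version shows how much the choice of offset actually moves the transformed variable.
# How different is log(MANo + 1.01) from log1p(MANo) = log(MANo + 1)?
alt_log <- log1p(anesimpor$MANo)
summary(anesimpor$LogMANO - alt_log)             # the differences shrink as MANo grows
cor(anesimpor$LogMANO, alt_log, use = "complete.obs")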
We can now fit the model using lm().
# Define the variables of interest (including the variable to be predicted)
myvariables=c("ft_hclinton","manuf","pid_x","patriot_amident","china_econ","LogMANO")
# Estimate an OLS regression
## This is how they do it:
## fitols <- lm(ft_hclinton ~ manuf + pid_x + patriot_amident + china_econ + LogMANO, data=anesimpor)
## But I'm using the `myvariables` list of things we care about, to use later. This allows a simpler way to specify the problem:
fitols <- lm(ft_hclinton ~ . , data=anesimpor[,myvariables])
summary(fitols)
##
## Call:
## lm(formula = ft_hclinton ~ ., data = anesimpor[, myvariables])
##
## Residuals:
## Min 1Q Median 3Q Max
## -86.972 -14.316 2.205 15.778 69.791
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 78.2653 1.4683 53.303 < 2e-16 ***
## manuf1 -0.2387 3.1897 -0.075 0.9404
## pid_x -8.6788 0.1638 -52.970 < 2e-16 ***
## patriot_amident 1.8372 0.3703 4.961 7.27e-07 ***
## china_econ1 -3.6606 0.6997 -5.232 1.75e-07 ***
## LogMANO 0.4896 0.2964 1.652 0.0987 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.85 on 4443 degrees of freedom
## (1465 observations deleted due to missingness)
## Multiple R-squared: 0.4008, Adjusted R-squared: 0.4001
## F-statistic: 594.4 on 5 and 4443 DF, p-value: < 2.2e-16
Q: What fraction and by what procedure are data omitted? What could this do to the estimates?
Q: Do you believe this analysis?
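A quick way to answer the first question is to count complete cases on just the model variables (a sketch using the myvariables vector defined above):
# How many rows survive listwise deletion on the model variables,
# and what fraction does lm() silently drop?
table(complete.cases(anesimpor[, myvariables]))
mean(!complete.cases(anesimpor[, myvariables]))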
The structure of the missingness dramatically impacts whether missingness matters, and whether imputation will be successful. In the next plot we try to visualise the features with more than 100 missing entries.
Note that we return to anesimp, although for this analysis there should be no difference.
library("gplots")
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
tmissing=as.matrix(is.na(anesimp)+0)
heatmap.2(t(tmissing[,colSums(tmissing)>100]),margins =c(9,15),trace="none",cexRow = 1)
That's for all variables. What about for those in the regression?
tmissing_toplot=as.matrix(is.na(anesimpor[,myvariables])+0)
heatmap.2(t(tmissing_toplot),margins =c(9,15),trace="none",cexRow = 1)
Q: What conclusions would you draw from this?
Q: What other visualisations of the structure of missingness could you think of?
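Two possibilities, sketched below: mice's own md.pattern() tabulates the joint missingness patterns, and correlating the missingness indicators shows which variables tend to be missing together. Both are restricted to the model variables so the output stays readable.
# Joint missingness patterns for the model variables (mice built-in)
md.pattern(anesimpor[, myvariables])
# Do variables tend to be missing together? Correlate the missingness indicators
miss_ind <- is.na(anesimpor[, myvariables]) + 0
round(cor(miss_ind[, colSums(miss_ind) > 0]), 2)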
We are now ready to start the multiple imputation process. The original post messily goes back to the anesimp dataset, which we’ll allow, though that creates some replication of code and is therefore “poor practice”.
First we get the data into the required format:
# Use anesimp2 as the raw dataset
anesimp2 <- anesimp
# Treat variables as factors
anesimp2$patriot_amident <- as.factor(anesimp2$patriot_amident)
anesimp2$china_econ <- as.factor(anesimp2$china_econ)
anesimp2$pid_x <- as.factor(anesimp2$pid_x)
anesimp2$sample_state <- as.factor(anesimp2$sample_state)
Recall that above we removed columns that are problematic for imputation: some had high missingness, whilst others were high-dimensional factors.
Q: How would you determine which columns to remove? And how would you check that you removed the correct ones?
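One (hedged) approach is programmatic screening: flag columns whose missingness fraction exceeds some threshold, and factor columns with very many levels, which make polytomous imputation models expensive. The 0.25 and 20 cut-offs below are arbitrary illustrative choices, not from the original analysis.
# Columns with more than 25% missing values (p_missing was computed above)
names(which(p_missing > 0.25))
# Factor columns with many levels in the dataset we are about to impute
fac_levels <- sapply(Filter(is.factor, anesimp2), nlevels)
fac_levels[fac_levels > 20]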
We are now able to run the mice code. However, specifying the model that each variable uses directly is awkward, so the example extracts the defaults by running with 0 iterations of imputation. It then sets the models where the defaults are not suitable.
# We run the mice code with 0 iterations in order to access and modify the internals
imp <- mice(anesimp2, maxit=0)
# Extract predictorMatrix and methods of imputation
predM <- imp$predictorMatrix
meth <- imp$method
# Setting values of variables I'd like to leave out to 0 in the predictor matrix
predM[, c("sample_state")] <- 0
predM[, c("Total_mil")] <- 0
predM[, c("PriOwn_mil")] <- 0
predM[, c("GovValue_mil")] <- 0
predM[, c("PriOwn")] <- 0
predM[, c("GovOwn")] <- 0
predM[, c("MANo")] <- 0
predM[, c("manuf")] <- 0
# If you like, view the first few rows of the predictor matrix
# head(predM)
# Specify a separate imputation model for variables of interest
# Ordered categorical variables
poly <- c("patriot_amident", "pid_x")
# Dichotomous variable
log <- c("manuf")
# Unordered categorical variable
poly2 <- c("china_econ")
# Turn their methods matrix into the specified imputation models
meth[poly] <- "polr"
meth[log] <- "logreg"
meth[poly2] <- "polyreg"
meth
## sample_state gender_respondent_x interest_attention
## "" "" "pmm"
## interest_following interest_voted2008 prmedia_wkinews
## "pmm" "pmm" "pmm"
## prmedia_wktvnws prmedia_wkpaprnws prmedia_wkrdnws
## "pmm" "pmm" "pmm"
## prevote_primv prevote_voted prevote_intpres
## "pmm" "pmm" "pmm"
## prevote_inths congapp_job_x presapp_track
## "pmm" "pmm" "pmm"
## presapp_job_x presapp_econ_x presapp_foreign_x
## "pmm" "pmm" "pmm"
## presapp_health_x presapp_war_x ft_dpc
## "pmm" "pmm" "pmm"
## ft_rpc ft_dvpc ft_rvpc
## "pmm" "pmm" "pmm"
## ft_hclinton ft_gwb ft_dem
## "pmm" "pmm" "pmm"
## ft_rep finance_finfam finance_finpast_x
## "pmm" "pmm" "pmm"
## finance_finnext_x health_insured health_2010hcr_x
## "pmm" "pmm" "pmm"
## health_self health_smoke candaff_angdpc
## "pmm" "pmm" "pmm"
## candaff_hpdpc candaff_afrdpc candaff_prddpc
## "pmm" "pmm" "pmm"
## candaff_angrpc candaff_hprpc candaff_afrrpc
## "pmm" "pmm" "pmm"
## candaff_prdrpc libcpre_self divgov_splitgov
## "pmm" "pmm" "pmm"
## campfin_banads ineq_incgap ineq_incgap_x
## "pmm" "pmm" "pmm"
## econ_ecnow econ_ecpast_x econ_ecnext_x
## "pmm" "pmm" "pmm"
## econ_unpast_x preswin_dutychoice_x usworld_stronger
## "pmm" "pmm" "pmm"
## usworld_stay pid_x war_worthit
## "pmm" "polr" "pmm"
## war_terror gun_control gun_importance
## "pmm" "pmm" "pmm"
## immig_policy immig_citizen immig_checks
## "pmm" "pmm" "pmm"
## aa_uni aa_uni_x aa_work_x
## "pmm" "pmm" "pmm"
## trustgov_corrpt relig_import relig_mastersummary
## "pmm" "pmm" "pmm"
## dem_age_r_x dem_marital dem_edugroup_x
## "pmm" "pmm" "pmm"
## dem_veteran dem_empstatus_1digitfin_x dem_unionhh
## "pmm" "pmm" "pmm"
## dem_raceeth_x dem_nativity dem2_numchild
## "pmm" "pmm" "pmm"
## dem2_cellpers dem3_yearscomm dem3_ownhome
## "pmm" "pmm" "pmm"
## dem3_passport wealth_stocks wealth_vachome
## "pmm" "pmm" "pmm"
## wealth_ownrental wealth_investbus inc_incgroup_pre
## "pmm" "pmm" "pmm"
## owngun_owngun orientn_rgay preknow_prestimes
## "pmm" "pmm" "pmm"
## preknow_sizedef preknow_senterm preknow_medicare
## "pmm" "pmm" "pmm"
## preknow_leastsp happ_lifesatisf patriot_amident
## "pmm" "pmm" "polr"
## china_econ sample_stfips sample_region
## "polyreg" "" ""
## manuf MANo GovOwn
## "logreg" "" ""
## PriOwn GovValue_mil PriOwn_mil
## "" "" ""
## Total_mil
## ""
Q: How do we know what these models are? How can you find out about the models available, and whether the choices are appropriate?
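The univariate imputation methods are implemented in mice as mice.impute.* functions, each with its own help page, so one way to explore them is simply to list them (a sketch, assuming mice 3.x):
# List the built-in univariate imputation methods available in the installed mice
ls("package:mice", pattern = "^mice\\.impute\\.")
# Each has documentation, e.g. ?mice.impute.pmm, ?mice.impute.polr, ?mice.impute.logreg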
Now we run the multiple imputation procedure. We’re only creating 5 imputed datasets (with 5 iterations each) to keep it fast(ish).
Q: What alternatives might exist that would be more computationally convenient at scale?
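One cheap option, sketched below, is mice's quickpred(), which shrinks the predictor matrix so that each incomplete variable is imputed only from predictors it is at least modestly correlated with (the mincor threshold here is an arbitrary choice):
# Build a reduced predictor matrix: keep only predictors correlated with the target
# (or with its missingness indicator) at roughly |r| >= 0.1
predM_quick <- quickpred(anesimp2, mincor = 0.1)
# It could then be passed to mice() in place of predM (not run here, as it is still slow):
# imp_quick <- mice(anesimp2, predictorMatrix = predM_quick, method = meth, print = FALSE)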
# With this command, we tell mice to impute the anesimp2 data, create 5
# datasets, use predM as the predictor matrix and don't print the imputation
# process. If you would like to see the process, set print as TRUE
# Because this is very slow, we are only going to do it once, and save the output for future runs.
if(!file.exists(file.path("..","data","imp2.RData"))){
  imp2 <- mice(anesimp2, maxit = 5,
               predictorMatrix = predM,
               method = meth, print = FALSE)
  saveRDS(imp2, file=file.path("..","data","imp2.RData"))
}else{
  imp2 <- readRDS(file.path("..","data","imp2.RData"))
}
We can now extract the imputed datasets simply, using the complete() function:
anescomp <- mice::complete(imp2, 1)
head(anescomp)
## sample_state gender_respondent_x interest_attention interest_following
## 1 AK 2 4 3
## 2 AK 1 3 3
## 3 AK 2 2 2
## 4 AL 2 4 1
## 5 AL 1 1 1
## 6 AL 2 2 1
## interest_voted2008 prmedia_wkinews prmedia_wktvnws prmedia_wkpaprnws
## 1 1 1 0 3
## 2 1 6 6 0
## 3 1 7 7 7
## 4 1 0 7 2
## 5 1 1 7 0
## 6 1 2 7 3
## prmedia_wkrdnws prevote_primv prevote_voted prevote_intpres prevote_inths
## 1 6 1 2 1 1
## 2 0 1 2 1 1
## 3 0 2 2 1 1
## 4 7 1 2 1 1
## 5 0 1 2 1 1
## 6 7 1 2 1 1
## congapp_job_x presapp_track presapp_job_x presapp_econ_x presapp_foreign_x
## 1 4 2 5 5 4
## 2 4 2 5 5 5
## 3 5 2 4 5 5
## 4 5 1 1 2 1
## 5 5 1 1 1 1
## 6 5 1 1 1 1
## presapp_health_x presapp_war_x ft_dpc ft_rpc ft_dvpc ft_rvpc ft_hclinton
## 1 5 4 15 70 15 60 0
## 2 5 5 0 60 0 50 15
## 3 2 1 70 60 60 70 70
## 4 1 1 100 0 100 0 100
## 5 1 1 100 0 100 0 100
## 6 1 1 100 0 100 5 100
## ft_gwb ft_dem ft_rep finance_finfam finance_finpast_x finance_finnext_x
## 1 85 40 60 2 2 3
## 2 60 15 70 1 4 3
## 3 50 40 40 0 2 3
## 4 0 100 0 1 2 3
## 5 0 100 0 0 2 1
## 6 15 100 0 0 2 1
## health_insured health_2010hcr_x health_self health_smoke candaff_angdpc
## 1 1 6 2 2 1
## 2 1 6 3 1 1
## 3 1 4 3 1 2
## 4 1 1 2 1 2
## 5 1 1 3 2 2
## 6 1 1 2 2 2
## candaff_hpdpc candaff_afrdpc candaff_prddpc candaff_angrpc candaff_hprpc
## 1 1 1 2 2 1
## 2 2 1 2 2 1
## 3 2 2 2 2 2
## 4 1 2 1 1 2
## 5 1 2 1 1 2
## 6 1 2 1 1 2
## candaff_afrrpc candaff_prdrpc libcpre_self divgov_splitgov campfin_banads
## 1 2 1 5 2 1
## 2 2 2 6 3 1
## 3 2 2 4 2 3
## 4 1 2 1 3 2
## 5 1 2 1 1 2
## 6 1 2 4 1 2
## ineq_incgap ineq_incgap_x econ_ecnow econ_ecpast_x econ_ecnext_x
## 1 3 3 3 3 1
## 2 1 1 5 4 2
## 3 1 2 4 2 3
## 4 1 2 2 2 0
## 5 1 1 3 2 0
## 6 1 1 2 1 0
## econ_unpast_x preswin_dutychoice_x usworld_stronger usworld_stay pid_x
## 1 3 4 1 1 6
## 2 4 7 1 1 4
## 3 3 1 1 1 2
## 4 2 7 1 1 0
## 5 3 7 3 1 0
## 6 1 2 3 2 0
## war_worthit war_terror gun_control gun_importance immig_policy immig_citizen
## 1 1 3 3 3 1 1
## 2 2 1 3 4 1 2
## 3 2 1 3 2 3 1
## 4 2 3 1 2 3 1
## 5 2 3 1 1 4 1
## 6 1 2 3 2 3 1
## immig_checks aa_uni aa_uni_x aa_work_x trustgov_corrpt relig_import
## 1 1 2 6 6 2 2
## 2 1 3 4 4 2 2
## 3 1 3 4 4 4 1
## 4 1 3 4 4 3 1
## 5 3 1 1 1 2 1
## 6 1 2 7 2 4 1
## relig_mastersummary dem_age_r_x dem_marital dem_edugroup_x dem_veteran
## 1 149 47 1 1 2
## 2 880 77 1 0 2
## 3 199 72 3 2 2
## 4 149 86 6 0 2
## 5 229 79 4 0 1
## 6 400 72 3 2 2
## dem_empstatus_1digitfin_x dem_unionhh dem_raceeth_x dem_nativity
## 1 1 2 1 1
## 2 5 2 1 1
## 3 5 2 1 1
## 4 5 2 2 1
## 5 5 2 2 1
## 6 1 2 2 1
## dem2_numchild dem2_cellpers dem3_yearscomm dem3_ownhome dem3_passport
## 1 0 1 37 1 1
## 2 0 2 29 1 1
## 3 0 1 33 1 2
## 4 0 2 13 1 2
## 5 0 1 19 1 2
## 6 0 1 40 1 2
## wealth_stocks wealth_vachome wealth_ownrental wealth_investbus
## 1 1 2 2 2
## 2 1 2 1 2
## 3 1 2 2 2
## 4 2 2 2 2
## 5 2 2 2 2
## 6 2 2 1 2
## inc_incgroup_pre owngun_owngun orientn_rgay preknow_prestimes preknow_sizedef
## 1 24 1 1 2 1
## 2 15 1 1 2 1
## 3 14 1 1 2 1
## 4 2 2 1 2 2
## 5 1 1 1 6 1
## 6 3 1 1 2 1
## preknow_senterm preknow_medicare preknow_leastsp happ_lifesatisf
## 1 4 1 2 2
## 2 6 1 3 3
## 3 6 1 4 3
## 4 4 1 3 1
## 5 3 1 1 1
## 6 8 1 3 2
## patriot_amident china_econ sample_stfips sample_region manuf MANo GovOwn
## 1 2 0 2 4 0 1 1
## 2 3 1 2 4 0 1 1
## 3 4 1 2 4 0 1 1
## 4 2 1 1 3 0 1 1
## 5 3 0 1 3 0 1 1
## 6 4 0 1 3 0 1 1
## PriOwn GovValue_mil PriOwn_mil Total_mil LogMANO
## 1 0 5 0 5 0.6981347
## 2 0 5 0 5 0.6981347
## 3 0 5 0 5 0.6981347
## 4 0 186 0 186 0.6981347
## 5 0 186 0 186 0.6981347
## 6 0 186 0 186 0.6981347
The whole dataset is massive and unwieldy, but we can look at just the columns of interest when we reconstruct them (below).
Another important thing to check is the structure of the imputed data. A second way to access the imputed values is to extract just the imputed values for each of the 5 replicates:
# Look at some imputed values for china_econ variable
head(imp2$imp$china_econ)
## 1 2 3 4 5
## 4 1 1 1 0 1
## 6 0 1 0 0 1
## 7 1 1 1 0 1
## 17 1 1 1 1 0
## 26 1 0 0 0 1
## 36 1 0 1 0 0
The row labels tell us which rows of the original data the imputed values belong to. Q: How could you investigate the rest of the imp2 data structure? Is it a list or a data.frame?
Q: What does the variability we observe above tell you about the imputation process? Is it what we expect?
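One way to look at this more systematically is with the lattice-based diagnostic plots that come with mice, which overlay imputed values on observed ones (a sketch for one continuous variable):
# Observed versus imputed values for the outcome variable (imputed shown in red by default)
densityplot(imp2, ~ ft_hclinton)
# Strip plot of the same variable across the imputed datasets
stripplot(imp2, ft_hclinton ~ .imp, pch = 20, cex = 0.5)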
Now we have 5 copies of our data stored in mice “mids” format. We’d like to repeat our data transformations from before, so we convert back to a data frame:
# First, turn the datasets into long format
anesimp_long <- mice::complete(imp2, action="long", include = TRUE)
print(paste("Length of data is", dim(imp2$data)[1]))
## [1] "Length of data is 5914"
print(paste("Length of long format is",dim(anesimp_long)[1]))
## [1] "Length of long format is 35484"
# Convert two variables into numeric
anesimp_long$patriot_amident <- with(anesimp_long,
as.integer(anesimp_long$patriot_amident))
anesimp_long$pid_x <- with(anesimp_long,
as.integer(anesimp_long$pid_x))
anesimp_long$prevote_inths <- with(anesimp_long,
as.integer(anesimp_long$prevote_inths-1))
# Take log of M&A variable
anesimp_long$LogMANO<-log(anesimp_long$MANo+1.01)
# Convert back to mids type - mice can work with this type
anesimp_long_mids<-as.mids(anesimp_long)
We can now access the complete dataset. We’ll examine one of the sample datasets for the variables of interest, to check what it looks like:
anescomp_long_sample <- mice::complete(imp2, 1)
head(anescomp_long_sample[,myvariables])
## ft_hclinton manuf pid_x patriot_amident china_econ LogMANO
## 1 0 0 6 2 0 0.6981347
## 2 15 0 4 3 1 0.6981347
## 3 70 0 2 4 1 0.6981347
## 4 100 0 0 2 1 0.6981347
## 5 100 0 0 3 0 0.6981347
## 6 100 0 0 4 0 0.6981347
Q: How do you interpret this? Is the LogMANO variable as expected?
We can now, finally, undertake the intended regression analysis. The command below fits one regression with() each of the 5 datasets, and pool()s the answer:
# Regression
fitimp <- with(anesimp_long_mids,
lm(ft_hclinton ~ manuf + pid_x +
patriot_amident + china_econ + LogMANO))
summary(pool(fitimp))
## term estimate std.error statistic df p.value
## 1 (Intercept) 83.4680590 1.6242869 51.3875101 899.5402 7.429171e-270
## 2 manuf1 -1.9645291 2.7415178 -0.7165845 223.1789 4.743797e-01
## 3 pid_x -8.3909245 0.1446329 -58.0153367 5750.0970 0.000000e+00
## 4 patriot_amident 1.9520311 0.3336326 5.8508407 946.9684 6.733117e-09
## 5 china_econ1 -3.5691841 0.6264011 -5.6979209 1280.0548 1.503390e-08
## 6 LogMANO 0.5750746 0.2577293 2.2313124 5877.0254 2.569801e-02
Now we will compare the estimates:
cbind(raw_estimate=summary(fitols)$coefficients[,1],
imputed_estimate=summary(pool(fitimp))$estimate,
raw_pval=summary(fitols)$coefficients[,4],
imputed_pval=summary(pool(fitimp))$p.value)
## raw_estimate imputed_estimate raw_pval imputed_pval
## (Intercept) 78.2652795 83.4680590 0.000000e+00 7.429171e-270
## manuf1 -0.2386601 -1.9645291 9.403605e-01 4.743797e-01
## pid_x -8.6787654 -8.3909245 0.000000e+00 0.000000e+00
## patriot_amident 1.8372318 1.9520311 7.268495e-07 6.733117e-09
## china_econ1 -3.6605796 -3.5691841 1.754394e-07 1.503390e-08
## LogMANO 0.4895650 0.5750746 9.866168e-02 2.569801e-02
And... we’re done!
… except, how do we know if it worked?
Q: What conclusions can we draw from this analysis about the effect of the covariates on the outcome?
Q: How do we know that imputation has done a good job?
Q: If the imputation went badly, how wrong could our estimates be?
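There is no single definitive check, but two standard (hedged) diagnostics are trace plots of the chained-equations sampler to assess convergence, and the observed-versus-imputed comparisons shown above. A sketch:
# Trace plots of the mean and standard deviation of imputed values across iterations;
# healthy chains mix freely and show no trend (this plots every imputed variable,
# so expect many panels)
plot(imp2)
# A complementary check is sensitivity: re-run the imputation with a different seed or
# a larger m, and confirm that the pooled estimates in summary(pool(fitimp)) barely move.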