# Missing Data

**Session 18**


---
class: title title-1

# Visualizing missingness

```r
library(visdat)
vis_dat(nhanes)
```

.right-plot[
<img src="18-missing-data_files/figure-html/unnamed-chunk-2-1.png" width="504" style="display: block; margin: auto;" />
]
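---

# Visualizing missingness

A plot is easier to read next to a count. As a minimal sketch, assuming a small made-up data frame `toy` (not the real `nhanes` data), base R can tabulate the missing values per column and the share of complete rows:

```r
# Hypothetical toy data; NA marks a missing value
toy <- data.frame(
  age = c(1, 2, 1, 3),
  bmi = c(NA, 22.7, NA, 24.9),
  chl = c(NA, 187, 187, NA)
)

# Number of missing values in each column
n_missing <- colSums(is.na(toy))

# Proportion of rows with no missing values at all
prop_complete <- mean(complete.cases(toy))
```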
---

# Application Exercise

1. `install.packages("visdat")`
2. Load the causaldata package and use the `vis_dat()` function to examine the `nhefs` data frame
3. Fit a linear model predicting `pregnancies` from `birthcontrol`
4. Knit, commit, push to GitHub

---
class: title title-1

# Types of Missing Data

.box-1[**Missing at random (MAR):** the probability that a variable is missing depends only on available information (the other, observed variables)]
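---

# Types of Missing Data

This mechanism (often called missing at random) can be simulated. The sketch below is illustrative only: `age`, `bmi`, and the missingness probabilities are made-up values, chosen so that the chance `bmi` is missing depends on the fully observed `age` and not on `bmi` itself.

```r
set.seed(1)
n <- 1000

# Fully observed variable (three hypothetical age groups)
age <- sample(1:3, n, replace = TRUE)

# Variable that will receive missing values
bmi <- rnorm(n, mean = 26, sd = 4)

# Missingness probability depends only on age: 10%, 30%, 50%
p_missing <- c(0.1, 0.3, 0.5)[age]
bmi_obs <- ifelse(runif(n) < p_missing, NA, bmi)

# Observed proportion missing within each age group
miss_rate <- tapply(is.na(bmi_obs), age, mean)
```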

---

# Imputation

.box-1[Use the observed variables to fit *models* that predict what each missing value would be; repeating this several times gives multiple imputation]

---

# Predictive mean matching

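.box-1[For observation i with variable A missing, find the set of observations whose predicted values for variable A are closest to the predicted value for observation i, then impute the observed value of A from one randomly chosen member of that set.]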
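---

# Predictive mean matching

The matching idea can be sketched in a few lines of base R. This is a simplified illustration, not the algorithm `mice` runs: the function `pmm_impute()`, the donor pool size `k`, and the toy vectors are all assumptions made for this example, and the `pmm` method in `mice` differs in its details.

```r
# Simplified predictive mean matching for one variable y given predictor x
pmm_impute <- function(y, x, k = 5) {
  observed <- !is.na(y)

  # Fit the prediction model on rows where y is observed
  fit <- lm(y[observed] ~ x[observed])
  beta <- unname(coef(fit))

  # Predicted value of y for every row, observed or not
  predicted <- beta[1] + beta[2] * x

  for (i in which(!observed)) {
    # Donors: the k observed rows with the closest predicted values
    distance <- abs(predicted[observed] - predicted[i])
    donors <- order(distance)[seq_len(min(k, sum(observed)))]

    # Impute the observed value of one randomly chosen donor
    pick <- donors[sample.int(length(donors), 1)]
    y[i] <- y[observed][pick]
  }
  y
}

# Hypothetical toy data
set.seed(1)
age <- c(1, 2, 1, 3, 1, 3, 2, 2)
bmi <- c(NA, 22.7, NA, 24.9, 20.4, NA, 22.5, 30.1)
bmi_imputed <- pmm_impute(bmi, age, k = 3)
```

Because every imputed value is copied from a donor, imputations always fall within the range of the observed data.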

---

# Example

```r
library(visdat)
vis_dat(nhanes)
```

.right-plot[
<img src="18-missing-data_files/figure-html/unnamed-chunk-4-1.png" width="504" style="display: block; margin: auto;" />
]

---
class:  title title-1

# Predictive mean matching

```r
library(mice)
set.seed(1)
nhanes_imp <- mice(nhanes, m = 1, method = "pmm")
```

---

# Predictive mean matching

```r
*library(mice)
set.seed(1)
nhanes_imp <- mice(nhanes, m = 1, method = "pmm")
```

---

# Predictive mean matching

```r
library(mice)
set.seed(1)
*nhanes_imp <- mice(nhanes, m = 1, method = "pmm")
```

```
## 
##  iter imp variable
##   1   1  bmi  hyp  chl
##   2   1  bmi  hyp  chl
##   3   1  bmi  hyp  chl
##   4   1  bmi  hyp  chl
##   5   1  bmi  hyp  chl
```

---

# Predictive mean matching

```r
nhanes_imp$imp$bmi
```

```
##       1
## 1  29.6
## 3  27.2
## 4  20.4
## 6  25.5
## 10 21.7
## 11 22.0
## 12 22.7
## 16 26.3
## 21 30.1
```


```r
library(dplyr)
# Find the donor: the bmi of 29.6 imputed for row 1 was observed in row 15
nhanes %>%
  filter(bmi == 29.6)
```

```
##    age  bmi hyp chl
## 15   1 29.6   1  NA
```

---

# Predictive mean matching

```r
complete(nhanes_imp)
```

```
##    age  bmi hyp chl
## 1    1 29.6   1 131
## 2    2 22.7   1 187
## 3    1 27.2   1 187
## 4    3 20.4   2 204
## 5    1 20.4   1 113
## 6    3 25.5   2 184
## 7    1 22.5   1 118
## 8    1 30.1   1 187
## 9    2 22.0   1 238
## 10   2 21.7   2 131
## 11   1 22.0   1 187
## 12   2 22.7   1 187
## 13   3 21.7   1 206
## 14   2 28.7   2 204
## 15   1 29.6   1 187
## 16   1 26.3   1 187
## 17   3 27.2   2 284
## 18   2 26.3   2 199
## 19   1 35.3   1 218
## 20   3 25.5   2 218
## 21   1 30.1   1 187
## 22   1 33.2   1 229
## 23   1 27.5   1 131
## 24   3 24.9   1 284
## 25   2 27.4   1 186
```
---

1. `install.packages("mice")`
2. Use the `mice()` function to create a single imputed dataset of the `nhefs` data using predictive mean matching
3. Fit a linear model predicting `pregnancies` from `birthcontrol` on the imputed dataset
4. Explore the help files for `?mice`. What types of univariate imputation methods are included?

---

# Multiple imputation

```r
set.seed(1)
nhanes_imp <- mice(nhanes,
                   m = 5, 
                   method = "pmm")
```

---

# Multiple imputation

```r
set.seed(1)
nhanes_imp <- mice(nhanes,
*                  m = 5,
                   method = "pmm")
```
---

# Multiple imputation

```r
nhanes_imp$imp$bmi
```

```
##       1    2    3    4    5
## 1  27.2 35.3 27.2 29.6 24.9
## 3  28.7 27.2 35.3 30.1 29.6
## 4  25.5 20.4 22.0 20.4 25.5
## 6  24.9 21.7 22.7 24.9 27.4
## 10 28.7 22.7 20.4 20.4 22.7
## 11 30.1 29.6 35.3 22.7 29.6
## 12 22.0 27.2 26.3 27.4 22.5
## 16 22.0 27.2 35.3 22.7 24.9
## 21 26.3 29.6 27.2 29.6 27.2
```
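---

# Multiple imputation

The spread across columns reflects how uncertain each imputed value is. As a small illustration, the five imputed `bmi` values for row 1 are copied by hand from the table above (the vector name `bmi_row1` is made up for this example):

```r
# Five imputed bmi values for row 1, one per imputed dataset
bmi_row1 <- c(27.2, 35.3, 27.2, 29.6, 24.9)

# Average and spread of the imputations for this one missing value
bmi_row1_mean <- mean(bmi_row1)
bmi_row1_range <- range(bmi_row1)
bmi_row1_sd <- sd(bmi_row1)
```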

---

# Multiple imputation

```r
complete(nhanes_imp, 1)
```

```
##    age  bmi hyp chl
## 1    1 27.2   1 199
## 2    2 22.7   1 187
## 3    1 28.7   1 187
## 4    3 25.5   1 218
## 5    1 20.4   1 113
## 6    3 24.9   1 184
## 7    1 22.5   1 118
## 8    1 30.1   1 187
## 9    2 22.0   1 238
## 10   2 28.7   1 206
## 11   1 30.1   1 199
## 12   2 22.0   1 187
## 13   3 21.7   1 206
## 14   2 28.7   2 204
## 15   1 29.6   1 206
## 16   1 22.0   1 187
## 17   3 27.2   2 284
## 18   2 26.3   2 199
## 19   1 35.3   1 218
## 20   3 25.5   2 199
## 21   1 26.3   1 229
## 22   1 33.2   1 229
## 23   1 27.5   1 131
## 24   3 24.9   1 206
## 25   2 27.4   1 186
```

---

# Multiple imputation

```r
complete(nhanes_imp, 2)
```

```
##    age  bmi hyp chl
## 1    1 35.3   1 284
## 2    2 22.7   1 187
## 3    1 27.2   1 187
## 4    3 20.4   1 118
## 5    1 20.4   1 113
## 6    3 21.7   2 184
## 7    1 22.5   1 118
## 8    1 30.1   1 187
## 9    2 22.0   1 238
## 10   2 22.7   1 187
## 11   1 29.6   1 206
## 12   2 27.2   1 199
## 13   3 21.7   1 206
## 14   2 28.7   2 204
## 15   1 29.6   1 186
## 16   1 27.2   1 184
## 17   3 27.2   2 284
## 18   2 26.3   2 199
## 19   1 35.3   1 218
## 20   3 25.5   2 218
## 21   1 29.6   1 204
## 22   1 33.2   1 229
## 23   1 27.5   1 131
## 24   3 24.9   1 238
## 25   2 27.4   1 186
```

---

# Multiple imputation

```r
complete(nhanes_imp, 3)
```

```
##    age  bmi hyp chl
## 1    1 27.2   1 187
## 2    2 22.7   1 187
## 3    1 35.3   1 187
## 4    3 22.0   1 184
## 5    1 20.4   1 113
## 6    3 22.7   2 184
## 7    1 22.5   1 118
## 8    1 30.1   1 187
## 9    2 22.0   1 238
## 10   2 20.4   1 187
## 11   1 35.3   2 184
## 12   2 26.3   1 206
## 13   3 21.7   1 206
## 14   2 28.7   2 204
## 15   1 29.6   1 229
## 16   1 35.3   2 187
## 17   3 27.2   2 284
## 18   2 26.3   2 199
## 19   1 35.3   1 218
## 20   3 25.5   2 206
## 21   1 27.2   1 131
## 22   1 33.2   1 229
## 23   1 27.5   1 131
## 24   3 24.9   1 284
## 25   2 27.4   1 186
```

---

# Multiple imputation

```r
complete(nhanes_imp, 4)
```

```
##    age  bmi hyp chl
## 1    1 29.6   1 187
## 2    2 22.7   1 187
## 3    1 30.1   1 187
## 4    3 20.4   1 187
## 5    1 20.4   1 113
## 6    3 24.9   1 184
## 7    1 22.5   1 118
## 8    1 30.1   1 187
## 9    2 22.0   1 238
## 10   2 20.4   1 238
## 11   1 22.7   1 131
## 12   2 27.4   2 184
## 13   3 21.7   1 206
## 14   2 28.7   2 204
## 15   1 29.6   1 229
## 16   1 22.7   1 113
## 17   3 27.2   2 284
## 18   2 26.3   2 199
## 19   1 35.3   1 218
## 20   3 25.5   2 184
## 21   1 29.6   1 206
## 22   1 33.2   1 229
## 23   1 27.5   1 131
## 24   3 24.9   1 199
## 25   2 27.4   1 186
```

---

# Multiple imputation

```r
complete(nhanes_imp, 5)
```

```
##    age  bmi hyp chl
## 1    1 24.9   1 238
## 2    2 22.7   1 187
## 3    1 29.6   1 187
## 4    3 25.5   2 238
## 5    1 20.4   1 113
## 6    3 27.4   1 184
## 7    1 22.5   1 118
## 8    1 30.1   1 187
## 9    2 22.0   1 238
## 10   2 22.7   1 187
## 11   1 29.6   1 206
## 12   2 22.5   1 187
## 13   3 21.7   1 206
## 14   2 28.7   2 204
## 15   1 29.6   1 206
## 16   1 24.9   1 238
## 17   3 27.2   2 284
## 18   2 26.3   2 199
## 19   1 35.3   1 218
## 20   3 25.5   2 187
## 21   1 27.2   1 187
## 22   1 33.2   1 229
## 23   1 27.5   1 131
## 24   3 24.9   1 218
## 25   2 27.4   1 186
```

---
class: title title-1

1. Use the `mice()` function to create 5 imputed datasets of the `nhefs` data using predictive mean matching
2. Fit a linear model predicting `pregnancies` from `birthcontrol` on each of the 5 imputed datasets
3. Knit, commit, and push to GitHub

---

# Pool Results

```r
fit <- with(data = nhanes_imp, lm(age ~ bmi))
```

---

# Pool Results

```r
fit <- with(data = nhanes_imp, lm(age ~ bmi))
fit
```

```
## call :
## with.mids(data = nhanes_imp, expr = lm(age ~ bmi))
## 
## call1 :
## mice(data = nhanes, m = 5, method = "pmm")
## 
## nmis :
## age bmi hyp chl 
##   0   9   8  10 
## 
## analyses :
## [[1]]
## 
## Call:
## lm(formula = age ~ bmi)
## 
## Coefficients:
## (Intercept)          bmi  
##     3.71598     -0.07405  
## 
## 
## [[2]]
## 
## Call:
## lm(formula = age ~ bmi)
## 
## Coefficients:
## (Intercept)          bmi  
##      4.5686      -0.1054  
## 
## 
## [[3]]
## 
## Call:
## lm(formula = age ~ bmi)
## 
## Coefficients:
## (Intercept)          bmi  
##     4.28034     -0.09311  
## 
## 
## [[4]]
## 
## Call:
## lm(formula = age ~ bmi)
## 
## Coefficients:
## (Intercept)          bmi  
##     3.84211     -0.07974  
## 
## 
## [[5]]
## 
## Call:
## lm(formula = age ~ bmi)
## 
## Coefficients:
## (Intercept)          bmi  
##     3.74803     -0.07538
```

---

# Pool Results

```r
fit <- with(data = nhanes_imp, lm(age ~ bmi))
pool(fit)
```

```
## Class: mipo    m = 5 
##          term m    estimate        ubar            b           t dfcom       df
## 1 (Intercept) 5  4.03102333 1.069419646 0.1415549368 1.239285570    23 16.86913
## 2         bmi 5 -0.08554488 0.001494635 0.0001806221 0.001711381    23 17.25864
##         riv    lambda       fmi
## 1 0.1588394 0.1370676 0.2239293
## 2 0.1450164 0.1266501 0.2128701
```

---

# Pool Results

```r
fit <- with(data = nhanes_imp, lm(age ~ bmi))
*pool(fit)
```

---

# Pool Results

```r
with(data = nhanes_imp, lm(age ~ bmi)) %>%
  pool() %>%
  tidy()
```

```
##          term    estimate  std.error statistic     p.value            b
## 1 (Intercept)  4.03102333 1.11323204  3.621009 0.002132875 0.1415549368
## 2         bmi -0.08554488 0.04136885 -2.067857 0.053982698 0.0001806221
##         df dfcom       fmi    lambda m       riv        ubar
## 1 16.86913    23 0.2239293 0.1370676 5 0.1588394 1.069419646
## 2 17.25864    23 0.2128701 0.1266501 5 0.1450164 0.001494635
```

---

# Pooling effect estimates

.box-1.medium[
`$$\bar\theta = \frac{1}{m}\sum_{i=1}^m\theta_i$$`
]
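---

# Pooling effect estimates

The pooled estimate can be checked by hand. This sketch copies the five coefficients printed by `with()` earlier (rounded as shown there) and averages them; the object names are made up for this example. The results should agree with the `estimate` column from `pool()` up to rounding.

```r
# Coefficients from the five per-imputation fits of lm(age ~ bmi)
intercepts <- c(3.71598, 4.5686, 4.28034, 3.84211, 3.74803)
slopes <- c(-0.07405, -0.1054, -0.09311, -0.07974, -0.07538)

# Pooled estimate: the simple average across the m = 5 fits
pooled_intercept <- mean(intercepts)
pooled_slope <- mean(slopes)
```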

---

# Pooling standard errors

.box-1.medium[
`$$V_{Total} = V_{within} + V_{between} + \frac{V_{between}}{m}$$`
]

---

# Pooling standard errors

.box-1.medium[
`$$V_{within} =\frac{1}{m}\sum_{i=1}^mSE_i^2$$`
]

---

# Pooling standard errors

.box-1.medium[
`$$V_{between} =\frac{\sum_{i=1}^m(\theta_i-\bar{\theta})^2}{m-1}$$`
]
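---

# Pooling standard errors

The three variance formulas can be combined by hand for the `bmi` coefficient. In this sketch the per-imputation slopes are copied (rounded) from the `with()` output, and the within-imputation variance is taken from the `ubar` column of `pool(fit)` because the individual standard errors are not printed. The object names are made up for this example.

```r
# Slopes from the five per-imputation fits of lm(age ~ bmi)
theta <- c(-0.07405, -0.1054, -0.09311, -0.07974, -0.07538)
m <- length(theta)

# Within-imputation variance: average squared standard error (ubar)
v_within <- 0.001494635

# Between-imputation variance: sample variance of the m estimates
v_between <- var(theta)

# Total variance and pooled standard error
v_total <- v_within + v_between + v_between / m
se_pooled <- sqrt(v_total)
```

Up to rounding of the copied coefficients, `v_between` should match the `b` column and `se_pooled` the `std.error` reported by `tidy()`.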

---

1. Use the `with()` function to fit a model on each of your 5 imputed datasets.
2. Use the `pool()` function to pool the results. Compare them to your previous results.
3. Knit, commit, and push to GitHub

---

# Adding in the PS weighting

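.box-inv-1[**Across:** create m imputed datasets, estimate the propensity score within each one, average each observation's propensity score across the m datasets, fit the outcome model in each dataset using the averaged score, and pool the results.]

.box-inv-1[**Within:** create m imputed datasets, estimate the propensity score within each one, fit a separate outcome model in each dataset using its own propensity score, and pool the results.]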
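---

# Adding in the PS weighting

The difference between the two strategies comes down to which propensity score feeds the weights. The sketch below is a toy illustration in base R, not what `MatchThem` does internally: the exposure vector, the matrix `ps` of propensity scores (one column per imputed dataset), and the helper `ate_weight()` are all made up for this example.

```r
set.seed(1)
n <- 6  # observations
m <- 3  # imputed datasets

exposure <- c(1, 0, 1, 0, 1, 0)

# Hypothetical propensity scores: row = observation, column = imputed dataset
ps <- matrix(runif(n * m, min = 0.2, max = 0.8), nrow = n, ncol = m)

# ATE weight: 1 / ps for the exposed, 1 / (1 - ps) for the unexposed
ate_weight <- function(exposure, ps) {
  exposure / ps + (1 - exposure) / (1 - ps)
}

# Within: each dataset gets weights from its own propensity score
w_within <- apply(ps, 2, function(p) ate_weight(exposure, p))

# Across: average the scores first, then reuse one set of weights everywhere
ps_avg <- rowMeans(ps)
w_across <- ate_weight(exposure, ps_avg)
```

In both cases the outcome model is then fit in each imputed dataset and the m results are pooled.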

---

# Adding in the PS weighting

```r
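# Simulate a binary exposure for illustration (25 rows, P(exp = 1) = 0.5)
nhanes$exp <- rbinom(25, 1, 0.5)
# Re-impute so the imputation models include the new exposure column
nhanes_imp <- mice(nhanes, m = 5, method = "pmm")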
```

---

# MatchThem package

```r
library(MatchThem)
nhanes_weighted <- weightthem(exp ~ bmi + hyp + chl, 
                              nhanes_imp,
                              approach = "across",
                              estimand = "ATE")
```

---

# MatchThem package

```r
*library(MatchThem)
nhanes_weighted <- weightthem(exp ~ bmi + hyp + chl, 
                              nhanes_imp,
                              approach = "across",
                              estimand = "ATE")
```

---

# MatchThem package

```r
library(MatchThem)
*nhanes_weighted <- weightthem(exp ~ bmi + hyp + chl,
                              nhanes_imp,
                              approach = "across",
                              estimand = "ATE")
```

---

# MatchThem package

```r
library(MatchThem)
nhanes_weighted <- weightthem(exp ~ bmi + hyp + chl, 
*                             nhanes_imp,
                              approach = "across",
                              estimand = "ATE")
```

---
class: title title-1

# MatchThem package

```r
library(MatchThem)
nhanes_weighted <- weightthem(exp ~ bmi + hyp + chl, 
                              nhanes_imp,
*                             approach = "across",
                              estimand = "ATE") 
```

---
class: title title-1

# MatchThem package

```r
library(MatchThem)
nhanes_weighted <- weightthem(exp ~ bmi + hyp + chl, 
                              nhanes_imp,
                              approach = "across",
*                             estimand = "ATE")
```

---
class: title title-1

# MatchThem package

```r
library(survey)
with(nhanes_weighted, svyglm(age ~ exp)) %>%
       pool() %>%
       tidy()
```

```
##          term   estimate std.error statistic      p.value b       df dfcom
## 1 (Intercept) 1.57185008 0.1910444 8.2276685 4.828489e-08 0 21.22865    23
## 2         exp 0.06565242 0.3074550 0.2135351 8.329471e-01 0 21.22865    23
##          fmi lambda m riv       ubar
## 1 0.08254692      0 5   0 0.03649797
## 2 0.08254692      0 5   0 0.09452855
```

---
class: title title-1

# MatchThem package

```r
*library(survey)
with(nhanes_weighted, svyglm(age ~ exp)) %>%
       pool() %>%
       tidy()
```

---

# MatchThem package

```r
library(survey)
*with(nhanes_weighted, svyglm(age ~ exp)) %>%
       pool() %>%
       tidy()
```

---

# MatchThem package

```r
library(survey)
with(nhanes_weighted, svyglm(age ~ exp)) %>%
*      pool() %>%
       tidy()
```

---

# MatchThem package

```r
library(survey)
with(nhanes_weighted, svyglm(age ~ exp)) %>%
       pool() %>%
*      tidy()
```

---

# MatchThem package

```r
library(MatchThem)
nhanes_weighted <- weightthem(exp ~ bmi + hyp + chl, 
                              nhanes_imp,
*                             approach = "within",
                              estimand = "ATE") 
```

---
class: title title-1

# MatchThem package

```r
library(survey)
with(nhanes_weighted, svyglm(age ~ exp)) %>%
       pool() %>%
       tidy() 
```

```
##          term   estimate std.error statistic      p.value           b        df
## 1 (Intercept) 1.58535946 0.2354188 6.7342084 9.412533e-05 0.017759358  8.811045
## 2         exp 0.03782741 0.3140540 0.1204487 9.053624e-01 0.005292547 19.462903
##   dfcom       fmi     lambda m       riv       ubar
## 1    23 0.4887464 0.38452637 5 0.6247650 0.03411080
## 2    23 0.1476952 0.06439279 5 0.0688246 0.09227887
```

---

1. `install.packages("MatchThem")`
2. Use the `weightthem()` function to estimate the uncertainty for the causal effect of `birthcontrol` on `pregnancies`
3. Knit, commit, and push to GitHub

---