Missing Data Handling for Policy Research & RCTs

David Loeb
loebd@upenn.edu

May 12, 2025

Outline


  1. Types of missingness
  2. Simple methods
  3. Full information maximum likelihood (FIML)
  4. Multiple imputation
  5. Recommendations for use
  6. Sensitivity analysis

Missing data can create bias

Types of missingness

  • 3 types of missingness
  • Each has unique implications for bias stemming from missing data
  • Missing data strategies make assumptions about types of missingness
  • Types of missingness are untestable

Missing Completely At Random: Prob(miss) has no relationship with missing var

(Conditionally) Missing At Random: P(miss) & missing var relation explained by other vars

\(P(miss_{test}) = f(income)\)

Missing Not At Random: P(miss) related to missing var, independent of other model vars

\(P(miss_{test}) = f(test)\)
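To make the three mechanisms concrete, here is a minimal simulation sketch (not from the slides; the variable names and logistic missingness functions are illustrative assumptions) generating MCAR, MAR, and MNAR missingness in a test score:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
income = rng.normal(84, 20, n)                     # income in $1000s
test = 3.1 + 0.1 * income + rng.normal(0, 3, n)    # test score related to income

# MCAR: probability of missingness is a constant, unrelated to any variable
mcar = rng.random(n) < 0.3

# MAR: probability of missingness depends only on observed income
mar = rng.random(n) < 1 / (1 + np.exp(-(income - 84) / 10))

# MNAR: probability of missingness depends on the test score itself
mnar = rng.random(n) < 1 / (1 + np.exp(-(test - 11.5)))

# Observed test means drift away from the full-data mean under MAR and MNAR
print(test.mean(),
      np.nanmean(np.where(mcar, np.nan, test)),
      np.nanmean(np.where(mar, np.nan, test)),
      np.nanmean(np.where(mnar, np.nan, test)))
```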

Simple Methods

Listwise deletion

  • Drop observations missing on any variables in model
  • Only unbiased when data are MCAR
  • But can perform well in certain other situations, depending on
    • What variables are missing
    • Missingness mechanism

Example missingness scenarios:

\(P(miss_{test}) = f(income)\)

\(P(miss_{income}) = f(test)\)

\(P(miss_{test}) = f(income)\), \(P(miss_{income}) = f(test)\)

\(P(miss_{test}) = f(income, treat)\)

\(P(miss_{test}) = f(income, treat)\), \(P(miss_{income}) = f(test_{pre})\)

\(P(miss_{income}) = f(test_{pre}, treat)\)

Positives

  • If data are MCAR: Unbiased
  • If P(missing) is independent of Y, conditional on X: Reg coefs unbiased
  • If data missing in Y only: Reg coefs equiv to FIML/MI
  • If P(miss X) ind. of treatment in RCT, treat effect may be estimated well

Negatives

  • Throwing out data, so always lose power
  • Means & variances biased unless data are MCAR
  • If missing X depends on Y: Reg coefs biased

Check out this tool for more RCT scenarios
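To illustrate two of the claims above (regression coefficients unbiased when P(missing Y) depends only on an X in the model, but means biased unless MCAR), here is a minimal simulation sketch; the data-generating values are illustrative assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 20_000
income = rng.normal(84, 20, n)
test = 3.1 + 0.1 * income + rng.normal(0, 3, n)

# Missingness in the outcome (test) depends only on income, an X in the model
miss = rng.random(n) < 1 / (1 + np.exp(-(income - 84) / 10))
df = pd.DataFrame({"income": income, "test": np.where(miss, np.nan, test)})

complete = df.dropna()                                 # listwise deletion
slope, intercept = np.polyfit(complete["income"], complete["test"], 1)

print(slope, intercept)                 # coefficients close to the true 0.1 and 3.1
print(df["test"].mean(), test.mean())   # but the observed mean of test is biased
```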

Mean imputation

  • Impute missing data with a constant (typically the sample mean)
  • Add a binary missing data indicator to the model
  • Biased under any missingness mechanism
  • But can still perform well in specific RCT situations

\(P(miss_{income}) = f(test_{pre})\)

\(P(miss_{income}) = f(test_{pre}, treat)\)

  • Attenuates reg coef of missing variables
  • Can be acceptable for imputing baseline covariates in RCTs
    • If P(miss) unrelated to treatment
    • Better power than listwise deletion but riskier
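A minimal sketch of the RCT case described above (a baseline covariate mean-imputed with a missing indicator, missingness unrelated to treatment); all numbers are illustrative assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 20_000
treat = rng.integers(0, 2, n)                          # randomized treatment
income = rng.normal(84, 20, n)                         # baseline covariate
test = 3.1 + 0.5 * treat + 0.1 * income + rng.normal(0, 3, n)

# Baseline income missing for ~30% of the sample, unrelated to treatment
miss = rng.random(n) < 0.3
df = pd.DataFrame({"test": test, "treat": treat,
                   "income": np.where(miss, np.nan, income)})
df["income_miss"] = df["income"].isna().astype(int)            # missing indicator
df["income_imp"] = df["income"].fillna(df["income"].mean())    # constant (mean) fill

# OLS on the imputed covariate plus the indicator; treatment coefficient ~0.5
X = np.column_stack([np.ones(n), df["treat"], df["income_imp"], df["income_miss"]])
coefs, *_ = np.linalg.lstsq(X, df["test"].to_numpy(), rcond=None)
print(dict(zip(["const", "treat", "income_imp", "income_miss"], coefs.round(3))))
```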

Regression imputation

  • Impute missing values with predicted values from a regression of the missing variable on the observed variables

Full Information Maximum Likelihood (FIML)

Normal distribution

Multivariate normal distribution

Maximum likelihood

  • We want to find the means, variances, and covariances that create distributions that most closely match our sample of data
  • Start with mean, variance, and covar parameters which create a multivariate normal distribution:

             Mean    Variance    Covar
Income       $84k    $400k       40
Test Score   11.5    12

  • Based on these, we can generate a multivariate normal curve

Everyone gets a likelihood

  • We compute a number called a likelihood for each observation in our sample
  • How high does each person sit on the multivariate normal curve, given their data?
  • Multiply everyone’s likelihood to get total likelihood score for that specific multivariate normal distribution
  • See if we can increase our likelihood score by changing the distribution parameters
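A minimal sketch of this likelihood computation, using the example parameters from the table above (scipy's multivariate normal density; the three observations are hypothetical, and income is treated as being in $1000s):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Candidate parameters from the example table (income in $1000s)
mean = np.array([84.0, 11.5])                # [income, test score]
cov = np.array([[400.0, 40.0],               # variances on the diagonal,
                [40.0, 12.0]])               # covariance off the diagonal

data = np.array([[90.0, 12.1],               # a few hypothetical observations
                 [60.0, 10.4],
                 [110.0, 13.0]])

mvn = multivariate_normal(mean=mean, cov=cov)
lik = mvn.pdf(data)                          # one likelihood per observation
total_loglik = np.log(lik).sum()             # log of the product of likelihoods
print(lik, total_loglik)

# Maximum likelihood = searching over (mean, cov) for the values that make
# this total (log-)likelihood as large as possible.
```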

But don’t we want regression coefficients?

  • Yes! The means, vars & covars are functions of regression parameters
  • Our regression equation is: \(test_i = \beta_0 + \beta_1income_i + \epsilon_i\)
  • Income now has its own equation too: \(income_i = \mu + e_i\)
  • Our means are:
    • Income: \(\mu\), the sample income mean
    • Test score: \(\beta_0 + \beta_1 \mu\)
  • Our variance & covariances are:
    • Income variance: \(\sigma^2_{e}\), the sample income variance
    • Income-test score covariance: \(\beta_1 \sigma^2_{e}\)
    • Test score variance: \(\beta_1^2 \sigma^2_{e} + \sigma^2_{\epsilon}\)
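As a hypothetical numerical check (the specific \(\beta\) and \(\sigma^2\) values below are not from the slides; they are chosen only to reproduce the earlier table, with income in $1000s): taking \(\mu = 84\), \(\sigma^2_e = 400\), \(\beta_0 = 3.1\), \(\beta_1 = 0.1\), and \(\sigma^2_{\epsilon} = 8\),

\[\begin{align} E[test] &= \beta_0 + \beta_1 \mu = 3.1 + 0.1 \times 84 = 11.5 \\ Cov(income, test) &= \beta_1 \sigma^2_e = 0.1 \times 400 = 40 \\ Var(test) &= \beta_1^2 \sigma^2_e + \sigma^2_{\epsilon} = 0.01 \times 400 + 8 = 12 \end{align}\]

which matches the mean, covariance, and variance in the example table.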

Full information maximum likelihood (FIML)

  • We can still compute a likelihood for a person even if they are missing some data
  • We use only their observed data
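A minimal sketch of this idea: a case missing the test score contributes a likelihood based on the marginal (univariate) normal for income implied by the same parameters. The observation values are hypothetical:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

mean = np.array([84.0, 11.5])                # [income, test score]
cov = np.array([[400.0, 40.0],
                [40.0, 12.0]])

# Complete case: likelihood from the full bivariate normal
full_lik = multivariate_normal(mean, cov).pdf([90.0, 12.1])

# Case missing the test score: use only the observed margin (income),
# i.e. the univariate normal implied by the same mean/variance parameters
partial_lik = norm(loc=mean[0], scale=np.sqrt(cov[0, 0])).pdf(90.0)

print(full_lik, partial_lik)   # both cases contribute to the total likelihood
```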

Partial data helps accuracy


Limitations of FIML

  • Assumes multivariate normal distribution for missing data
  • This is violated when:
    • Missing multicategorical predictors
    • Model contains nonlinearities (e.g., interactions, polynomials) w/ missing data

Multiple Imputation

Multiple imputation is Bayesian

  • Bayesian approach shares similarities with maximum likelihood
  • Both focused on parameters that characterize probability distributions
  • Bayes breaks down estimation into simpler individual steps
  • Estimates are samples drawn from probability distributions

Markov chain Monte Carlo (MCMC)

  • Algorithm that estimates each unknown (model parameters, missing data) one at a time, holding other quantities constant
  • The full cycle repeats thousands of times
  • Think of it like this:
    • Impute missing values, conditional on model parameters
    • Estimate model parameters, conditional on imputed data
    • Repeat
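A deliberately simplified sketch of this impute/estimate cycle (a proper MCMC would draw the parameters from their posterior rather than refitting them, which is what software like Blimp does; the data-generating values here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000
income = rng.normal(84, 20, n)
test_full = 3.1 + 0.1 * income + rng.normal(0, 3, n)
miss = rng.random(n) < 1 / (1 + np.exp(-(income - 84) / 10))    # MAR in test
test = np.where(miss, np.nan, test_full)

# Start from a crude fill, then alternate the two conditional steps
test_imp = np.where(miss, np.nanmean(test), test)
for cycle in range(200):
    # "Estimate model parameters, conditional on imputed data"
    b1, b0 = np.polyfit(income, test_imp, 1)
    resid_sd = np.std(test_imp - (b0 + b1 * income))
    # "Impute missing values, conditional on model parameters" (noise is added
    # so imputations are draws rather than deterministic predictions)
    test_imp = np.where(miss, b0 + b1 * income + rng.normal(0, resid_sd, n),
                        test_imp)

print(b0, b1)   # close to the data-generating intercept 3.1 and slope 0.1
```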

Missing data values are sampled

Missing data are imputed by randomly sampling values from probability distributions created by the model parameters, given the individual’s observed data

Multiple imputation

  • We can save fully imputed datasets created during these cycles
  • Then reanalyze them using frequentist statistics
  • Results will be equivalent to Bayes results
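Combining the analyses of the m saved datasets typically follows Rubin’s rules, which the slides do not show explicitly: average the estimates, and combine within- and between-imputation variability for the standard errors,

\[ \bar{\theta} = \frac{1}{m}\sum_{j=1}^{m}\hat{\theta}_j, \qquad T = W + \left(1 + \frac{1}{m}\right)B \]

where \(W\) is the average within-imputation variance and \(B\) is the variance of the \(\hat{\theta}_j\) across the \(m\) imputations.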

Types of multiple imputation

  • Model-agnostic: all model variables predict all others in a round-robin manner
    • MICE is the most popular model-agnostic algorithm
    • Limitations: cannot properly impute data with nonlinearities
  • Model-based: data imputed in context of regression model of interest
    • Most flexible type - can handle nonlinearities & multicategorical predictors
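A minimal sketch of round-robin imputation; it uses scikit-learn’s IterativeImputer as a convenient stand-in for a MICE-style algorithm rather than the MICE package itself, and the data-generating values are illustrative assumptions:

```python
import numpy as np
# IterativeImputer is still flagged experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(4)
n = 2_000
income = rng.normal(84, 20, n)
test = 3.1 + 0.1 * income + rng.normal(0, 3, n)
test[rng.random(n) < 0.3] = np.nan              # drop ~30% of test scores

X = np.column_stack([income, test])

# Round-robin imputation: each column with missing values is modeled from the
# others in turn; sample_posterior=True adds noise so repeated runs yield
# multiple distinct imputed datasets
imputed_sets = [
    IterativeImputer(sample_posterior=True, random_state=m).fit_transform(X)
    for m in range(5)
]
print(np.nanmean(X[:, 1]), np.mean([d[:, 1].mean() for d in imputed_sets]))
```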

Blimp

  • Software for Bayesian estimation and/or multiple imputation
  • Free, easy to use, amazing user guide, can call via an R package
  • Download here: appliedmissingdata.com/blimp

Advantages & limitations of Bayes/multiple imputation

  • Bayesian estimation is very flexible: it can handle non-normal data and basically any situation you can imagine

But…

  • Can take a long time for models to run
  • Convergence can be tricky to troubleshoot
  • Results are sensitive to algorithmic decisions

Recommendations for Use

When to use each method

Listwise Deletion (easy to use)

  • Missing Y only
  • RCT where baseline data collected pre-randomization
    • & power won’t be substantially harmed by reduced sample
    • & you only care about regression coefs
  • Always good to run for sensitivity

FIML (medium)

  • Anytime you meet the multivariate normal assumption, i.e. no missing multicategorical predictors or nonlinearities

Multiple Imputation (hard)

  • If you need to impute multicategorical predictors or interactions/polynomials
  • Anytime you don’t mind waiting for models to run

Caveats & special cases

  • Listwise deletion outperforms FIML/MI in two special cases [1]:
    • X is MNAR and independent of Y
    • Logistic regression where missing data are either dichotomous Y or X (but not both) & P(missing) depends only on Y
  • Factored regression can enable FIML to handle interactions & polynomials [2]
  • Auxiliary variables that predict missingness and Y but are not in the focal model boost performance of FIML/MI

Best method is situation dependent

  • Always important to consider potential missingness mechanisms
  • Best to take multiple approaches and test sensitivity of results
  • For help thinking through approaches in RCTs, check out this tool

Sensitivity Analysis

What if data are MNAR?

  • Untestable
  • But can “stress test” model params under MNAR assumptions
  • 2 approaches: pattern mixture model & selection model

Pattern mixture model

  • Assumption: people missing Y have different average Y values from those with observed Y
  • Model intercept is combo of the two groups

\[ test_i = \beta_0 + \beta_1 income_i + \epsilon_i \]

\[ \beta_0 = \beta_{0(obs)} \times p_{obs} + \beta_{0(miss)} \times p_{miss} \]

\(p\) = sample proportion

How does this assumption affect the intercept?

  • Goal: calculate \(\beta_0\) under this MNAR assumption
  • Compare to \(\beta_0\) from main model (that assumes MAR)

Estimation Step 1:
Add missing indicator to model

\[ test_i = \beta_{0(obs)} + \delta miss_i + \beta_1 income_i + \epsilon_i \]

\(miss_i\) = 1 if missing test data, 0 if observed test data

\(\delta = \beta_{0(miss)} - \beta_{0(obs)}\)

  • With \(\beta_{0(obs)}\) & \(\delta\), we can calculate \(\beta_0\)

\(\beta_0 = \beta_{0(obs)} \times p_{obs} + \beta_{0(miss)} \times p_{miss}\)

\(\beta_0 = \beta_{0(obs)} \times p_{obs} + (\beta_{0(obs)} + \delta) \times p_{miss}\)

  • Problem: no data to estimate \(\delta\)

Step 2: Plug in difference between groups & compute directly

  • Choose an effect-size difference in means between the groups and use it to compute \(\delta\)

\[ \delta = d \times \sigma_{test} \]

\(d\) = Cohen’s D effect size
\(\sigma_{test}\) = test score SD

\[ \delta = 0.2 \times 3 = 0.6 \]

Step 3: Estimate model with this fixed difference


\(test_i = \beta_{0(obs)} + \delta miss_i + \beta_1 income_i + \epsilon_i\)

\(test_i = \beta_{0(obs)} + 0.6 \times miss_i + \beta_1 income_i + \epsilon_i\)

  • This gives us an estimate of \(\beta_{0(obs)}\)

Step 4: Compute overall intercept


\[ \beta_0 = \beta_{0(obs)} \times p_{obs} + (\beta_{0(obs)} + \delta) \times p_{miss} \]

  • Compare to main imputation/FIML model \(\beta_0\) to see sensitivity to MNAR process
  • Check sensitivity to higher/lower effect size differences
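A worked example with hypothetical numbers (only \(\delta = 0.6\) comes from Step 2; the intercept estimate and missingness proportion below are assumed for illustration): suppose the Step 3 model gives \(\beta_{0(obs)} = 8.0\) and 25% of test scores are missing. Then

\[ \beta_0 = 8.0 \times 0.75 + (8.0 + 0.6) \times 0.25 = 8.15 \]

i.e. this MNAR assumption shifts the intercept by \(\delta \times p_{miss} = 0.15\) relative to \(\beta_{0(obs)}\).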

Same process for testing slope sensitivity

\[ test_i = \beta_0 + \beta_1 treat_i + \beta_2 income_i + \epsilon_i \]

Interact missing indicator w/ variable whose slope you want to test

\[\begin{align} test_i = & \textcolor[rgb]{0.20,0.70,0.20}{\beta_{0(obs)} + \delta_0 miss_i} + \textcolor[rgb]{0.00,0.00,1.00}{\beta_{1(obs)} treat_i +} \\ & \textcolor[rgb]{0.00,0.00,1.00}{\delta_1 miss_i \times treat_i} + \beta_2 income_i + \epsilon_i \end{align}\]

\(\textcolor[rgb]{0.20,0.70,0.20}{\delta_0}\) = ctrl group, missing vs observed Y mean diff
\(\textcolor[rgb]{0.00,0.00,1.00}{\delta_1}\) = treat group, missing vs observed ATE diff

ATE: Difference between the new (MNAR-adjusted) treatment & control group mixture means

Observed Mean Missing Mean
Ctrl \(\textcolor[rgb]{0.20,0.70,0.20}{\beta_{0(obs)}}\) \(\textcolor[rgb]{0.20,0.70,0.20}{\beta_{0(obs)} + \delta_0}\)
Treat \(\textcolor[rgb]{0.00,0.00,1.00}{\beta_{0(obs)} + \beta_{1(obs)}}\) \(\textcolor[rgb]{0.00,0.00,1.00}{\beta_{0(obs)} + \delta_0 + \beta_{1(obs)} + \delta_1}\)

\[ \textcolor[rgb]{0.20,0.70,0.20}{mean_{ctrl}} = \textcolor[rgb]{0.20,0.70,0.20}{\beta_0} = \textcolor[rgb]{0.20,0.70,0.20}{\beta_{0(obs)} \times p_{0(obs)}} + \textcolor[rgb]{0.20,0.70,0.20}{(\beta_{0(obs)} + \delta_0) \times p_{0(miss)}} \]

\[\begin{align} \textcolor[rgb]{0.00,0.00,1.00}{mean_{treat}} = & \textcolor[rgb]{0.00,0.00,1.00}{(\beta_{0(obs)} + \beta_{1(obs)}) \times p_{1(obs)}} + \\ & \textcolor[rgb]{0.00,0.00,1.00}{(\beta_{0(obs)} + \delta_0 + \beta_{1(obs)} + \delta_1) \times p_{1(miss)}} \end{align}\]

\[ ATE_{MNAR} = \textcolor[rgb]{0.00,0.00,1.00}{mean_{treat}} - \textcolor[rgb]{0.20,0.70,0.20}{mean_{ctrl}} \]
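A minimal sketch that plugs hypothetical values into these formulas (all inputs below are illustrative assumptions, not estimates from the slides):

```python
def ate_mnar(b0_obs, b1_obs, d0, d1, p0_miss, p1_miss):
    """ATE under the pattern-mixture MNAR assumption, per the formulas above."""
    mean_ctrl = b0_obs * (1 - p0_miss) + (b0_obs + d0) * p0_miss
    mean_treat = ((b0_obs + b1_obs) * (1 - p1_miss)
                  + (b0_obs + d0 + b1_obs + d1) * p1_miss)
    return mean_treat - mean_ctrl

# Hypothetical inputs: 20% missing Y in control, 30% in treatment,
# delta_0 = delta_1 = 0.6 (a 0.2 SD shift with SD = 3, as in Step 2)
print(ate_mnar(b0_obs=10.0, b1_obs=0.5, d0=0.6, d1=0.6,
               p0_miss=0.2, p1_miss=0.3))   # compare with the MAR-based ATE
```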

Conclusion

None of these methods are perfect

  • All make untestable assumptions
  • Important to think carefully about potential missingness mechanisms
  • Take multiple approaches based on different plausible assumptions

…and let me know if you want some help!

References

  1. van Buuren, S. (2018). 2.7 When not to use multiple imputation. Flexible Imputation of Missing Data. https://stefvanbuuren.name/fimd/sec-when.html

  2. Enders, C. K. (2023). Missing data: An update on the state of the art. Psychological Methods. https://doi.org/10.1037/met0000563