Missing Data Handling for Policy Research & RCTs

David Loeb
loebd@upenn.edu

May 12, 2025

Outline


  1. Types of missingness
  2. Simple methods
  3. Full information maximum likelihood (FIML)
  4. Multiple imputation
  5. Recommendations for use
  6. Sensitivity analysis

Missing data can create bias

Types of missingness

  • 3 types of missingness
  • Each has unique implications for bias stemming from missing data
  • Missing data strategies make assumptions about types of missingness
  • Types of missingness are untestable

Missing Completely At Random: Prob(miss) has no relationship with missing var

(Conditionally) Missing At Random: P(miss) & missing var relation explained by other vars

\(P(miss_{test}) = f(income)\)

Missing Not At Random: P(miss) related to missing var, independent of other model vars

\(P(miss_{test}) = f(test)\)
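To make the three mechanisms concrete, here is a minimal simulation sketch (not from the slides; the variable names and logistic missingness functions are illustrative assumptions) generating MCAR, MAR, and MNAR missingness in a test score:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
income = rng.normal(84, 20, n)                     # income in $1000s
test = 3.1 + 0.1 * income + rng.normal(0, 3, n)    # test score related to income

# MCAR: probability of missingness is a constant, unrelated to any variable
mcar = rng.random(n) < 0.3

# MAR: probability of missingness depends only on observed income
mar = rng.random(n) < 1 / (1 + np.exp(-(income - 84) / 10))

# MNAR: probability of missingness depends on the test score itself
mnar = rng.random(n) < 1 / (1 + np.exp(-(test - 11.5)))

# Observed test means drift away from the full-data mean under MAR and MNAR
print(test.mean(),
      np.nanmean(np.where(mcar, np.nan, test)),
      np.nanmean(np.where(mar, np.nan, test)),
      np.nanmean(np.where(mnar, np.nan, test)))
```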

Simple Methods

Listwise deletion

  • Drop observations missing on any variables in model
  • Only unbiased when data are MCAR
  • But can perform well in certain other situations, depending on
    • What variables are missing
    • Missingness mechanism

Example missingness scenarios:

\(P(miss_{test}) = f(income)\)

\(P(miss_{income}) = f(test)\)

\(P(miss_{test}) = f(income)\), \(P(miss_{income}) = f(test)\)

\(P(miss_{test}) = f(income, treat)\)

\(P(miss_{test}) = f(income, treat)\), \(P(miss_{income}) = f(test_{pre})\)

\(P(miss_{income}) = f(test_{pre}, treat)\)

Positives

  • If data are MCAR: Unbiased
  • If P(missing) is independent of Y, conditional on X: Reg coefs unbiased
  • If data missing in Y only: Reg coefs equiv to FIML/MI
  • If P(miss X) ind. of treatment in RCT, treat effect may be estimated well

Negatives

  • Throwing out data, so always lose power
  • Means & variances biased unless data are MCAR
  • If missing X depends on Y: Reg coefs biased

Check out this tool for more RCT scenarios
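To illustrate two of the claims above (regression coefficients unbiased when P(missing Y) depends only on an X in the model, but means biased unless MCAR), here is a minimal simulation sketch; the data-generating values are illustrative assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 20_000
income = rng.normal(84, 20, n)
test = 3.1 + 0.1 * income + rng.normal(0, 3, n)

# Missingness in the outcome (test) depends only on income, an X in the model
miss = rng.random(n) < 1 / (1 + np.exp(-(income - 84) / 10))
df = pd.DataFrame({"income": income, "test": np.where(miss, np.nan, test)})

complete = df.dropna()                                 # listwise deletion
slope, intercept = np.polyfit(complete["income"], complete["test"], 1)

print(slope, intercept)                 # coefficients close to the true 0.1 and 3.1
print(df["test"].mean(), test.mean())   # but the observed mean of test is biased
```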

Mean imputation

  • Impute missing data with a constant (typically the sample mean)
  • Add a binary missing data indicator to the model
  • Biased under any missingness mechanism
  • But can still perform well in specific RCT situations

\(P(miss_{income}) = f(test_{pre})\)

\(P(miss_{income}) = f(test_{pre}, treat)\)

  • Attenuates reg coef of missing variables
  • Can be acceptable for imputing baseline covariates in RCTs
    • If P(miss) unrelated to treatment
    • Better power than listwise deletion but riskier
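A minimal sketch of the RCT case described above (a baseline covariate mean-imputed with a missing indicator, missingness unrelated to treatment); all numbers are illustrative assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 20_000
treat = rng.integers(0, 2, n)                          # randomized treatment
income = rng.normal(84, 20, n)                         # baseline covariate
test = 3.1 + 0.5 * treat + 0.1 * income + rng.normal(0, 3, n)

# Baseline income missing for ~30% of the sample, unrelated to treatment
miss = rng.random(n) < 0.3
df = pd.DataFrame({"test": test, "treat": treat,
                   "income": np.where(miss, np.nan, income)})
df["income_miss"] = df["income"].isna().astype(int)            # missing indicator
df["income_imp"] = df["income"].fillna(df["income"].mean())    # constant (mean) fill

# OLS on the imputed covariate plus the indicator; treatment coefficient ~0.5
X = np.column_stack([np.ones(n), df["treat"], df["income_imp"], df["income_miss"]])
coefs, *_ = np.linalg.lstsq(X, df["test"].to_numpy(), rcond=None)
print(dict(zip(["const", "treat", "income_imp", "income_miss"], coefs.round(3))))
```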

Regression imputation

  • Impute missing values with predicted values from a regression of the missing variable on the observed variables

Full Information Maximum Likelihood (FIML)

Normal distribution

Multivariate normal distribution

Maximum likelihood

  • We want to find the means, variances, and covariances that create distributions that most closely match our sample of data
  • Start with mean, variance, and covar parameters which create a multivariate normal distribution:

             Mean    Variance    Covar
Income       $84k    $400k       40
Test Score   11.5    12

  • Based on these, we can generate a multivariate normal curve

Everyone gets a likelihood

  • We compute a number called a likelihood for each observation in our sample
  • How high does each person sit on the multivariate normal curve, given their data?
  • Multiply everyone’s likelihood to get total likelihood score for that specific multivariate normal distribution
  • See if we can increase our likelihood score by changing the distribution parameters
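A minimal sketch of this likelihood computation, using the example parameters from the table above (scipy's multivariate normal density; the three observations are hypothetical, and income is treated as being in $1000s):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Candidate parameters from the example table (income in $1000s)
mean = np.array([84.0, 11.5])                # [income, test score]
cov = np.array([[400.0, 40.0],               # variances on the diagonal,
                [40.0, 12.0]])               # covariance off the diagonal

data = np.array([[90.0, 12.1],               # a few hypothetical observations
                 [60.0, 10.4],
                 [110.0, 13.0]])

mvn = multivariate_normal(mean=mean, cov=cov)
lik = mvn.pdf(data)                          # one likelihood per observation
total_loglik = np.log(lik).sum()             # log of the product of likelihoods
print(lik, total_loglik)

# Maximum likelihood = searching over (mean, cov) for the values that make
# this total (log-)likelihood as large as possible.
```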

But don’t we want regression coefficients?

  • Yes! The means, vars & covars are functions of regression parameters
  • Our regression equation is: \(test_i = \beta_0 + \beta_1income_i + \epsilon_i\)
  • Income now has its own equation too: \(income_i = \mu + e_i\)
  • Our means are:
    • Income: \(\mu\), the sample income mean
    • Test score: \(\beta_0 + \beta_1 \mu\)
  • Our variance & covariances are:
    • Income variance: \(\sigma^2_{e}\), the sample income variance
    • Income-test score covariance: \(\beta_1 \sigma^2_{e}\)
    • Test score variance: \(\beta_1^2 \sigma^2_{e} + \sigma^2_{\epsilon}\)
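As a hypothetical numerical check (the specific \(\beta\) and \(\sigma^2\) values below are not from the slides; they are chosen only to reproduce the earlier table, with income in $1000s): taking \(\mu = 84\), \(\sigma^2_e = 400\), \(\beta_0 = 3.1\), \(\beta_1 = 0.1\), and \(\sigma^2_{\epsilon} = 8\),

\[\begin{align} E[test] &= \beta_0 + \beta_1 \mu = 3.1 + 0.1 \times 84 = 11.5 \\ Cov(income, test) &= \beta_1 \sigma^2_e = 0.1 \times 400 = 40 \\ Var(test) &= \beta_1^2 \sigma^2_e + \sigma^2_{\epsilon} = 0.01 \times 400 + 8 = 12 \end{align}\]

which matches the mean, covariance, and variance in the example table.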

Full information maximum likelihood (FIML)

  • We can still compute a likelihood for a person even if they are missing some data
  • We use only their observed data
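A minimal sketch of this idea: a case missing the test score contributes a likelihood based on the marginal (univariate) normal for income implied by the same parameters. The observation values are hypothetical:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

mean = np.array([84.0, 11.5])                # [income, test score]
cov = np.array([[400.0, 40.0],
                [40.0, 12.0]])

# Complete case: likelihood from the full bivariate normal
full_lik = multivariate_normal(mean, cov).pdf([90.0, 12.1])

# Case missing the test score: use only the observed margin (income),
# i.e. the univariate normal implied by the same mean/variance parameters
partial_lik = norm(loc=mean[0], scale=np.sqrt(cov[0, 0])).pdf(90.0)

print(full_lik, partial_lik)   # both cases contribute to the total likelihood
```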

Partial data helps accuracy


Limitations of FIML

  • Assumes multivariate normal distribution for missing data
  • This is violated when:
    • Missing multicategorical predictors
    • Model contains nonlinearities (e.g., interactions, polynomials) w/ missing data

Multiple Imputation

Multiple imputation is Bayesian

  • Bayesian approach shares similarities with maximum likelihood
  • Both focused on parameters that characterize probability distributions
  • Bayes breaks down estimation into simpler individual steps
  • Estimates are samples drawn from probability distributions

Markov chain Monte Carlo (MCMC)

  • Algorithm that estimates each unknown (model parameters, missing data) one at a time, holding other quantities constant
  • The full cycle repeats thousands of times
  • Think of it like this:
    • Impute missing values, conditional on model parameters
    • Estimate model parameters, conditional on imputed data
    • Repeat
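A deliberately simplified sketch of this impute/estimate cycle (a proper MCMC would draw the parameters from their posterior rather than refitting them, which is what software like Blimp does; the data-generating values here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000
income = rng.normal(84, 20, n)
test_full = 3.1 + 0.1 * income + rng.normal(0, 3, n)
miss = rng.random(n) < 1 / (1 + np.exp(-(income - 84) / 10))    # MAR in test
test = np.where(miss, np.nan, test_full)

# Start from a crude fill, then alternate the two conditional steps
test_imp = np.where(miss, np.nanmean(test), test)
for cycle in range(200):
    # "Estimate model parameters, conditional on imputed data"
    b1, b0 = np.polyfit(income, test_imp, 1)
    resid_sd = np.std(test_imp - (b0 + b1 * income))
    # "Impute missing values, conditional on model parameters" (noise is added
    # so imputations are draws rather than deterministic predictions)
    test_imp = np.where(miss, b0 + b1 * income + rng.normal(0, resid_sd, n),
                        test_imp)

print(b0, b1)   # close to the data-generating intercept 3.1 and slope 0.1
```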

Missing data values are sampled

Missing data are imputed by randomly sampling values from probability distributions created by the model parameters, given the individual’s observed data

Multiple imputation

  • We can save fully imputed datasets created during these cycles
  • Then reanalyze them using frequentist statistics
  • Results will be equivalent to Bayes results
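Combining the analyses of the m saved datasets typically follows Rubin’s rules, which the slides do not show explicitly: average the estimates, and combine within- and between-imputation variability for the standard errors,

\[ \bar{\theta} = \frac{1}{m}\sum_{j=1}^{m}\hat{\theta}_j, \qquad T = W + \left(1 + \frac{1}{m}\right)B \]

where \(W\) is the average within-imputation variance and \(B\) is the variance of the \(\hat{\theta}_j\) across the \(m\) imputations.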

Types of multiple imputation

  • Model-agnostic: all model variables predict all others in a round-robin manner
    • MICE is the most popular model-agnostic algorithm
    • Limitations: cannot properly impute data with nonlinearities
  • Model-based: data imputed in context of regression model of interest
    • Most flexible type - can handle nonlinearities & multicategorical predictors
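A minimal sketch of round-robin imputation; it uses scikit-learn’s IterativeImputer as a convenient stand-in for a MICE-style algorithm rather than the MICE package itself, and the data-generating values are illustrative assumptions:

```python
import numpy as np
# IterativeImputer is still flagged experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(4)
n = 2_000
income = rng.normal(84, 20, n)
test = 3.1 + 0.1 * income + rng.normal(0, 3, n)
test[rng.random(n) < 0.3] = np.nan              # drop ~30% of test scores

X = np.column_stack([income, test])

# Round-robin imputation: each column with missing values is modeled from the
# others in turn; sample_posterior=True adds noise so repeated runs yield
# multiple distinct imputed datasets
imputed_sets = [
    IterativeImputer(sample_posterior=True, random_state=m).fit_transform(X)
    for m in range(5)
]
print(np.nanmean(X[:, 1]), np.mean([d[:, 1].mean() for d in imputed_sets]))
```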

Blimp

  • Software for Bayesian estimation and/or multiple imputation
  • Free, easy to use, amazing user guide, can call via an R package
  • Download here: appliedmissingdata.com/blimp

Advantages & limitations of Bayes/multiple imputation

  • Bayesian estimation is very flexible: it can handle non-normal data and basically any situation you can imagine

But…

  • Can take a long time for models to run
  • Convergence can be tricky to troubleshoot
  • Results are sensitive to algorithmic decisions

Recommendations for Use

When to use each method

Listwise Deletion (easy to use)

  • Missing Y only
  • RCT where baseline data collected pre-randomization
    • & power won’t be substantially harmed by reduced sample
    • & you only care about regression coefs
  • Always good to run for sensitivity

FIML (medium)

  • Anytime you meet the multivariate normal assumption, i.e. no missing multicategorical predictors or nonlinearities

Multiple Imputation (hard)

  • If you need to impute multicategorical predictors or interactions/polynomials
  • Anytime you don’t mind waiting for models to run

Caveats & special cases

  • Listwise deletion outperforms FIML/MI in two special cases [1]:
    • X is MNAR and independent of Y
    • Logistic regression where missing data are either dichotomous Y or X (but not both) & P(missing) depends only on Y
  • Factored regression can enable FIML to handle interactions & polynomials [2]
  • Auxiliary variables that predict missingness and Y but are not in the focal model boost performance of FIML/MI

Best method is situation dependent

  • Always important to consider potential missingness mechanisms
  • Best to take multiple approaches and test sensitivity of results
  • For help thinking through approaches in RCTs, check out this tool

Sensitivity Analysis

What if data are MNAR?

  • Untestable
  • But can “stress test” model params under MNAR assumptions
  • 2 approaches: pattern mixture model & selection model

Pattern mixture model

  • Assumption: people missing Y have different average Y values from those with observed Y
  • Model intercept is combo of the two groups

\[ test_i = \beta_0 + \beta_1 income_i + \epsilon_i \]

\[ \beta_0 = \beta_{0(obs)} \times p_{obs} + \beta_{0(miss)} \times p_{miss} \]

\(p\) = sample proportion

How does this assumption affect the intercept?

  • Goal: calculate \(\beta_0\) under this MNAR assumption
  • Compare to \(\beta_0\) from main model (that assumes MAR)

Estimation Step 1:
Add missing indicator to model

\[ test_i = \beta_{0(obs)} + \delta miss_i + \beta_1 income_i + \epsilon_i \]

\(miss_i\) = 1 if missing test data, 0 if observed test data

\(\delta = \beta_{0(miss)} - \beta_{0(obs)}\)

  • With \(\beta_{0(obs)}\) & \(\delta\), we can calculate \(\beta_0\)

\(\beta_0 = \beta_{0(obs)} \times p_{obs} + \beta_{0(miss)} \times p_{miss}\)

\(\beta_0 = \beta_{0(obs)} \times p_{obs} + (\beta_{0(obs)} + \delta) \times p_{miss}\)

  • Problem: no data to estimate \(\delta\)

Step 2: Plug in difference between groups & compute directly

  • Choose an effect-size difference in means between the groups and use it to compute \(\delta\)

\[ \delta = d \times \sigma_{test} \]

\(d\) = Cohen’s D effect size
\(\sigma_{test}\) = test score SD

\[ \delta = 0.2 \times 3 = 0.6 \]

Step 3: Estimate model with this fixed difference


\(test_i = \beta_{0(obs)} + \delta miss_i + \beta_1 income_i + \epsilon_i\)

\(test_i = \beta_{0(obs)} + 0.6 \times miss_i + \beta_1 income_i + \epsilon_i\)

  • This gives us an estimate of \(\beta_{0(obs)}\)

Step 4: Compute overall intercept


\[ \beta_0 = \beta_{0(obs)} \times p_{obs} + (\beta_{0(obs)} + \delta) \times p_{miss} \]

  • Compare to main imputation/FIML model \(\beta_0\) to see sensitivity to MNAR process
  • Check sensitivity to higher/lower effect size differences
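A worked example with hypothetical numbers (only \(\delta = 0.6\) comes from Step 2; the intercept estimate and missingness proportion below are assumed for illustration): suppose the Step 3 model gives \(\beta_{0(obs)} = 8.0\) and 25% of test scores are missing. Then

\[ \beta_0 = 8.0 \times 0.75 + (8.0 + 0.6) \times 0.25 = 8.15 \]

i.e. this MNAR assumption shifts the intercept by \(\delta \times p_{miss} = 0.15\) relative to \(\beta_{0(obs)}\).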

Same process for testing slope sensitivity

\[ test_i = \beta_0 + \beta_1 treat_i + \beta_2 income_i + \epsilon_i \]

Interact missing indicator w/ variable whose slope you want to test

\[\begin{align} test_i = & \textcolor[rgb]{0.20,0.70,0.20}{\beta_{0(obs)} + \delta_0 miss_i} + \textcolor[rgb]{0.00,0.00,1.00}{\beta_{1(obs)} treat_i +} \\ & \textcolor[rgb]{0.00,0.00,1.00}{\delta_1 miss_i \times treat_i} + \beta_2 income_i + \epsilon_i \end{align}\]

\(\textcolor[rgb]{0.20,0.70,0.20}{\delta_0}\) = ctrl group, missing vs observed Y mean diff
\(\textcolor[rgb]{0.00,0.00,1.00}{\delta_1}\) = treat group, missing vs observed ATE diff

ATE: Difference between the new (MNAR-adjusted) treatment & control group mixture means

Observed Mean Missing Mean
Ctrl \(\textcolor[rgb]{0.20,0.70,0.20}{\beta_{0(obs)}}\) \(\textcolor[rgb]{0.20,0.70,0.20}{\beta_{0(obs)} + \delta_0}\)
Treat \(\textcolor[rgb]{0.00,0.00,1.00}{\beta_{0(obs)} + \beta_{1(obs)}}\) \(\textcolor[rgb]{0.00,0.00,1.00}{\beta_{0(obs)} + \delta_0 + \beta_{1(obs)} + \delta_1}\)

\[ \textcolor[rgb]{0.20,0.70,0.20}{mean_{ctrl}} = \textcolor[rgb]{0.20,0.70,0.20}{\beta_0} = \textcolor[rgb]{0.20,0.70,0.20}{\beta_{0(obs)} \times p_{0(obs)}} + \textcolor[rgb]{0.20,0.70,0.20}{(\beta_{0(obs)} + \delta_0) \times p_{0(miss)}} \]

\[\begin{align} \textcolor[rgb]{0.00,0.00,1.00}{mean_{treat}} = & \textcolor[rgb]{0.00,0.00,1.00}{(\beta_{0(obs)} + \beta_{1(obs)}) \times p_{1(obs)}} + \\ & \textcolor[rgb]{0.00,0.00,1.00}{(\beta_{0(obs)} + \delta_0 + \beta_{1(obs)} + \delta_1) \times p_{1(miss)}} \end{align}\]

\[ ATE_{MNAR} = \textcolor[rgb]{0.00,0.00,1.00}{mean_{treat}} - \textcolor[rgb]{0.20,0.70,0.20}{mean_{ctrl}} \]
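A minimal sketch that plugs hypothetical values into these formulas (all inputs below are illustrative assumptions, not estimates from the slides):

```python
def ate_mnar(b0_obs, b1_obs, d0, d1, p0_miss, p1_miss):
    """ATE under the pattern-mixture MNAR assumption, per the formulas above."""
    mean_ctrl = b0_obs * (1 - p0_miss) + (b0_obs + d0) * p0_miss
    mean_treat = ((b0_obs + b1_obs) * (1 - p1_miss)
                  + (b0_obs + d0 + b1_obs + d1) * p1_miss)
    return mean_treat - mean_ctrl

# Hypothetical inputs: 20% missing Y in control, 30% in treatment,
# delta_0 = delta_1 = 0.6 (a 0.2 SD shift with SD = 3, as in Step 2)
print(ate_mnar(b0_obs=10.0, b1_obs=0.5, d0=0.6, d1=0.6,
               p0_miss=0.2, p1_miss=0.3))   # compare with the MAR-based ATE
```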

Conclusion

None of these methods are perfect

  • All make untestable assumptions
  • Important to think carefully about potential missingness mechanisms
  • Take multiple approaches based on different plausible assumptions

…and let me know if you want some help!

References

  1. van Buuren, S. (2018). 2.7 When not to use multiple imputation. Flexible Imputation of Missing Data. https://stefvanbuuren.name/fimd/sec-when.html

  2. Enders, C. K. (2023). Missing data: An update on the state of the art. Psychological Methods. https://doi.org/10.1037/met0000563