Tuesday 4 August 2015

Lesson Prototype - First Lecture on Multiple Imputation

Continuing the work on my data management course, here's the start of the Imputation unit. I'm intending three units in total - Text processing, principles of data manipulation (cleaning, merging, formatting, and database access), and imputation.




The companion reading for this lesson is Chapter 1 of Flexible Imputation of Missing Data, by Stef van Buuren, which is referenced in some places and is drawn from heavily to make these notes. As this is just a prototype, I have not yet filled in all the details or included the iris_nasty dataset, which is just the iris dataset with random values deleted.

Structure:
1) Motivation for learning imputation
2) Types of missingness
3) A few simple methods that build up towards multiple imputation

--------------------
Hook: The frictionless vacuum of data


Consider the iris dataset in base R:


data(iris)
iris[c(1:3,148:150),]

Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1            5.1         3.5          1.4         0.2    setosa
2            4.9         3.0          1.4         0.2    setosa
3            4.7         3.2          1.3         0.2    setosa
148          6.5         3.0          5.2         2.0 virginica
149          6.2         3.4          5.4         2.3 virginica
150          5.9         3.0          5.1         1.8 virginica

any(is.na(iris))
[1] FALSE

dim(iris)
[1] 150   5


A dataset like this is the equivalent of a frictionless vacuum in a physics problem: ideal for a variety of methods and for teaching principles but not commonly found in 'real' industrial settings. More likely is something like this:

iris_nasty = read.csv("iris_nasty.csv", as.is=TRUE)
 iris_nasty[c(1:3,148:150),]
 
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1             NA         3.5          1.4          NA    setosa
2            4.9          NA           NA         0.2    setosa
3            4.7         3.2          1.3          NA    setosa
148          6.5         3.0          5.2          NA virginica
149          6.2          NA          5.4         2.3 virginica
150          5.9         3.0           NA          NA virginica


Missing values, shown as "NA" (as in Not Available) like the ones in the nasty version of the iris dataset above, could come from any number of problems. In a biological setting, it could be a failure of the measuring device. If it was a live specimen, the measurement may not have been possible without harming the specimen, or without trampling a lot of ground in order to reach it.

If this data were from a social survey, missingness could come from questions that were skipped over or refused by the respondent. The data points could also be missing because they were not relevant to the respondent, such as questions about prostate cancer being asked of a woman. If the survey was done over the internet, questions could be missed due to a connection problem.
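Since iris_nasty.csv isn't included with this prototype, here is a rough sketch of how a stand-in could be built from iris. The name iris_demo, the seed, and the choice to delete exactly 60 values from each measurement column are my assumptions for illustration; the real iris_nasty file may have been built differently.

```r
# Hypothetical stand-in for iris_nasty.csv (not the actual file).
set.seed(2015)                       # arbitrary seed so the sketch is repeatable
iris_demo <- iris
iris_demo$Species <- as.character(iris_demo$Species)  # mimics as.is=TRUE

# Assumption: 60 of the 150 values in each measurement column are lost,
# chosen without regard to any variable in the data.
for (v in names(iris_demo)[1:4]) {
  iris_demo[sample(nrow(iris_demo), 60), v] <- NA
}

iris_demo[c(1:3, 148:150), ]         # same six rows shown above
anyNA(iris_demo)                     # TRUE: the data is no longer 'frictionless'
```

The Species column is left untouched here only because none of the six rows shown above has a missing species.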

------------
Types of missingness


Missingness patterns are classified into three types: MCAR, MAR, and MNAR.

MCAR: Missing Completely At Random - The chance of any given value being missing has nothing to do with any meaningful variable. This is the least problematic to deal with, because methods of imputation (filling in missing values) can work by imputing on one variable at a time, rather than imputing on the whole dataset at once. Missingness related to internet connection problems is reasonably assumed to be MCAR.

MAR: Missing At Random - The chance of any given value being missing depends ONLY on other variables that are in the dataset. This is more problematic to impute because there is the possibility of cases that are missing two variables in which the missingness of one variable depends on the other, which is itself missing. A possibility could be the results of someone's most recent prostate exam being missing in a dataset that also includes age, sex, and a few other factors that explain most or all of the reasons why such data wouldn't be available.

MNAR: Missing Not At Random - The chance of any given value being missing depends in some way on variables that are NOT included in the dataset. This is by far the most problematic of the missingness patterns because there is no way to know why something is missing, and therefore any assumptions about the missing values are conjecture and prone to bias. Van Buuren, author of "Flexible Imputation of Missing Data", uses an example of a weight scale that is wearing out over time, especially after heavy objects are weighed, and occasionally failing to measure something. If we don't know what's been weighed first or last, then we can't accurately estimate the chances of any given value being missing.

My favourite example is the 2011 Canadian Long-Form Census, which was voluntary. About a third of the forms that were sent out were not returned, and there is no reasonable way to account for them because certain demographics are less likely to return voluntary surveys. Unfortunately, the information about which demographics these are is exactly the kind of information that is missing.
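To make the three types concrete, here is a small simulation sketch using the complete iris data. The logistic form of the missingness probabilities and the 30% rate are assumptions chosen for illustration only. For MNAR, the sketch uses the common special case where the unrecorded information is the missing value itself.

```r
set.seed(1)                          # arbitrary seed
n <- nrow(iris)
x <- iris$Sepal.Length               # always observed in this sketch
y <- iris$Petal.Length               # the variable that will lose values

# MCAR: every value has the same 30% chance of being missing.
y_mcar <- ifelse(runif(n) < 0.3, NA, y)

# MAR: the chance of y being missing depends only on x, which we still have.
y_mar <- ifelse(runif(n) < plogis(x - mean(x)), NA, y)

# MNAR: the chance of y being missing depends on y itself,
# which is exactly the information we lose.
y_mnar <- ifelse(runif(n) < plogis(y - mean(y)), NA, y)

# Compare the mean of what remains against the full-data mean.
sapply(list(full = y, MCAR = y_mcar, MAR = y_mar, MNAR = y_mnar),
       mean, na.rm = TRUE)
```

In the MAR case, the distortion can in principle be corrected by conditioning on Sepal.Length. In the MNAR case, nothing left in the dataset explains which values went missing.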


There are methods for telling MCAR data from MAR data, which we will cover later in this unit. However, there is by definition no way to tell MAR data from MNAR data because that would involve conditioning on information that is not available. Unless there is cause to believe that some data of interest is MNAR, the default assumption is that data is MAR instead.



--------------
Dealing with missing data - building up to multiple imputation


There are many ways to impute the missing data, that is, to replace the NA spaces with values that can be used for analysis. Starting with the most naive and working up:


1) Remove all cases that have a missing value.

length(which(is.na(iris_nasty)))

[1] 243

dim(iris_nasty)
[1] 150   5

dim(na.omit(iris_nasty))
[1] 16  5

Although only 32% of the data values (243 of 750) are missing, 89% of the cases (134 of 150) contain at least one missing value. To ignore all of those cases means throwing out a lot of data that was collected and recorded. This is not a preferable solution.
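The same bookkeeping can be sketched on a stand-in dataset. Here iris_demo is a hypothetical substitute for iris_nasty (60 values deleted at random from each measurement column, which is my assumption), so the counts will not match the ones above exactly.

```r
# Hypothetical stand-in for iris_nasty: 60 random deletions per numeric column.
set.seed(1)
iris_demo <- iris
for (v in names(iris)[1:4]) iris_demo[sample(150, 60), v] <- NA

sum(is.na(iris_demo))                # number of missing values
mean(is.na(iris_demo))               # proportion of all values that are missing
colSums(is.na(iris_demo))            # missing values per variable

# Listwise deletion keeps only the rows with no missing values at all.
complete_rows <- complete.cases(iris_demo)
sum(complete_rows)                   # cases that survive
mean(!complete_rows)                 # proportion of cases thrown away
dim(na.omit(iris_demo))              # same rows, obtained via na.omit()
```

The gap between the proportion of missing values and the proportion of discarded cases is the cost of listwise deletion.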


2) Remove cases with missing values for the variables used.


mod1 = lm(Petal.Length ~ Petal.Width, data=iris_nasty)
length(mod1$residuals)
[1] 94


mod2 = lm(Petal.Length ~ ., data=iris_nasty)
length(mod2$residuals)
[1] 16

This method of dealing with missing data is the default behaviour of the linear model function lm() in R. Depending on the model used, this means losing between 56/150 and 134/150 of the sample. That's at least as good as the 'listwise deletion' option suggested before this, because it never discards more cases than listwise deletion does. However, a comparability problem has been introduced:

round(mod1$coef,2)

(Intercept) Petal.Width
       1.12        2.23


round(mod2$coef,2)

      (Intercept)      Sepal.Length       Sepal.Width       Petal.Width Speciesversicolor
            -0.32              0.48             -0.21              0.53              1.67
 Speciesvirginica
             2.12


Between these two models, the coefficients are different (try for yourself to see if these differences are significant). From the simple linear model to the full model, the intercept changes from 1.12 to -0.32, and the slope for Petal.Width changes from 2.23 to 0.53.

If these two models were using the same dataset, we could interpret the differences between these coefficients as the effect of holding the Sepal variables constant and fixing the Species variable in the full model. But the full model was developed using a sample of 16 units, whereas the simple model uses 94 units. Are the differences in the model coefficients attributable to the sample units that were omitted or to the variables that were included, or to both, or to an interaction? There's no way to know.
This confusion also highlights why we need to impute data in the first place, rather than ignore it when it is missing.
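One way to see the comparability problem is to count the rows each model actually uses, and then refit both models on a common set of rows. This is only a sketch on a hypothetical stand-in for iris_nasty (iris_demo, built with an assumed 60 random deletions per measurement column), so the numbers will differ from those above.

```r
# Hypothetical stand-in for iris_nasty: 60 random deletions per numeric column.
set.seed(1)
iris_demo <- iris
for (v in names(iris)[1:4]) iris_demo[sample(150, 60), v] <- NA

# lm() silently drops rows with NA in any variable that the model uses.
mod1 <- lm(Petal.Length ~ Petal.Width, data = iris_demo)
mod2 <- lm(Petal.Length ~ ., data = iris_demo)
c(simple = nobs(mod1), full = nobs(mod2))   # different samples

# Refit both models on the rows that are complete for every variable.
cc <- iris_demo[complete.cases(iris_demo), ]
mod1_cc <- lm(Petal.Length ~ Petal.Width, data = cc)
mod2_cc <- lm(Petal.Length ~ ., data = cc)
c(simple = nobs(mod1_cc), full = nobs(mod2_cc))   # same sample
```

Refitting on common rows makes the coefficients comparable again, but only by shrinking the simple model down to the smaller sample, which is the data loss we were trying to avoid.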


3) Mean-value imputation: Take the mean of all known values for a variable, and use that value for all missing values for that variable.

This solves the problems from the row deletion solutions above, in that it would allow the use of all 150 cases in the iris_nasty dataset. However, it fails to reflect any sort of variation in the values being imputed. For example, what if Sepal.Length is different for each species? The mean-value imputed values for Sepal.Length will be the same for all the species, which poorly reflects the trends in the data.

(Figure from van Buuren showing how limiting this is for a simple regression)
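A minimal sketch of mean-value imputation, again on a hypothetical stand-in for iris_nasty (iris_demo, with an assumed 60 random deletions per measurement column):

```r
# Hypothetical stand-in for iris_nasty: 60 random deletions per numeric column.
set.seed(1)
iris_demo <- iris
for (v in names(iris)[1:4]) iris_demo[sample(150, 60), v] <- NA

# Replace each missing value with the mean of the observed values
# for that variable.
imp_mean <- iris_demo
for (v in names(imp_mean)[1:4]) {
  missing <- is.na(imp_mean[[v]])
  imp_mean[[v]][missing] <- mean(imp_mean[[v]], na.rm = TRUE)
}

anyNA(imp_mean)                                  # FALSE: all 150 cases usable

# The imputed column is less spread out than the observed values were.
sd(iris_demo$Sepal.Length, na.rm = TRUE)
sd(imp_mean$Sepal.Length)

# Every species receives the same imputed Sepal.Length.
tapply(imp_mean$Sepal.Length[is.na(iris_demo$Sepal.Length)],
       imp_mean$Species[is.na(iris_demo$Sepal.Length)], unique)
```

The drop in standard deviation is the lack of variation described above: 60 of the 150 values now sit exactly at the mean.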

4) Model-based imputation.
Van Buuren refers to this as regression-based imputation in Chapter 1, but the bigger picture is that imputed values can be those predicted from a model, such as a regression model.

Like the mean-value imputation, this solution allows all the cases to be used. It also reflects the average trends of the data. However, it does NOT reflect the variance unaccounted for by the model. That is, the uncertainty inherent in the imputed values isn't shown; all the imputed values will be the average of what those values are likely to be.

For each value on its own, taking these model-predicted values as the imputations is ideal. After all, a model prediction reflects our best guess as to what a value really is. However, when you take all the imputed values together, the trends in the data are unrealistically reinforced. If we do this with the missing Petal.Width and Petal.Length values, then the estimation of the correlation between these variables becomes ______. Compare that to the correlation found using the 94 rows of data that don't have missing values: ______ , or of the 150 rows of data in the original iris data set without imputations: ________.
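Here is a rough sketch of regression-based imputation for Petal.Length, using Petal.Width as the only predictor. It runs on a hypothetical stand-in for iris_nasty (iris_demo, with an assumed 60 random deletions per measurement column), so its correlations are not the ones left blank above. Rows missing both variables are left alone in this sketch.

```r
# Hypothetical stand-in for iris_nasty: 60 random deletions per numeric column.
set.seed(1)
iris_demo <- iris
for (v in names(iris)[1:4]) iris_demo[sample(150, 60), v] <- NA

# Fit the model on rows where both variables are observed.
fit <- lm(Petal.Length ~ Petal.Width, data = iris_demo)

# Impute Petal.Length where it is missing but Petal.Width is available.
need <- is.na(iris_demo$Petal.Length) & !is.na(iris_demo$Petal.Width)
imp_reg <- iris_demo
imp_reg$Petal.Length[need] <- predict(fit, newdata = iris_demo[need, ])

# Correlation before and after imputation, and in the original iris data.
cor_observed <- cor(iris_demo$Petal.Length, iris_demo$Petal.Width,
                    use = "complete.obs")
cor_imputed  <- cor(imp_reg$Petal.Length, imp_reg$Petal.Width,
                    use = "complete.obs")
cor_original <- cor(iris$Petal.Length, iris$Petal.Width)
c(observed = cor_observed, imputed = cor_imputed, original = cor_original)
```

Because every imputed point lies exactly on the fitted line, the correlation after imputation cannot be weaker than the one from the observed pairs. That is the unrealistic reinforcement described above.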


(Figure from van Buuren showing how limiting this is too for a simple regression)
(or... a scatterplot figure of the model from iris_nasty)

One set of methodology questions comes up with model-based imputation that we will address in a later lesson: If there are missing values for multiple variables in a case in the dataset, which of the variables do you impute first? Do you use the imputed value from one variable to inform predictions for imputations on other variables, or do you only use complete cases as a basis for imputation?


5) Stochastic regression-based imputation.

This is the same as regression or model-based imputation, except that some random noise is added to each predicted value to produce an imputed value.

The distribution of the random noise is determined by the amount of uncertainty in the model. For example, a prediction from a very strong regression would have only a small amount of noise added as compared to a prediction from a weak regression. Also, if a linear regression were the model used for prediction, the noise would come from a Gaussian distribution as per the standard regression assumptions.

Adding this noise makes the distribution of imputed values resemble that of the known values, instead of being underdispersed like it is in simple model-based imputation. However, the results from the imputed dataset are now sensitive to the random seed used to generate the noise, and the strength of correlations is now understated. Using stochastic regression-based imputation, 95% of correlations for model 1 in the iris_nasty dataset were between ____ and ______.

Finally, after the noise is added, the imputed missing values are taken with as much certainty as the directly observed values. There is no distinction between values that were originally missing and those that were not.
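A sketch of stochastic regression-based imputation, repeated over many seeds to show how much the result depends on the seed. It runs on a hypothetical stand-in for iris_nasty (iris_demo, with an assumed 60 random deletions per measurement column). The noise here uses only the residual standard deviation of the fitted model and ignores uncertainty in the regression coefficients themselves, which is a simplification.

```r
# Hypothetical stand-in for iris_nasty: 60 random deletions per numeric column.
set.seed(1)
iris_demo <- iris
for (v in names(iris)[1:4]) iris_demo[sample(150, 60), v] <- NA

fit      <- lm(Petal.Length ~ Petal.Width, data = iris_demo)
need     <- is.na(iris_demo$Petal.Length) & !is.na(iris_demo$Petal.Width)
pred     <- predict(fit, newdata = iris_demo[need, ])
resid_sd <- summary(fit)$sigma       # residual standard deviation of the model

# Impute with Gaussian noise added, then return the resulting correlation.
cor_after_noise <- function(seed) {
  set.seed(seed)
  imp <- iris_demo
  imp$Petal.Length[need] <- pred + rnorm(sum(need), mean = 0, sd = resid_sd)
  cor(imp$Petal.Length, imp$Petal.Width, use = "complete.obs")
}

cors <- sapply(1:200, cor_after_noise)   # 200 seeds is an arbitrary choice
quantile(cors, c(0.025, 0.975))          # middle 95% of the correlations
```

The width of that interval is the seed sensitivity mentioned above: one imputed dataset gives one draw from it, with no hint of how far off that draw might be.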


6) Multiple imputation.
This method is the one we will be using for the rest of this unit. To do multiple imputation, take predictions of values and add random noise as we did in stochastic regression-based imputation. However, do this independently for several (e.g. 3-10) copies of the dataset.

Each copy of the dataset will have the same values for everything that was observed directly, but different values for anything that was imputed. The average of the imputed values will still be close to the predicted value, but now the differences in the imputed values between datasets reflect the uncertainty we have in these imputed values.

(Below are three copies of those six lines from iris_nasty, but with imputations of the missing values shown in bold and red)
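Until that table is filled in, here is a rough sketch of how three such copies could be generated. It uses a hypothetical stand-in for iris_nasty (iris_demo, with an assumed 60 random deletions per measurement column) and imputes only Petal.Length from Petal.Width to keep things short. The helper name impute_once is mine.

```r
# Hypothetical stand-in for iris_nasty: 60 random deletions per numeric column.
set.seed(1)
iris_demo <- iris
for (v in names(iris)[1:4]) iris_demo[sample(150, 60), v] <- NA

fit  <- lm(Petal.Length ~ Petal.Width, data = iris_demo)
need <- is.na(iris_demo$Petal.Length) & !is.na(iris_demo$Petal.Width)

# One stochastic imputation: prediction plus fresh Gaussian noise.
impute_once <- function() {
  imp <- iris_demo
  imp$Petal.Length[need] <- predict(fit, newdata = iris_demo[need, ]) +
    rnorm(sum(need), mean = 0, sd = summary(fit)$sigma)
  imp
}

copies <- replicate(3, impute_once(), simplify = FALSE)

# Petal.Length from each copy, side by side, for the six rows used earlier.
observed <- !is.na(iris_demo$Petal.Length)
side_by_side <- sapply(copies, function(d) d$Petal.Length)
cbind(observed = observed, round(side_by_side, 2))[c(1:3, 148:150), ]
```

Rows flagged as observed agree across all three columns. Imputed rows differ from copy to copy, and rows where Petal.Width is also missing stay NA in this simplified version.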

(figure of what's going on... one dataset to many, many datasets to many analyses, combined to one)

To analyse such a meta-dataset, we perform the same analysis on each copy of the dataset, and then combine the estimates from each result, along with their uncertainty, using Rubin's Rules. This is shown as a black-box process for the three imputations of the iris_nasty dataset below.

(example of lm from each imputation)

(rubin.rules() black box for now which spits out parameter estimates and uncertainty)
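As a placeholder for that black box, here is a minimal sketch of the whole pipeline for one coefficient. It uses a hypothetical stand-in for iris_nasty (iris_demo, with an assumed 60 random deletions per measurement column), m = 5 imputations, and a simplified imputation step that does not redraw the regression parameters for each copy. The pooling step is the standard form of Rubin's Rules: average the estimates, average the within-imputation variances, and add the between-imputation variance inflated by (1 + 1/m).

```r
# Hypothetical stand-in for iris_nasty: 60 random deletions per numeric column.
set.seed(1)
iris_demo <- iris
for (v in names(iris)[1:4]) iris_demo[sample(150, 60), v] <- NA

fit  <- lm(Petal.Length ~ Petal.Width, data = iris_demo)
need <- is.na(iris_demo$Petal.Length) & !is.na(iris_demo$Petal.Width)
m    <- 5                             # number of imputed copies (assumed)

# Impute one copy, analyse it, and return the slope with its variance.
one_analysis <- function(i) {
  imp <- iris_demo
  imp$Petal.Length[need] <- predict(fit, newdata = iris_demo[need, ]) +
    rnorm(sum(need), mean = 0, sd = summary(fit)$sigma)
  s <- summary(lm(Petal.Length ~ Petal.Width, data = imp))$coefficients
  c(est = s["Petal.Width", "Estimate"],
    var = s["Petal.Width", "Std. Error"]^2)
}
res <- sapply(seq_len(m), one_analysis)   # 2 x m matrix: rows est, var

# Rubin's Rules for a single parameter.
q_bar <- mean(res["est", ])           # pooled estimate
u_bar <- mean(res["var", ])           # within-imputation variance
b     <- var(res["est", ])            # between-imputation variance
t_var <- u_bar + (1 + 1 / m) * b      # total variance

c(estimate = q_bar, std.error = sqrt(t_var))
```

The pooled standard error is never smaller than the average single-copy standard error, because the between-imputation term carries the extra uncertainty that comes from not having observed the values directly.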

Rubin's Rules deserve an entire lesson, so that's where we'll pick up next time we talk about imputation.


Previous Lesson Prototype on Text - Regular Expression
