A Beginner’s Guide to Multivariate Imputation

Sara Zong · Published in Analytics Vidhya · Mar 21, 2021 · 6 min read

Source: https://atrium.ai/resources/learn-from-the-experts-the-consequences-of-missing-data/

Missing data is one of the most common problems a data scientist encounters in data analysis. A couple of quick solutions for dealing with missing values are to remove the observations with missing values from the dataset, or to fill in the missing values with the mean, median, or mode. However, how good are these quick fixes? Can we do better? In this article, I am going to (1) give a quick introduction to the different types of missing values, (2) visualize missing values, (3) implement multivariate imputation with scikit-learn, (4) test the imputed datasets, and (5) draw conclusions.

Categorizing missing data

Missing values can be separated into three categories: (1) missing completely at random (MCAR), (2) missing at random (MAR), and (3) missing not at random (MNAR). The way I like to think about them is this — when values are MCAR, the missingness has no relationship with either the observed or the unobserved values. For example, patients miss their appointments for their own unrelated reasons, such as mis-remembering appointment times, having to pick up their kids, etc. When values are MAR, the missingness is related to the observed data, but not to the missing data itself. In other words, the theoretical values of the missing slots do NOT cause them to be missing. For example, multiple patients could miss their appointments because of a snow storm in the region. Finally, when values are MNAR, the missingness is related to the missing values themselves or to another unobserved variable. For example, patients could miss follow-up appointments because the initial treatment caused adverse effects that made them feel too sick to attend, so the adverse-effect measurements are missing for exactly those patients.

In the cases of MCAR and MAR, we can remove or impute the missing values, but there is no good way to deal with MNAR. Assuming we have looked into why the data is missing and concluded that the missing values we are dealing with belong to the first two cases, should we just remove them or impute them with the mean, median, or mode? Or is there a better way to fill in the missing values?

If we simply remove observations with missing values from our data set, we are losing valuable information. That might not be a concern when the data set is large (hundreds of observations, with missing values making up only a small portion), but it can be problematic when you are working with a small data set (such as the iris data set I use as an example below, which has only 150 observations). What about imputing the missing values of a variable with the mean, median, or mode of that same variable (the technical term is univariate imputation)? Imagine you have a data set on housing prices, and two of the variables are size in square feet and number of bedrooms. Suppose that, on average, a house is 1,500 square feet and has two bedrooms. Doesn't it feel weird to fill in the number of bedrooms for a 5,000-square-foot mansion with a two? It turns out we have the option to fill in missing values based on information from the other variables in the data set (multivariate imputation). This can easily be done with IterativeImputer from scikit-learn.
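To make the contrast concrete, here is a minimal sketch on a made-up housing data frame (the column names and numbers are hypothetical, purely for illustration): univariate mean imputation fills the missing bedroom count with the column mean, while IterativeImputer estimates it from the house's size.

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# hypothetical housing data: the 5,000 square-foot mansion has a missing bedroom count
houses = pd.DataFrame({
    "sqft": [800, 1200, 1500, 1800, 5000],
    "bedrooms": [1, 2, 2, 3, np.nan],
})

# univariate imputation: fills with the column mean (2.0), ignoring square footage
print(SimpleImputer(strategy="mean").fit_transform(houses)[-1])

# multivariate imputation: estimates bedrooms from sqft, so the mansion
# gets a much larger value than the column mean
print(IterativeImputer(random_state=0).fit_transform(houses)[-1])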

Visualizing missing data (yep, we can see them even when they are missing)

To demonstrate the imputation process, I use the iris data set from scikit-learn.

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(data=np.c_[iris["data"], iris["target"]],
                  columns=iris["feature_names"] + ["species"])

It’s a small data set containing 150 observations and four feature variables: sepal length, sepal width, petal length, and petal width. The target variable is the species of the irises. The data set comes with no missing values, so I randomly introduced 15 missing values into sepal length and 10 into petal width.

# work on a copy so the original data frame stays complete
df_miss = df.copy()
np.random.seed(123)
# randomly pick row positions to blank out
mask = np.random.randint(0, 150, size=15)
mask2 = np.random.randint(0, 150, size=10)
df_miss.loc[mask, "sepal length (cm)"] = np.nan
df_miss.loc[mask2, "petal width (cm)"] = np.nan

To visualize the missing values in the data set, we can use the matrix function from the missingno module.

import missingno as msno
msno.matrix(df_miss, figsize=(10, 6))

Each of the five black rectangular blocks represents a variable from the data set, and the white strips inside the first and the fourth blocks represent the missing values for the variables. The heatmap function from missingno shows the correlation between the missing values of the variables. It helps us understand whether the missing values of different variables relate to each other and to what extent.
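For example, it can be called just like the matrix function:

msno.heatmap(df_miss, figsize=(10, 6))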

Imputing missing data with IterativeImputer

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(max_iter=10, random_state=123)
# pull out the non-target variables for imputation and keep the result in its own data frame
df_br = df_miss.copy()
df_br.iloc[:, 0:4] = imputer.fit_transform(df_miss.iloc[:, 0:4])

When we impute the missing values, we first need to exclude the target variable, which in the iris data set is the species of the irises. The reason is that we don't want to use the target variable to estimate the missing values of the other variables, and then use the imputed data set to predict that same target when we build our classification model. With only a few lines of code (as shown in the previous code block), we now have a complete data set, with the missing values estimated by a regressor fitted on the non-target variables. One thing I want to point out is that the default regressor for IterativeImputer is BayesianRidge, but you can easily specify a regressor of your choice, which I will show in the next code block. More details on IterativeImputer can be found here.

# initialize the RandomForest regressor
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=10, random_state=123)
# specify the estimator to be rf and impute into a separate data frame
rf_imputer = IterativeImputer(estimator=rf, random_state=123)
df_rf = df_miss.copy()
df_rf.iloc[:, 0:4] = rf_imputer.fit_transform(df_miss.iloc[:, 0:4])

Testing the imputed data sets

Now that we have our two imputed data sets, the original data set (imported directly from scikit-learn, with no missing values), and the data set with the missing observations removed, we can see how these data sets perform in model fitting.
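For the drop-missing data set, a minimal sketch (assuming simple listwise deletion of any row that contains a NaN) would be:

# remove every row that contains at least one missing value
drop_missing = df_miss.dropna()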

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

datasets = [df, drop_missing, df_br, df_rf]
dfnames = ["original", "drop missing", "BayesianRidge imputed", "RandomForest imputed"]
logit = LogisticRegression(C=0.9, max_iter=200)

for i in range(len(datasets)):
    X_train, X_test, y_train, y_test = train_test_split(
        datasets[i].iloc[:, 0:4], datasets[i]["species"],
        test_size=0.3, random_state=123)
    logit.fit(X_train, y_train)
    print(f"Scores for the {dfnames[i]} dataset are")
    print("Training: {:6.2f}%".format(100 * logit.score(X_train, y_train)))
    print("Test: {:6.2f}%".format(100 * logit.score(X_test, y_test)))

I did a simple train-test split with all the data sets and then fit a logistic regression model to classify the iris species. Here is the result:
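(Output reconstructed from the accuracy figures discussed below.)

Scores for the original dataset are
Training:  98.10%
Test:      93.33%
Scores for the drop missing dataset are
Training:  96.63%
Test:      97.44%
Scores for the BayesianRidge imputed dataset are
Training:  98.10%
Test:      93.33%
Scores for the RandomForest imputed dataset are
Training:  98.10%
Test:      93.33%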

As we can see from the results, model performance on the original data set and the two imputed data sets is identical, with a training accuracy of 98.10% and a test accuracy of 93.33%. However, when we simply remove the missing values, we get a training accuracy of 96.63% and a test accuracy of 97.44%. The focus here is not how high or low the accuracy is, but whether the data sets perform similarly in model building. In this example, the imputed data sets clearly behave more like the original data set.

Conclusion

The example above shows that multivariate imputation can be an easy-to-implement and effective way to deal with missing data. I hope you find this article helpful, and that you will try multivariate imputation in your data analysis routine if it is not already part of it.

Thank you for reading through this article! I would love to hear your feedback. The code for this article can be found here. The link to my GitHub can be found here.
