Reputation: 316
I have one million samples and about 1000 features. However, only a subset of the features is measured for each sample. I want to use machine learning to predict the result from the features, but I do not know how to handle the missing data. Since the data are missing at random positions, I cannot group the samples by which features are missing: the number of groups would be huge and each group would contain only a few samples. What is the best way to handle this kind of problem?
Upvotes: 1
Views: 3466
Reputation: 2628
Your problem is a common case in data analysis and machine learning. While it is hard to tell exactly how to resolve your problem without knowing the data, what you want to predict, or the models you are considering (e.g. generative or discriminative), I will try to give you some pointers.
References
First, some references: I found [Benjamin Marlin's PhD Thesis](http://www.cs.ubc.ca/~bmarlin/research/phd_thesis/marlin-phd-thesis.pdf) to be a good place to start. I haven't read the full thesis but have come across it a couple of times. It might give you a quick start on the matter. There is also the book "Statistical Analysis with Missing Data" by Little and Rubin, which might be useful for you. There is a vast body of literature on the topic, and this review may help you get an overview: A Review of Methods for Missing Data (the review uses a research study on asthma symptoms as its example, but the approaches may still be useful to you). Besides the literature, there is also a Wikipedia page on Missing Data that might provide some basic insights.
Summary
Some simple approaches to get you started:
Overall, there are many valid approaches and it depends strongly on your task/application. Still, start by determining why the data is missing and what data is missing. Then, follow some of the references and start trying out simple approaches to see what works for you.
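As a first step towards finding out what data is missing, the missingness itself can be summarized. The following is only a minimal sketch with pandas; the toy DataFrame and the column names `f1` to `f3` are made up for illustration, and `NaN` is assumed to mark a feature that was not measured.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows are samples, columns are features,
# NaN marks a feature that was not measured for that sample.
df = pd.DataFrame({
    "f1": [1.0, np.nan, 3.0, 4.0],
    "f2": [np.nan, np.nan, 1.0, 2.0],
    "f3": [5.0, 6.0, 7.0, 8.0],
})

# Fraction of missing values per feature.
missing_per_feature = df.isna().mean()

# Number of measured features per sample.
measured_per_sample = df.notna().sum(axis=1)

# How often each distinct missingness pattern occurs (True = missing),
# sorted from most to least frequent.
pattern_counts = df.isna().value_counts()

print(missing_per_feature)
print(measured_per_sample)
print(pattern_counts)
```

Features that are almost never measured, or a small number of dominant missingness patterns, would both be useful things to know before choosing a method.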
Upvotes: 2
Reputation: 12609
Methods to treat missing values
1. Deletion:
There are two types: list-wise deletion and pair-wise deletion.
In list-wise deletion, we delete every observation in which any of the variables is missing. Simplicity is the major advantage of this method, but it reduces the power of the model because it reduces the sample size.
In pair-wise deletion, we perform each analysis with all cases in which the variables of interest are present. The advantage of this method is that it keeps as many cases as possible available for analysis. The disadvantage is that it uses a different sample size for different variables.
Deletion methods are appropriate when the data are “missing completely at random”; otherwise, non-random missing values can bias the model output.
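The two deletion variants can be sketched with pandas as follows. This is an illustration under assumptions, not a recipe: the DataFrame and the column names `a`, `b`, `c` are invented, and a correlation matrix is used as the example of an analysis done pair-wise.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing entries (NaN).
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "b": [2.0, np.nan, 6.0, 8.0, 10.0],
    "c": [1.0, 0.0, 1.0, np.nan, 0.0],
})

# List-wise deletion: keep only the rows without any missing value.
listwise = df.dropna()

# Pair-wise deletion: DataFrame.corr excludes missing values pair by pair,
# so each coefficient uses every row in which both columns are present.
pairwise_corr = df.corr()

# Sample size actually used for each pair of variables.
observed = df.notna().astype(int)
pairwise_n = observed.T @ observed

print(len(listwise), "of", len(df), "rows survive list-wise deletion")
print(pairwise_n)
```

With about 1000 features of which only a subset is measured per sample, as in the question, list-wise deletion would likely discard almost every row, so checking `len(listwise)` first is worthwhile.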
2. Mean/ Mode/ Median Imputation:
Imputation is a method of filling in the missing values with estimated ones. The objective is to use known relationships that can be identified in the valid values of the data set to help estimate the missing values. Mean / mode / median imputation is one of the most frequently used methods. It consists of replacing the missing data for a given attribute by the mean or median (quantitative attribute) or mode (qualitative attribute) of all known values of that variable. It comes in two types:
Generalized imputation: In this case, we calculate the mean or median of all non-missing values of the variable and replace the missing values with it. For example, in the source article's table the variable “Manpower” has a missing value, so we take the average of all non-missing values of “Manpower” (28.33) and replace the missing value with it.
Similar case imputation: In this case, we calculate the average of the non-missing values separately for gender “Male” (29.75) and “Female” (25), then replace each missing value based on gender. For “Male”, we replace missing values of “Manpower” with 29.75, and for “Female” with 25.
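Both variants can be sketched in a few lines of pandas. The table from the source is not reproduced here, so the values below are hypothetical and will not give the 28.33, 29.75 and 25 quoted above; only the column names `gender` and `manpower` follow the example.

```python
import numpy as np
import pandas as pd

# Hypothetical values, not the table from the source article.
df = pd.DataFrame({
    "gender": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "manpower": [30.0, np.nan, 20.0, 30.0, 40.0, np.nan],
})

# Generalized imputation: one mean over all non-missing values.
overall_mean = df["manpower"].mean()  # NaN is skipped by default
generalized = df["manpower"].fillna(overall_mean)

# Similar case imputation: mean of the non-missing values within each gender.
group_mean = df.groupby("gender")["manpower"].transform("mean")
similar_case = df["manpower"].fillna(group_mean)

print(generalized.tolist())
print(similar_case.tolist())
```

Replacing `"mean"` with `"median"` gives the median variant; for a qualitative attribute the mode would be used instead.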
3. Prediction Model:
A prediction model is one of the more sophisticated methods for handling missing data. Here, we create a predictive model to estimate the values that will substitute for the missing data. We divide the data set into two sets: one with no missing values for the variable and one with missing values. The first set becomes the training set of the model, the second set is the test set, and the variable with missing values is treated as the target variable. Next, we fit a model that predicts the target variable from the other attributes of the training set and use it to populate the missing values of the test set. We can use regression, ANOVA, logistic regression and various other modeling techniques for this. There are two drawbacks to this approach:
The model-estimated values are usually more well-behaved than the true values.
If there is no relationship between the other attributes in the data set and the attribute with missing values, the model will not estimate the missing values precisely.
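The procedure described above can be sketched with scikit-learn. This is a minimal illustration, assuming synthetic data, a linear relationship, and predictors `x1` and `x2` that are fully observed (all of these are assumptions made for the example, not properties of the asker's data).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data: y depends linearly on x1 and x2 plus a little noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
true_y = 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.1, size=200)

df = pd.DataFrame({"x1": x1, "x2": x2, "y": true_y})
df.loc[df.index[:40], "y"] = np.nan  # pretend the first 40 values are missing

# Split into rows where the target is known (training) and unknown (to fill).
known = df[df["y"].notna()]
unknown = df[df["y"].isna()]

# Fit on the complete rows, then predict the missing values.
model = LinearRegression().fit(known[["x1", "x2"]], known["y"])
imputed = df.copy()
imputed.loc[unknown.index, "y"] = model.predict(unknown[["x1", "x2"]])

# Compare the spread of imputed and true values for the hidden rows.
print("std of imputed values:", imputed.loc[unknown.index, "y"].std())
print("std of true values:   ", true_y[:40].std(ddof=1))
```

The imputed values lie exactly on the fitted plane and carry no noise, which is the "more well-behaved than the true values" drawback listed above. When the predictors themselves have gaps, as in the question, they would need to be handled first.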
4. KNN Imputation:
In this method, the missing values of an attribute are imputed using a given number (k) of instances that are most similar to the instance whose value is missing. The similarity of two instances is determined using a distance function. The method has certain advantages and disadvantages.
Advantages:
k-nearest neighbour can predict both qualitative & quantitative attributes
Creation of predictive model for each attribute with missing data is not required
Attributes with multiple missing values can be easily treated
Correlation structure of the data is taken into consideration
Disadvantages:
The KNN algorithm is very time-consuming on large databases, because it searches through the whole data set looking for the most similar instances.
The choice of k is critical. A higher value of k would include neighbours that are significantly different from what we need, whereas a lower value of k means missing out on significant neighbours.
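A small sketch of KNN imputation using scikit-learn's `KNNImputer`, which measures distance only over the features that both rows have observed. The array and the choice `n_neighbors=2` are made up for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: two clear clusters, with one missing value in each.
X = np.array([
    [1.0, 2.0, np.nan],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 5.0],
    [8.0, 9.0, 20.0],
    [8.2, 9.2, 24.0],
    [8.1, np.nan, 22.0],
])

# Each missing entry is replaced by the mean of that feature over the
# 2 nearest rows that have the feature observed.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled)
```

Here the first row is filled from its two close neighbours (rows 2 and 3) and the last row from rows 4 and 5. Since the search compares every incomplete row against the rest of the data, it would be sensible to try this on a subsample before running it on a million samples.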
Source: https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/
Upvotes: 4