TechGeek

Reputation: 1

What are the standard ways of filling missing values in Python?

I have a very limited dataset in which a variety of columns have missing values. I cannot drop the rows with missing values, as that would reduce the size of the dataset drastically. Can anyone suggest a standard procedure for this?

Upvotes: 0

Views: 298

Answers (3)

Swazy

Reputation: 398

What you are describing is called imputation and there are lots of interesting ways to deal with the situation. For numerical variables you can fill the missing values with the feature's mean or mode for instance. In the case of categorical variables you can make a missing value a category in itself or simply replace it with the most common category. There isn't really one correct way of doing it. Sometimes people use cases where the data isn't missing to try to predict the values of the missing cases!
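As a rough illustration of the simple end of this, here is a minimal pandas sketch. The DataFrame and its column names (`age`, `city`) are made up for the example; it assumes pandas and NumPy are installed and is only one of several reasonable ways to do this.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numerical and one categorical column, each with a gap
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "city": ["Paris", "Rome", None, "Paris"],
})

filled = df.copy()
# Numerical column: replace missing values with the mean of the observed values
filled["age"] = filled["age"].fillna(filled["age"].mean())
# Categorical column: replace missing values with the most common category
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])

# Alternative for categoricals: treat "missing" as a category in itself
city_flagged = df["city"].fillna("Missing")

print(filled)
print(city_flagged)
```

Which of the two categorical options is better depends on whether the fact that a value is missing is informative in your data.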

In the case of Python specifically, Scikit-learn has some nice methods designed to help with this here and here.
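For instance, a minimal sketch with `SimpleImputer` from `sklearn.impute` might look like the following. The array is invented for the example, and this is not necessarily what the links above point to.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix with missing entries marked as NaN
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

# strategy can also be "median", "most_frequent" or "constant"
imputer = SimpleImputer(strategy="mean")
# fit() learns each column's mean from the observed values; transform() fills the gaps
X_filled = imputer.fit_transform(X)

print(X_filled)
```

If you are training a model, fit the imputer on the training data only and reuse it to transform the test data, so that no information leaks from the test set.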

It's worth mentioning that these methods all lie on a spectrum from the very simple to the very sophisticated, and you have to decide which approach is most appropriate for your situation. At the much higher end of sophistication, there are ways you can build statistical models of the "data going missing" process and then find the most "likely" underlying values of the missing data given this most "likely" model. This might give you a flavour. I think this is often overkill though!

Upvotes: 0

EakzIT

Reputation: 632

If you just need a placeholder, there are a couple of ways of dealing with it. I prefer defaultdict from the collections module.

from collections import defaultdict

# Named "data" rather than "dict" so the built-in type is not shadowed
data = {1: 'one', 2: 'two', 3: 'three', 4: 'four'}
dict2 = defaultdict(int)
# defaultdict(<class 'int'>, {})
dict2.update(data)
# defaultdict(<class 'int'>, {1: 'one', 2: 'two', 3: 'three', 4: 'four'})
dict2[5]  # missing key: int() is called and 0 is stored as the placeholder
# 0
dict2
# defaultdict(<class 'int'>, {1: 'one', 2: 'two', 3: 'three', 4: 'four', 5: 0})

Otherwise, you can use if statements or any other placeholder.

Upvotes: 0

Amit Chauhan

Reputation: 31

To fill the missing values, you can do one of the following:

1) Compute the mean of the feature using the available values and then fill the missing values with the mean. If the values are discrete (categorical), then use the most frequent value (mode) to fill the missing ones.

2) Find the most similar example(s) to the one that has a missing value, given that these examples have a value for the particular feature. Then use their mean/mode along the feature you're interested in to fill the missing values.
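The second option can be sketched with scikit-learn's `KNNImputer`, which fills a gap with the mean of that feature over the nearest rows that do have a value for it. The data and the choice of `n_neighbors=2` are assumptions for illustration only; note that it handles numerical features, so categorical columns would need a mode-based approach instead.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: the second row is missing its second feature
X = np.array([
    [1.0, 2.0],
    [1.1, np.nan],
    [0.9, 2.2],
    [8.0, 9.0],
])

# Similarity is measured on the features that are present in both rows
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# The two most similar rows are the first and third, so the gap
# becomes the mean of 2.0 and 2.2
print(X_filled[1, 1])
```

Because the distances are sensitive to feature scale, it is usually worth standardising the columns first when they are measured in very different units.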

Upvotes: 1
