Kalpit

Reputation: 923

Dealing with missing values in a dataset in Python

How do I choose whether to drop the NaN values or fill them with the mean (or median) in a dataset? And what other techniques are there to clean missing values in a dataset? Assume the data are numeric (in Python).

Upvotes: 2

Views: 1826

Answers (2)

Sagar Dawda

Reputation: 1166

There is no specific rule for dealing with missing data. However, here are some things you may want to consider:

1. If the data for a column has over 70% missing values, you may want to drop that column.
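As a minimal sketch of that check (the DataFrame name df, the toy columns and the 70% threshold are only placeholders for illustration):

import pandas as pd
import numpy as np

df = pd.DataFrame({'mostly_missing': [np.nan, np.nan, np.nan, 1.0],
                   'ok': [1, 2, 3, 4]})

# drop every column whose fraction of missing values exceeds 70%
df = df.loc[:, df.isna().mean() <= 0.7]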

2. If the distribution of the column's data is symmetric in nature, you could consider replacing missing values with the mean:

import pandas as pd
import numpy as np

test = pd.DataFrame({'A': [1, 2, np.nan, 3, 4, 7, 11], 'B': [1, 4, 5, 7, 12, 45, 6], 'Group': ['c', 'd', 'd', 'c', 'd', 'c', 'd']})

test
    A       B   Group
0   1.0     1   c
1   2.0     4   d
2   NaN     5   d
3   3.0     7   c
4   4.0     12  d
5   7.0     45  c
6   11.0    6   d

test['A'] = test['A'].fillna(test['A'].mean())

test
    A           B   Group
0   1.000000    1   c
1   2.000000    4   d
2   4.666667    5   d
3   3.000000    7   c
4   4.000000    12  d
5   7.000000    45  c
6   11.000000   6   d

OR, starting again from the original DataFrame, you could group the data and fill with the grouped mean:

test['A'] = test['A'].fillna(test.groupby('Group')['A'].transform('mean'))
test
    A           B   Group
0   1.000000    1   c
1   2.000000    4   d
2   5.666667    5   d
3   3.000000    7   c
4   4.000000    12  d
5   7.000000    45  c
6   11.000000   6   d

3. If the data for the column is skewed, you may consider using the median to fill the missing values (replace 'mean' with 'median' in the command above).
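For example, on the same test DataFrame from point 2 (again starting from the unfilled data), the median version would simply be:

# for a skewed column, the median is more robust than the mean
test['A'] = test['A'].fillna(test['A'].median())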

4. Alternatively, you could look at an unsupervised approach like clustering. Once your data is clustered, you could use the mode or the mean value of the cluster to replace your missing data accordingly, as in the sketch below.
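There is no single recipe for this, but here is one rough sketch (assuming scikit-learn is installed; the choice of KMeans, n_clusters=2 and clustering on column 'B' are only for illustration):

from sklearn.cluster import KMeans
import pandas as pd
import numpy as np

# toy example: cluster on column 'B' (which has no missing values),
# then fill missing 'A' values with the mean of 'A' inside each cluster
test = pd.DataFrame({'A': [1, 2, np.nan, 3, 4, 7, 11],
                     'B': [1, 4, 5, 7, 12, 45, 6]})

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
test['cluster'] = kmeans.fit_predict(test[['B']])

test['A'] = test['A'].fillna(test.groupby('cluster')['A'].transform('mean'))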

Hope this helps.

Upvotes: 2

Joe

Reputation: 12417

It always depends on your dataset and the percentage of missing values.

For a small percentage of missing values, dropping the NaN rows is an acceptable solution. If the percentage is not negligible, then dropping them is strongly discouraged. The type of filling then depends on the type of data. If your missing values should lie in a known and small range, you can fill with the mean of the other values. For example, if your dataset includes the ages of students in a school (but many of those values are missing), an average of the other values shouldn't create problems for most analyses. If, on the other hand, you have a sequence of measurements that increase slowly over time, you could replace the NaN values with forward or backward filling. For example, in the situation below, df.fillna(method='ffill') (or equivalently df.ffill()) should be better than df.fillna(df.mean()):

                    A       
01-01-2018 00:00  0.1   
01-01-2018 00:01  0.1   
01-01-2018 00:02  NaN   
01-01-2018 00:03  0.1  
01-01-2018 00:04  0.2  
01-01-2018 00:05  0.2  

But in this other example, replacing with the average could be better:

             Age    Class
StudentA    15.3       10   
StudentB    16.1       10
StudentC    15.5        9
StudentD     NaN       10
StudentE    16.0       10

Again, there is no general rule; it depends on your dataset and the analysis you have to do.

Upvotes: 2
