Reputation: 923
How do I choose between dropping the NaN values and filling them with the mean (or median) in a dataset? And what other techniques are there for cleaning missing values? Assume the data is numeric (in Python).
Upvotes: 2
Views: 1826
Reputation: 1166
There is no specific rule for dealing with missing data. However, here are some things you may want to consider:
import numpy as np
import pandas as pd

test = pd.DataFrame({'A': [1, 2, np.nan, 3, 4, 7, 11], 'B': [1, 4, 5, 7, 12, 45, 6], 'Group': ['c', 'd', 'd', 'c', 'd', 'c', 'd']})
test
      A   B Group
0   1.0   1     c
1   2.0   4     d
2   NaN   5     d
3   3.0   7     c
4   4.0  12     d
5   7.0  45     c
6  11.0   6     d
# Fill the missing value with the overall mean of column A
test['A'] = test['A'].fillna(test['A'].mean())
test
           A   B Group
0   1.000000   1     c
1   2.000000   4     d
2   4.666667   5     d
3   3.000000   7     c
4   4.000000  12     d
5   7.000000  45     c
6  11.000000   6     d
# Alternatively, starting again from the original frame, fill with the mean of the row's group
test['A'] = test['A'].fillna(test.groupby('Group')['A'].transform('mean'))
test
           A   B Group
0   1.000000   1     c
1   2.000000   4     d
2   5.666667   5     d
3   3.000000   7     c
4   4.000000  12     d
5   7.000000  45     c
6  11.000000   6     d
Hope this helps.
Upvotes: 2
Reputation: 12417
It always depends on your dataset and the percentage of missing values.
For a small percentage of missing values, dropping the NaN values is an acceptable solution. If the percentage is not negligible, dropping them is strongly discouraged, because you would throw away a large share of your data.
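To make the "percentage of missing values" check concrete, here is a minimal sketch. The data, the column names and the 20% cutoff are all made up for illustration; what counts as "small" depends on your dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "x" has 1 of 10 values missing, "y" has 5 of 10.
df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "y": [np.nan, 1.0, np.nan, 3.0, np.nan, 5.0, np.nan, 7.0, np.nan, 9.0],
})

# Fraction of missing values in each column.
missing_fraction = df.isna().mean()

# Illustrative cutoff only (an assumption, not a rule).
THRESHOLD = 0.2

# Columns with few gaps: dropping the affected rows loses little data.
droppable = missing_fraction[missing_fraction <= THRESHOLD].index.tolist()
cleaned = df.dropna(subset=droppable)

print(missing_fraction)
print(droppable)
print(cleaned.shape)
```

Here only "x" falls under the cutoff, so only the rows where "x" is NaN are dropped; "y" is left for a filling strategy instead.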
Beyond that, the filling strategy depends on the type of data. If your missing values should lie in a known and small range, you can fill them with the mean of the other values. For example, if your dataset contains the ages of students in a school (but many of those values are missing), filling with the average shouldn't create problems for certain analyses.
If, on the other hand, you have a sequence of measurements that change slowly over time, you could replace the NaN values with forward or backward filling.
For example, in the situation below, df.ffill() (the current spelling of df.fillna(method='ffill'), which recent pandas versions deprecate) should work better than df.fillna(df.mean()):
                    A
01-01-2018 00:00  0.1
01-01-2018 00:01  0.1
01-01-2018 00:02  NaN
01-01-2018 00:03  0.1
01-01-2018 00:04  0.2
01-01-2018 00:05  0.2
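A small sketch of the difference on this series, assuming the index is a one-minute DatetimeIndex (the original only shows the printed frame, so the construction is a guess). df.ffill() is the method-style spelling of forward filling.

```python
import numpy as np
import pandas as pd

# Rebuild the frame shown above; the index construction is an assumption.
index = pd.date_range("2018-01-01 00:00", periods=6, freq="min")
df = pd.DataFrame({"A": [0.1, 0.1, np.nan, 0.1, 0.2, 0.2]}, index=index)

# Forward fill: carry the last valid observation forward.
forward_filled = df.ffill()

# Mean fill: use the column mean of the known values.
mean_filled = df.fillna(df.mean())

print(forward_filled["A"].iloc[2])
print(mean_filled["A"].iloc[2])
```

Forward filling keeps the gap at 0.1, in line with its neighbours, while the mean pulls it up towards the later 0.2 readings.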
But in this other example, replacing with the average could be better:
           Age  Class
StudentA  15.3     10
StudentB  16.1     10
StudentC  15.5      9
StudentD   NaN     10
StudentE  16.0     10
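As a rough sketch for this table (the frame is rebuilt by hand from the values shown, and the variable names are mine), filling the missing age with the mean or the median of the known ages could look like this:

```python
import numpy as np
import pandas as pd

# Rebuild the table shown above.
students = pd.DataFrame(
    {"Age": [15.3, 16.1, 15.5, np.nan, 16.0], "Class": [10, 10, 9, 10, 10]},
    index=["StudentA", "StudentB", "StudentC", "StudentD", "StudentE"],
)

# Fill with the mean of the known ages.
mean_filled = students["Age"].fillna(students["Age"].mean())

# Fill with the median instead; it is less sensitive to outliers.
median_filled = students["Age"].fillna(students["Age"].median())

print(mean_filled["StudentD"])
print(median_filled["StudentD"])
```

Both values land inside the narrow range of the other ages, which is why an average is a reasonable choice for this kind of data.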
Again, there is no general rule; it depends on your dataset and the analysis you have to do.
Upvotes: 2