Reputation: 166
I'm trying to drop columns that have too many missing values. How can I count the occurrences of specific values within columns, given that missing values are represented by 99 or 90?
Here is the code that is supposed to drop the columns that exceed the threshold:
threshold = 0.6
data = data[data.columns[[data.column == 90 or data.column == 99].count().mean() < threshold]]
I'm not very familiar with pandas, so any suggestions would be helpful.
Upvotes: 1
Views: 193
Reputation: 260640
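You're almost there. Use apply with isin to flag the 90/99 placeholders, then take the row-wise mean and keep the rows whose share of placeholders is below the threshold: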
threshold = 0.6
out = data[data.apply(lambda s: s.isin([90, 99])).mean(1).lt(threshold)]
Example input:
    0   1   2   3   4
0   0  90   0   0   0
1   0   0   0   0   0
2   0  90   0  99   0
3  90   0   0   0   0
4  99  99   0  90  99  # to drop
5  99   0   0   0  99
6   0   0  99   0  90
7   0  90  99   0  90  #
8  99  90   0  90   0  #
9   0  99   0   0   0
Output:
    0   1   2   3   4
0   0  90   0   0   0
1   0   0   0   0   0
2   0  90   0  99   0
3  90   0   0   0   0
5  99   0   0   0  99
6   0   0  99   0  90
9   0  99   0   0   0
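Note that the snippet above averages along axis 1, so it filters rows. If the goal is to drop columns, as the question asks, a minimal sketch would take the mean per column instead. The frame below and its column names (a, b, c, d) are made up for illustration; only the 90/99 placeholders and the 0.6 threshold come from the question.

```python
import pandas as pd

# Hypothetical sample data; 90 and 99 stand for missing values.
data = pd.DataFrame({
    "a": [90, 99, 99, 99, 0],    # 4/5 = 0.8 placeholders -> dropped
    "b": [0, 0, 0, 99, 0],       # 1/5 = 0.2 placeholders -> kept
    "c": [90, 99, 90, 99, 90],   # 5/5 = 1.0 placeholders -> dropped
    "d": [1, 2, 3, 4, 5],        # 0/5 = 0.0 placeholders -> kept
})

threshold = 0.6

# Boolean mask of the cells that hold a placeholder value.
is_missing = data.isin([90, 99])

# Number of placeholder values in each column.
missing_count = is_missing.sum(axis=0)

# Fraction of placeholder values in each column.
missing_share = is_missing.mean(axis=0)

# Keep only the columns whose share is strictly below the threshold.
out = data.loc[:, missing_share < threshold]
```

Here missing_count answers the counting part of the question, and out keeps only columns b and d. isin works on the whole DataFrame, so apply is not needed for this variant.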
Upvotes: 3