Reputation: 1829
I am trying to learn how to select rows or columns with more than certain number of missing data. Using value_counts, I can find the column that meets the condition, but I can't figure out how to retrieve index in int form so that I can access and remove the corresponding column from the dataframe.
How can I convert / retrieve index in int? And, is there easier / simpler way to do this?
df3 = pandas.DataFrame([[1,6.5,3],[1,'NA','NA'],[3,'NA','NA'],['NA',6.5,'NA']])
df3_value_counts = df3.apply(pandas.value_counts).fillna(0).ix['NA']
df3_missing_data_index = df3_value_counts[df3_value_counts > 2].index
Upvotes: 2
Views: 499
Reputation: 221614
Here's one approach -
df3.columns[(df3.values == 'NA').sum(0) > 2]
Alternatively and probably a bit slower, but compact one -
df3.columns[(df3 == 'NA').sum(0) > 2]
Sample run -
In [292]: df3
Out[292]:
0 1 2
0 1 6.5 3
1 1 NA NA
2 3 NA NA
3 NA 6.5 NA
In [293]: df3.columns[(df3.values == 'NA').sum(0) > 2]
Out[293]: Int64Index([2], dtype='int64')
Upvotes: 0
Reputation: 294468
first off, you're much better off using things that represent NaN
df3 = df3.replace('NA', np.nan).astype(float)
Then you can use built in methods to do what you want
df3.dropna(axis=1, thresh=2)
Upvotes: 5