user3562812
user3562812

Reputation: 1829

How do I select row / columns with more than certain number of missing data in Pandas?

I am trying to learn how to select rows or columns with more than certain number of missing data. Using value_counts, I can find the column that meets the condition, but I can't figure out how to retrieve index in int form so that I can access and remove the corresponding column from the dataframe.

How can I convert / retrieve index in int? And, is there easier / simpler way to do this?

  df3 = pandas.DataFrame([[1,6.5,3],[1,'NA','NA'],[3,'NA','NA'],['NA',6.5,'NA']])
    df3_value_counts = df3.apply(pandas.value_counts).fillna(0).ix['NA']
    df3_missing_data_index = df3_value_counts[df3_value_counts > 2].index

Upvotes: 2

Views: 499

Answers (2)

Divakar
Divakar

Reputation: 221614

Here's one approach -

df3.columns[(df3.values == 'NA').sum(0) > 2]

Alternatively and probably a bit slower, but compact one -

df3.columns[(df3 == 'NA').sum(0) > 2]

Sample run -

In [292]: df3
Out[292]: 
    0    1   2
0   1  6.5   3
1   1   NA  NA
2   3   NA  NA
3  NA  6.5  NA

In [293]: df3.columns[(df3.values == 'NA').sum(0) > 2]
Out[293]: Int64Index([2], dtype='int64')

Upvotes: 0

piRSquared
piRSquared

Reputation: 294468

first off, you're much better off using things that represent NaN

df3 = df3.replace('NA', np.nan).astype(float)

Then you can use built in methods to do what you want

df3.dropna(axis=1, thresh=2)

enter image description here

Upvotes: 5

Related Questions