Martin
Martin

Reputation: 51

Python pandas: How do I drop specific levels in a hierarchical index if any column values are NaN?

I want to drop entire levels (in this case, countries) in my hierarchical index if ANY of the data values for that country are NaN. So I want to go from something like this:

                          M1                   M2
country    year                                   
Arab World 2010           5.240002             NaN
           2009           NaN                  NaN
Bangladesh 2010           6.206065             3.7
           2009           5.708707             NaN
Canada     2010           7.203803             5.8
           2009           6.144833             7.0
Sweden     2010           9.123140             6.0
           2009           5.213283             6.1

to something like this:

                          M1                   M2
country    year                                   
Canada     2010           7.203803             5.8
           2009           6.144833             7.0
Sweden     2010           9.123140             6.0
           2009           5.213283             6.1

I've tried df.dropna() with the thresh option, as well as df.fillna(0) to try and make the deletion of countries easier, but both methods are built to leave levels in the DataFrame if M1 and M2 contain values, e.g. Bangladesh in 2010.

Are there any concise ways to solve this problem?

Upvotes: 1

Views: 497

Answers (2)

Vidhya G
Vidhya G

Reputation: 2320

Yes there is a concise and efficient way to solve this. You were on the right track with df.dropna(), just that you need to unstack your data before you apply it.

>>> print df

                       M1   M2
Country    Year               
Arab World 2009       NaN  NaN
           2010  5.240002  NaN
Bangladesh 2009  5.708707  NaN
           2010  6.206065  3.7
Canada     2009  6.144833  7.0
           2010  7.203803  5.8
Sweden     2009  5.213283  6.1
           2010  9.123140  6.0

pivot the DataFrame to make "Year" the inner most column labels

>>> df1 = df.unstack(level=-1)

remove rows with missing data

>>> df2 = df1.dropna()  

reverse the unstacking

>>> print df2.stack()

                    M1   M2
Country Year               
Canada  2009  6.144833  7.0
        2010  7.203803  5.8
Sweden  2009  5.213283  6.1
        2010  9.123140  6.0

Put all this together:

>>> clean = df.unstack(level=-1).dropna().stack()

Upvotes: 2

Mark Graph
Mark Graph

Reputation: 5121

The three lines at the bottom of this code block do the heavy lifting, the rest gets your data into a DataFrame (more or less).

# get data
data="""
country    year           M1                   M2 
Arab_World 2010           5.240002             NaN
Arab_World 2009           NaN                  NaN
Bangladesh 2010           6.206065             3.7
Bangladesh 2009           5.708707             NaN
Canada     2010           7.203803             5.8
Canada     2009           6.144833             7.0
Sweden     2010           9.123140             6.0
Sweden     2009           5.213283             6.1"""
from StringIO import StringIO # import from io for python 3
df = pd.read_csv(StringIO(data), header=0, index_col=['country', 'year'], sep=r'\s+')

# manipulate rows
to_drop = df.groupby(level='country').apply(lambda x: x.isnull().any().any())
df = df.reset_index(level=0)
keepers = df[(~to_drop[df.country]).tolist()]

Yields

In [13]: print(keepers)
     country        M1   M2
year                       
2010  Canada  7.203803  5.8
2009  Canada  6.144833  7.0
2010  Sweden  9.123140  6.0
2009  Sweden  5.213283  6.1

Upvotes: 2

Related Questions