Reputation: 51
I want to drop entire levels (in this case, countries) in my hierarchical index if ANY of the data values for that country are NaN. So I want to go from something like this:
M1 M2
country year
Arab World 2010 5.240002 NaN
2009 NaN NaN
Bangladesh 2010 6.206065 3.7
2009 5.708707 NaN
Canada 2010 7.203803 5.8
2009 6.144833 7.0
Sweden 2010 9.123140 6.0
2009 5.213283 6.1
to something like this:
M1 M2
country year
Canada 2010 7.203803 5.8
2009 6.144833 7.0
Sweden 2010 9.123140 6.0
2009 5.213283 6.1
I've tried df.dropna()
with the thresh
option, as well as df.fillna(0)
to try and make the deletion of countries easier, but both methods are built to leave levels in the DataFrame if M1 and M2 contain values, e.g. Bangladesh in 2010.
Are there any concise ways to solve this problem?
Upvotes: 1
Views: 497
Reputation: 2320
Yes there is a concise and efficient way to solve this. You were on the right track with df.dropna()
, just that you need to unstack
your data before you apply it.
>>> print df
M1 M2
Country Year
Arab World 2009 NaN NaN
2010 5.240002 NaN
Bangladesh 2009 5.708707 NaN
2010 6.206065 3.7
Canada 2009 6.144833 7.0
2010 7.203803 5.8
Sweden 2009 5.213283 6.1
2010 9.123140 6.0
pivot the DataFrame
to make "Year" the inner most column labels
>>> df1 = df.unstack(level=-1)
remove rows with missing data
>>> df2 = df1.dropna()
reverse the unstacking
>>> print df2.stack()
M1 M2
Country Year
Canada 2009 6.144833 7.0
2010 7.203803 5.8
Sweden 2009 5.213283 6.1
2010 9.123140 6.0
Put all this together:
>>> clean = df.unstack(level=-1).dropna().stack()
Upvotes: 2
Reputation: 5121
The three lines at the bottom of this code block do the heavy lifting, the rest gets your data into a DataFrame (more or less).
# get data
data="""
country year M1 M2
Arab_World 2010 5.240002 NaN
Arab_World 2009 NaN NaN
Bangladesh 2010 6.206065 3.7
Bangladesh 2009 5.708707 NaN
Canada 2010 7.203803 5.8
Canada 2009 6.144833 7.0
Sweden 2010 9.123140 6.0
Sweden 2009 5.213283 6.1"""
from StringIO import StringIO # import from io for python 3
df = pd.read_csv(StringIO(data), header=0, index_col=['country', 'year'], sep=r'\s+')
# manipulate rows
to_drop = df.groupby(level='country').apply(lambda x: x.isnull().any().any())
df = df.reset_index(level=0)
keepers = df[(~to_drop[df.country]).tolist()]
Yields
In [13]: print(keepers)
country M1 M2
year
2010 Canada 7.203803 5.8
2009 Canada 6.144833 7.0
2010 Sweden 9.123140 6.0
2009 Sweden 5.213283 6.1
Upvotes: 2