Pandas Groupby with multiple columns selecting rows with full range of values

Question

I am working with a pandas dataframe. From the code:

contracts.groupby(['State','Year'])['$'].mean()

I have a pandas groupby object with two group layers: State and Year.

State / Year / $
NY     2009   5
       2010   10
       2011   5
       2012   15
NJ     2009   2
       2012   12
DE     2009   1
       2010   2
       2011   3
       2012   6

I would like to look at only those states for which I have data on all the years (i.e. NY and DE, not NJ as it is missing 2010). Is there a way to suppress those nested groups with less than full rank?

unutbu · Accepted Answer

After grouping by State and Year and taking the mean,

means = contracts.groupby(['State', 'Year'])['$'].mean()

you could groupby the State alone, and use filter to keep the desired groups:

result = means.groupby(level='State').filter(lambda x: len(x)>=len(years))

For example,

import numpy as np
import pandas as pd
np.random.seed(2015)
N = 15

states = ['NY','NJ','DE']
years = range(2009, 2013)
contracts = pd.DataFrame({
    'State': np.random.choice(states, size=N),
    'Year': np.random.choice(years, size=N), 
    '$': np.random.randint(10, size=N)})

means = contracts.groupby(['State', 'Year'])['$'].mean()
result = means.groupby(level='State').filter(lambda x: len(x)>=len(years))

print(result)

yields

State  Year
DE     2009    8
       2010    5
       2011    3
       2012    6
NY     2009    2
       2010    1
       2011    5
       2012    9
Name: $, dtype: int64

Alternatively, you could filter first and then take the mean:

filtered = contracts.groupby(['State']).filter(lambda x: x['Year'].nunique() >= len(years))
result = filtered.groupby(['State', 'Year'])['$'].mean()

but playing with various examples suggest this is typically slower than taking the mean, then filtering.

Pandas Groupby with multiple columns selecting rows with full range of values

Answers (1)

Related Questions