Johannes Wachs
Johannes Wachs

Reputation: 1310

Pandas Groupby with multiple columns selecting rows with full range of values

I am working with a pandas dataframe. From the code:

contracts.groupby(['State','Year'])['$'].mean()

I have a pandas groupby object with two group layers: State and Year.

State / Year / $
NY     2009   5
       2010   10
       2011   5
       2012   15
NJ     2009   2
       2012   12
DE     2009   1
       2010   2
       2011   3
       2012   6

I would like to look at only those states for which I have data on all the years (i.e. NY and DE, not NJ as it is missing 2010). Is there a way to suppress those nested groups with less than full rank?

Upvotes: 0

Views: 2869

Answers (1)

unutbu
unutbu

Reputation: 879591

After grouping by State and Year and taking the mean,

means = contracts.groupby(['State', 'Year'])['$'].mean()

you could groupby the State alone, and use filter to keep the desired groups:

result = means.groupby(level='State').filter(lambda x: len(x)>=len(years))

For example,

import numpy as np
import pandas as pd
np.random.seed(2015)
N = 15

states = ['NY','NJ','DE']
years = range(2009, 2013)
contracts = pd.DataFrame({
    'State': np.random.choice(states, size=N),
    'Year': np.random.choice(years, size=N), 
    '$': np.random.randint(10, size=N)})

means = contracts.groupby(['State', 'Year'])['$'].mean()
result = means.groupby(level='State').filter(lambda x: len(x)>=len(years))

print(result)

yields

State  Year
DE     2009    8
       2010    5
       2011    3
       2012    6
NY     2009    2
       2010    1
       2011    5
       2012    9
Name: $, dtype: int64

Alternatively, you could filter first and then take the mean:

filtered = contracts.groupby(['State']).filter(lambda x: x['Year'].nunique() >= len(years))
result = filtered.groupby(['State', 'Year'])['$'].mean()

but playing with various examples suggest this is typically slower than taking the mean, then filtering.

Upvotes: 2

Related Questions