Shubham R

Reputation: 7644

Summarize dataframe by grouping on a column with pandas

I have a dataframe:

id  store  val1    val2
1    abc    20      30
1    abc    20      40
1    qwe    78      45
2    dfd    34      45
2    sad    43      45

From this I have to group by id and create a new df with the columns total_store, unique stores, and non-repeating_stores, which contain the counts of such store occurrences. My final output should be:

id    total_store    unique stores    non-repeating_stores
1        3              2                   1
2        2              2                   2

I can get the total stores by doing:

df.groupby('id')['store'].count()

But how do I get the others and form a dataframe out of it?

Upvotes: 2

Views: 2362

Answers (2)

cs95

Reputation: 402333

You can use a groupby + agg.

df = df.groupby('id').store.agg(['count', 'nunique', \
                lambda x: x.drop_duplicates(keep=False).size])
df.columns = ['total_store', 'unique stores', 'non-repeating_stores']

df    
    total_store  unique stores  non-repeating_stores
id                                                  
1             3              2                     1
2             2              2                     2
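On pandas 0.25 and later, named aggregation can produce the same result while naming the output columns in a single step, avoiding the separate `df.columns` assignment. A minimal sketch using the question's sample data (the column names here use underscores, since keyword arguments cannot contain spaces or hyphens):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'id':    [1, 1, 1, 2, 2],
    'store': ['abc', 'abc', 'qwe', 'dfd', 'sad'],
    'val1':  [20, 20, 78, 34, 43],
    'val2':  [30, 40, 45, 45, 45],
})

# Named aggregation (pandas >= 0.25): each keyword names an output column
# and maps it to an aggregation function.
out = df.groupby('id')['store'].agg(
    total_store='count',
    unique_stores='nunique',
    non_repeating_stores=lambda x: x.drop_duplicates(keep=False).size,
)
print(out)
```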

For older pandas versions, passing a dict of output names to functions simplifies the code, though this renaming behaviour is deprecated from 0.20 onwards:

agg_funcs = {'total_stores' : 'count', 'unique_stores' : 'nunique', 
         'non-repeating_stores' : lambda x: x.drop_duplicates(keep=False).size
}
df = df.groupby('id').store.agg(agg_funcs)

df 
    total_stores  non-repeating_stores  unique_stores
id                                                   
1              3                     1              2
2              2                     2              2

As a slight speed improvement, you can use drop_duplicates' sister method, duplicated, as shown in jezrael's answer:

lambda x: (~x.duplicated(keep=False)).sum()
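To see why this works, here is a minimal sketch: duplicated(keep=False) marks every member of a duplicated group as True, so its inverse flags exactly the values that occur once:

```python
import pandas as pd

# The stores for id == 1 in the question's data
s = pd.Series(['abc', 'abc', 'qwe'])

# keep=False marks *all* occurrences of a duplicated value as True...
print(s.duplicated(keep=False).tolist())   # [True, True, False]

# ...so inverting and summing counts the values that appear exactly once.
print((~s.duplicated(keep=False)).sum())   # 1
```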

This would replace the third function in agg, giving roughly a 30% speed boost on large data of size 1000000 (7.31 s vs 5.19 s per loop):

1 loop, best of 3: 7.31 s per loop 

v/s

1 loop, best of 3: 5.19 s per loop 

Upvotes: 3

jezrael

Reputation: 862511

Use groupby with agg with count and nunique. The last function is a bit more complicated: count all the non-duplicated values by inverting duplicated and summing.

If you need to count NaNs as well, use size instead of count:

df = df.groupby('id')['store'].agg(['count', 
                                   'nunique', 
                                    lambda x: (~x.duplicated(keep=False)).sum()])
df.columns = ['total_store', 'unique stores', 'non-repeating_stores']
print (df)
    total_store  unique stores  non-repeating_stores
id                                                  
1             3              2                     1
2             2              2                     2
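The desired output in the question shows id as a regular column rather than the index, so a reset_index call at the end may help. A sketch using the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'id':    [1, 1, 1, 2, 2],
    'store': ['abc', 'abc', 'qwe', 'dfd', 'sad'],
})

out = df.groupby('id')['store'].agg(
    ['count', 'nunique', lambda x: (~x.duplicated(keep=False)).sum()]
)
out.columns = ['total_store', 'unique stores', 'non-repeating_stores']
out = out.reset_index()   # turn the 'id' index back into a column
print(out)
```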

Timings:

np.random.seed(123)
N = 1000000


L = np.random.randint(10000,size=N).astype(str)
df = pd.DataFrame({'store': np.random.choice(L, N),
                   'id': np.random.randint(10000, size=N)})
print (df)

In [120]: %timeit (df.groupby('id')['store'].agg(['count',  'nunique', lambda x: (~x.duplicated(keep=False)).sum()]))
1 loop, best of 3: 4.47 s per loop

In [122]: %timeit (df.groupby('id').store.agg(['count', 'nunique', lambda x: x.drop_duplicates(keep=False).size]))
1 loop, best of 3: 11 s per loop

Upvotes: 2
