Reputation: 7644
I have a DataFrame:
id  store  val1  val2
1   abc      20    30
1   abc      20    40
1   qwe      78    45
2   dfd      34    45
2   sad      43    45
From this I have to group by id and create a new DataFrame with the columns total_store, unique stores, and non-repeating_stores, which contain the counts of those store occurrences. My final output should be:
id  total_store  unique stores  non-repeating_stores
1             3              2                     1
2             2              2                     2
I can get the total stores with
df.groupby('id')['store'].count()
but how do I get the others and form a DataFrame out of it?
Upvotes: 2
Views: 2362
Reputation: 402333
You can use groupby + agg.
# count -> total rows per id; nunique -> distinct stores;
# the lambda -> stores occurring exactly once within the group
df = df.groupby('id').store.agg(['count', 'nunique',
                                 lambda x: x.drop_duplicates(keep=False).size])
df.columns = ['total_store', 'unique stores', 'non-repeating_stores']
df
    total_store  unique stores  non-repeating_stores
id
1             3              2                     1
2             2              2                     2
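To see why the lambda works: drop_duplicates(keep=False) removes every occurrence of a duplicated value, so only the stores that appear exactly once in the group survive. A minimal sketch on the stores for id 1 from the question:

import pandas as pd

s = pd.Series(['abc', 'abc', 'qwe'])        # stores for id 1
# keep=False drops *all* copies of any duplicated value
print(s.drop_duplicates(keep=False))        # only 'qwe' remains
print(s.drop_duplicates(keep=False).size)   # 1 -> non-repeating_stores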
In older pandas versions, passing a dict lets you name the columns in one step, but this form is deprecated from 0.20 onwards:
agg_funcs = {'total_stores': 'count',
             'unique_stores': 'nunique',
             'non-repeating_stores': lambda x: x.drop_duplicates(keep=False).size}
df = df.groupby('id').store.agg(agg_funcs)
df
    total_stores  non-repeating_stores  unique_stores
id
1              3                     1              2
2              2                     2              2
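As a side note (not in the original answer): pandas 0.25+ restores one-step renaming via named aggregation, avoiding both the deprecated dict form and the manual column assignment. A sketch, assuming pandas >= 0.25; the keyword syntax requires valid Python identifiers, hence the underscore in non_repeating_stores:

df = df.groupby('id')['store'].agg(
    total_store='count',
    unique_stores='nunique',
    non_repeating_stores=lambda x: x.drop_duplicates(keep=False).size,
)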
As a slight speed improvement, you can use drop_duplicates' sister method, duplicated, in this fashion, as documented by jezrael:
lambda x: (~x.duplicated(keep=False)).sum()
This replaces the third function in agg, with roughly a 30% speed boost over large data of size 1,000,000:
1 loop, best of 3: 7.31 s per loop
vs.
1 loop, best of 3: 5.19 s per loop
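The two lambdas are equivalent: x.duplicated(keep=False) marks every occurrence of a repeated value with True, so inverting the mask and summing counts the values that appear exactly once, without materialising an intermediate Series. A quick sketch on the stores for id 1:

import pandas as pd

s = pd.Series(['abc', 'abc', 'qwe'])
mask = s.duplicated(keep=False)   # [True, True, False]
print((~mask).sum())              # 1 -- same as drop_duplicates(keep=False).size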
Upvotes: 3
Reputation: 862511
Use groupby with agg, combining count and nunique. The last function is a bit complicated: it counts all non-duplicates by inverting duplicated and taking the sum. If you need to count NaNs as well, use size instead of count (see the sketch after the output below):
df = df.groupby('id')['store'].agg(['count',
'nunique',
lambda x: (~x.duplicated(keep=False)).sum()])
df.columns = ['total_store', 'unique stores', 'non-repeating_stores']
print (df)
    total_store  unique stores  non-repeating_stores
id
1             3              2                     1
2             2              2                     2
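To illustrate the size-versus-count remark: count skips NaN while size includes it. A minimal sketch, assuming one missing store value:

import numpy as np
import pandas as pd

df_nan = pd.DataFrame({'id': [1, 1, 1],
                       'store': ['abc', np.nan, 'qwe']})
print(df_nan.groupby('id')['store'].count())   # 2 -- NaN excluded
print(df_nan.groupby('id')['store'].size())    # 3 -- NaN included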
Timings:
import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000
L = np.random.randint(10000, size=N).astype(str)
df = pd.DataFrame({'store': np.random.choice(L, N),
                   'id': np.random.randint(10000, size=N)})
print (df)
In [120]: %timeit (df.groupby('id')['store'].agg(['count', 'nunique', lambda x: (~x.duplicated(keep=False)).sum()]))
1 loop, best of 3: 4.47 s per loop
In [122]: %timeit (df.groupby('id').store.agg(['count', 'nunique', lambda x: x.drop_duplicates(keep=False).size]))
1 loop, best of 3: 11 s per loop
Upvotes: 2