Reputation: 1951
Let's say I have:
df = pd.DataFrame({'a' : [1, 2, 3, 4, 5] , 'b' : ['cat_1', 'cat_1', 'cat_2', 'cat_2', 'cat_2']})
I perform a groupby:
df.groupby(['b']).agg(['count', 'median'])
I would like to iterate through the rows that this call returns, for example:
for row in ?:
    print(row)
should print something like:
('cat_1', 2, 1.5)
('cat_2', 3, 4)
Upvotes: 0
Views: 9536
Reputation: 164673
You've misunderstood: df.groupby(['b']).agg(['count', 'median'])
returns an in-memory dataframe, not an iterator of groupwise results.
Your result is often expressed in this way:
res = df.groupby('b')['a'].agg(['count', 'median'])
print(res)
# count median
# b
# cat_1 2 1.5
# cat_2 3 4.0
Iterating a dataframe is possible via iterrows or, more efficiently, itertuples:
for row in df.groupby('b')['a'].agg(['count', 'median']).itertuples():
    print((row.Index, row.count, row.median))
# ('cat_1', 2, 1.5)
# ('cat_2', 3, 4.0)
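For completeness, a minimal iterrows sketch of the same result. Note that iterrows converts each row to a Series, which upcasts the row to a common dtype, so the integer count comes back as a float:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': ['cat_1', 'cat_1', 'cat_2', 'cat_2', 'cat_2']})
res = df.groupby('b')['a'].agg(['count', 'median'])

# iterrows yields (index, Series) pairs; each row Series is upcast
# to a common dtype, so 'count' is a float here rather than an int
for idx, row in res.iterrows():
    print((idx, row['count'], row['median']))
```

This dtype upcasting is one reason itertuples is usually preferable: it is faster and keeps each column's original dtype.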
If you are looking to calculate lazily, iterate a groupby object and perform your calculations on each group independently. For data that fits comfortably in memory, you should expect this to be slower than iterating a dataframe of results.
for key, group in df.groupby('b'):
    print((key, group['a'].count(), group['a'].median()))
# ('cat_1', 2, 1.5)
# ('cat_2', 3, 4.0)
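The same lazy pattern can be written as a generator expression, so no intermediate dataframe of results is ever materialised. A sketch; the int()/float() casts are only there to print plain Python scalars:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': ['cat_1', 'cat_1', 'cat_2', 'cat_2', 'cat_2']})

# grouping the selected column gives (key, Series) pairs; the generator
# computes one (key, count, median) tuple at a time on demand
rows = ((key, int(grp.count()), float(grp.median()))
        for key, grp in df.groupby('b')['a'])

for row in rows:
    print(row)
# ('cat_1', 2, 1.5)
# ('cat_2', 3, 4.0)
```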
If you do face memory issues, consider dask.dataframe for such tasks.
Upvotes: 7
Reputation: 1951
This will do the trick:
for item in df.groupby(['b']).agg(['count', 'median']).reset_index().values:
    # perform an operation on 'item', e.g.:
    print(tuple(item))
Upvotes: 0
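One caveat with this approach: after reset_index() the frame mixes strings and numbers, so .values is a NumPy array of object dtype, and each item is an object array rather than a tuple. Wrapping it in tuple() (or using itertuples, as in the other answer) restores tuple-like rows. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': ['cat_1', 'cat_1', 'cat_2', 'cat_2', 'cat_2']})
res = df.groupby(['b']).agg(['count', 'median']).reset_index()

# res.values has object dtype (it mixes strings and numbers),
# so each 'item' is a NumPy object array of length 3
for item in res.values:
    print(tuple(item))
```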