Reputation: 207
I have a dataframe which I am grouping as follows and applying certain operations to particular columns:
df = df.groupby(['A', 'B', 'C']).agg({'ID': 'count', 'AMT': 'sum'})
For each groupby combination (~15) I want to randomly sample the rows belonging to each combination and return a sample ID and report it in a third output column. OR I really just want one of the IDs from the rows in that combination to appear in the table, I don't care if it is 'random' or not.
I have tried the following:
df = df.groupby(['A', 'B', 'C']).agg({'ID': 'count', 'AMT': 'sum', 'ID': 'sample'})
and received the error:
AttributeError: Cannot access callable attribute 'sample' of 'SeriesGroupBy' objects, try using the 'apply' method
So I then tried:
func = lambda x: x.sample
df = df.groupby(['A', 'B', 'C']).agg({'ID': 'count', 'AMT': 'sum', 'ID': apply(func)})
which didnt work so I tried
df = df.groupby(['A', 'B', 'C']).agg({'ID': 'count', 'AMT': 'sum', 'ID': lambda x: x.sample})
which also didn't work. I have reviewed the following links for related questions but they did not seem to help me either.
Select multiple groups from pandas groupby object
http://pandas.pydata.org/pandas-docs/stable/groupby.html
Get specific element from Groups after applying groupby - PANDAS
How to access pandas groupby dataframe by key
https://chrisalbon.com/python/pandas_apply_operations_to_dataframes.html
Any thoughts on how to handle?
Upvotes: 2
Views: 642
Reputation: 294258
Consider the example dataframe df
np.random.seed([3,1415])
df = pd.DataFrame(dict(
A=list('x'*8 + 'y'*8 + 'z'*8) * 2,
B=list('x'*4 + 'y'*4 + 'z'*4) * 4,
C=list('x'*2 + 'y'*2 + 'z'*2) * 8,
ID=range(48),
AMT=np.random.rand(48)
))
print(df)
A AMT B C ID
0 x 0.444939 x x 0
1 x 0.407554 x x 1
2 x 0.460148 x y 2
3 x 0.465239 x y 3
4 x 0.462691 y z 4
5 x 0.016545 y z 5
6 x 0.850445 y x 6
7 x 0.817744 y x 7
8 y 0.777962 z y 8
9 y 0.757983 z y 9
...
39 y 0.778883 x y 39
40 z 0.651676 y z 40
41 z 0.136097 y z 41
42 z 0.544838 y x 42
43 z 0.035073 y x 43
44 z 0.275079 z y 44
45 z 0.706685 z y 45
46 z 0.713614 z z 46
47 z 0.776050 z z 47
#1
You can use np.random.choice
to pick one
f = dict(
ID=dict(Count='count', Sample=np.random.choice),
AMT=dict(Sum='sum', Max='max', Min='min')
)
df.groupby(['A', 'B', 'C']).agg(f)
AMT ID
Sum Max Min Sample Count
A B C
x x x 2.458188 0.866059 0.407554 25 4
y 1.993843 0.691271 0.377185 27 4
y x 3.070036 0.850445 0.700900 7 4
z 1.139663 0.462691 0.016545 28 4
y x x 2.824838 0.926879 0.253200 13 4
y 2.166114 0.778883 0.117642 39 4
z y 2.351120 0.796487 0.018688 8 4
z 3.367248 0.934829 0.700566 10 4
z y x 1.118176 0.544838 0.035073 19 4
z 1.133523 0.651676 0.136097 16 4
z y 1.870361 0.706685 0.275079 44 4
z 2.412484 0.836997 0.085823 47 4
#2
pick more than 1
You can use pd.DataFrame.sample
to take whole slices of df
df.groupby(['A', 'B', 'C']).apply(pd.DataFrame.sample, n=2)
A AMT B C ID
A B C
x x x 25 x 0.866059 x x 25
0 x 0.444939 x x 0
y 26 x 0.691271 x y 26
27 x 0.377185 x y 27
y x 6 x 0.850445 y x 6
31 x 0.700946 y x 31
z 28 x 0.225146 y z 28
29 x 0.435280 y z 29
y x x 13 y 0.926879 x x 13
37 y 0.253200 x x 37
y 38 y 0.548054 x y 38
39 y 0.778883 x y 39
z y 33 y 0.018688 z y 33
32 y 0.796487 z y 32
z 11 y 0.831104 z z 11
10 y 0.934829 z z 10
z y x 42 z 0.544838 y x 42
43 z 0.035073 y x 43
z 41 z 0.136097 y z 41
17 z 0.199844 y z 17
z y 20 z 0.278735 z y 20
45 z 0.706685 z y 45
z 22 z 0.085823 z z 22
47 z 0.776050 z z 47
Upvotes: 2
Reputation: 6302
There were a couple of issues with the code you tried to use.
lambda x: x.sample
returns the function itself instead of calling it (do lambda x: x.sample()
).x.sample()
returns a Series
object. .agg()
excepts a single scalar value or a list, so that causes an exception. You could do x.sample().tolist()
but now you have a list instead of a single value. It's more convenient to use numpy.random.choice()
here..agg()
does not work. Only the last one is going to stick. Assign the functions you want to use as a list or dict to that column instead.Taking the above into consideration, you'll reach this solution:
import numpy as np
df = df.groupby(['A', 'B', 'C']).agg({'ID': ('count', np.random.choice), 'AMT': 'sum'})
Upvotes: 2
Reputation: 19375
given that you aggregate using agg
, you must return a single element. So you can try using
'ID': lambda x: x.sample.tolist()
Upvotes: 0