Doug
Doug

Reputation: 207

Sampling a grouped dataframe by column

I have a dataframe which I am grouping as follows and applying certain operations to particular columns:

 df = df.groupby(['A', 'B', 'C']).agg({'ID': 'count', 'AMT': 'sum'})

For each groupby combination (~15) I want to randomly sample the rows belonging to each combination and return a sample ID and report it in a third output column. OR I really just want one of the IDs from the rows in that combination to appear in the table, I don't care if it is 'random' or not.

I have tried the following:

 df = df.groupby(['A', 'B', 'C']).agg({'ID': 'count', 'AMT': 'sum', 'ID': 'sample'})

and received the error:

 AttributeError: Cannot access callable attribute 'sample' of 'SeriesGroupBy' objects, try using the 'apply' method

So I then tried:

 func = lambda x: x.sample
 df = df.groupby(['A', 'B', 'C']).agg({'ID': 'count', 'AMT': 'sum', 'ID': apply(func)})

which didnt work so I tried

df = df.groupby(['A', 'B', 'C']).agg({'ID': 'count', 'AMT': 'sum', 'ID': lambda x: x.sample})   

which also didn't work. I have reviewed the following links for related questions but they did not seem to help me either.

Select multiple groups from pandas groupby object

http://pandas.pydata.org/pandas-docs/stable/groupby.html

Get specific element from Groups after applying groupby - PANDAS

How to access pandas groupby dataframe by key

https://chrisalbon.com/python/pandas_apply_operations_to_dataframes.html

Any thoughts on how to handle?

Upvotes: 2

Views: 642

Answers (3)

piRSquared
piRSquared

Reputation: 294258

Consider the example dataframe df

np.random.seed([3,1415])
df = pd.DataFrame(dict(
        A=list('x'*8 + 'y'*8 + 'z'*8) * 2,
        B=list('x'*4 + 'y'*4 + 'z'*4) * 4,
        C=list('x'*2 + 'y'*2 + 'z'*2) * 8,
        ID=range(48),
        AMT=np.random.rand(48)
    ))

print(df)

    A       AMT  B  C  ID
0   x  0.444939  x  x   0
1   x  0.407554  x  x   1
2   x  0.460148  x  y   2
3   x  0.465239  x  y   3
4   x  0.462691  y  z   4
5   x  0.016545  y  z   5
6   x  0.850445  y  x   6
7   x  0.817744  y  x   7
8   y  0.777962  z  y   8
9   y  0.757983  z  y   9
...
39  y  0.778883  x  y  39
40  z  0.651676  y  z  40
41  z  0.136097  y  z  41
42  z  0.544838  y  x  42
43  z  0.035073  y  x  43
44  z  0.275079  z  y  44
45  z  0.706685  z  y  45
46  z  0.713614  z  z  46
47  z  0.776050  z  z  47

Solution #1

You can use np.random.choice to pick one

f = dict(
    ID=dict(Count='count', Sample=np.random.choice),
    AMT=dict(Sum='sum', Max='max', Min='min')
)

df.groupby(['A', 'B', 'C']).agg(f)

            AMT                         ID      
            Sum       Max       Min Sample Count
A B C                                           
x x x  2.458188  0.866059  0.407554     25     4
    y  1.993843  0.691271  0.377185     27     4
  y x  3.070036  0.850445  0.700900      7     4
    z  1.139663  0.462691  0.016545     28     4
y x x  2.824838  0.926879  0.253200     13     4
    y  2.166114  0.778883  0.117642     39     4
  z y  2.351120  0.796487  0.018688      8     4
    z  3.367248  0.934829  0.700566     10     4
z y x  1.118176  0.544838  0.035073     19     4
    z  1.133523  0.651676  0.136097     16     4
  z y  1.870361  0.706685  0.275079     44     4
    z  2.412484  0.836997  0.085823     47     4

Solution #2

pick more than 1

You can use pd.DataFrame.sample to take whole slices of df

df.groupby(['A', 'B', 'C']).apply(pd.DataFrame.sample, n=2)

          A       AMT  B  C  ID
A B C                          
x x x 25  x  0.866059  x  x  25
      0   x  0.444939  x  x   0
    y 26  x  0.691271  x  y  26
      27  x  0.377185  x  y  27
  y x 6   x  0.850445  y  x   6
      31  x  0.700946  y  x  31
    z 28  x  0.225146  y  z  28
      29  x  0.435280  y  z  29
y x x 13  y  0.926879  x  x  13
      37  y  0.253200  x  x  37
    y 38  y  0.548054  x  y  38
      39  y  0.778883  x  y  39
  z y 33  y  0.018688  z  y  33
      32  y  0.796487  z  y  32
    z 11  y  0.831104  z  z  11
      10  y  0.934829  z  z  10
z y x 42  z  0.544838  y  x  42
      43  z  0.035073  y  x  43
    z 41  z  0.136097  y  z  41
      17  z  0.199844  y  z  17
  z y 20  z  0.278735  z  y  20
      45  z  0.706685  z  y  45
    z 22  z  0.085823  z  z  22
      47  z  0.776050  z  z  47

Upvotes: 2

Martin Valgur
Martin Valgur

Reputation: 6302

There were a couple of issues with the code you tried to use.

  • lambda x: x.sample returns the function itself instead of calling it (do lambda x: x.sample()).
  • x.sample() returns a Series object. .agg() excepts a single scalar value or a list, so that causes an exception. You could do x.sample().tolist() but now you have a list instead of a single value. It's more convenient to use numpy.random.choice() here.
  • Specifying the same column multiple times in the dictionary passed to .agg() does not work. Only the last one is going to stick. Assign the functions you want to use as a list or dict to that column instead.

Taking the above into consideration, you'll reach this solution:

import numpy as np
df = df.groupby(['A', 'B', 'C']).agg({'ID': ('count', np.random.choice), 'AMT': 'sum'})

Upvotes: 2

ℕʘʘḆḽḘ
ℕʘʘḆḽḘ

Reputation: 19375

given that you aggregate using agg, you must return a single element. So you can try using

'ID': lambda x: x.sample.tolist()

Upvotes: 0

Related Questions