Groupby by two differents options simultaneously

Question

Good morning.

I've a pandas dataframe like the following:

df = 
     p   f   c     a
 0   1   2   1    16.32
 1   1   2   2    48
 2   1   2   3    60
 3   1   2   4   112
 4   1   2   5    52
 5   1   3   6   288
 6   1   4   7   201
 7   1   4   8    52
 8   1   4   4    44
 9   1   5   7   251.2
10   1   5   9   220
11   1   5   8    83
12   1   5  10   142
13   2   1  11   100
14   2   1  12   110
15   2   2  11   120
16   2   2  13   130
17   2   3  13   140
18   2   3  14   150
19   2   4  12   160

And I want to do a groupby along columns c and a, but grouping c using something like SQL's COUNT(DISTINCT) and grouping a using sum(), in a way my result will be:

df_result = 
    p   f   c      a
0   1   2   5    288.32
1   1   3   6    576.32
2   1   4   8    873.92
3   1   5  10   1570.12
4   2   1   2    210
5   2   2   3    460
6   2   3   4    750
7   2   4   4    910

But I can't reach that result trying differents combinations of groupby and stack.

EDIT Take into account that column 'c' stores ID numbers, so the ascending order is just an example, so max aggregate wouldn't work. Sorry for don't say it before.

I think that a possible solution would be split it in two differents dataframes, grouping and then merge, but I'm not sure if this is the best solution.

Thanks you very much in advance.

jezrael · Accepted Answer

You need aggregate list and sum first, then call DataFrame.cumsum:

df = df.groupby('f').agg({'c':list, 'a':'sum'}).cumsum()
print (df)
                                                   c        a
f                                                            
2                           [154, 215, 1, 8000, 214]   288.32
3                      [154, 215, 1, 8000, 214, 640]   576.32
4      [154, 215, 1, 8000, 214, 640, 780, 830, 8000]   873.32
5  [154, 215, 1, 8000, 214, 640, 780, 830, 8000, ...  1569.52

And last get length of unique values per lists:

df['c'] = df['c'].apply(lambda x: len(set(x)))
df = df.reset_index()

print (df)
   f   c        a
0  2   5   288.32
1  3   6   576.32
2  4   8   873.32
3  5  10  1569.52

EDIT:

df = (df.groupby(['p','f']).agg({'c':list, 'a':'sum'})
        .groupby('p').apply(np.cumsum))
df['c'] = df['c'].apply(lambda x: len(set(x)))
df = df.reset_index()
print (df)
   p  f   c        a
0  1  2   5   288.32
1  1  3   6   576.32
2  1  4   8   873.32
3  1  5  10  1569.52
4  2  1   2      210
5  2  2   3      460
6  2  3   4      750
7  2  4   4      910

Groupby by two differents options simultaneously

Answers (1)

Related Questions