Alex Poca

Reputation: 2566

Pandas: how to group on column change?

I am working with a log system, and I need to group data in a non-standard way. With my limited knowledge of Pandas I couldn't find any example, probably because I don't know the proper search terms.

This is a sample dataframe:

df = pd.DataFrame({
    "speed": [2, 4, 6, 8, 8, 9, 2, 3, 8, 9, 13, 18, 25, 27, 18, 8, 6, 8, 12, 20, 27, 34, 36, 41, 44, 54, 61, 60, 61, 40, 17, 12, 15, 24], 
    "class": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 5, 5, 3, 1, 1, 1, 2]
})

df.groupby(by="class").groups returns the indexes of each row, all grouped together by class value:

class  indexes
1: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 15, 16, 17, 18, 30, 32],
2: [12, 13, 19, 20, 21, 22, 33],  
3: [23, 24, 29],
4: [25],
5: [26, 27, 28]

Instead, I need to split every time the class column changes:

    speed  class
0       2      1
1       4      1
2       6      1
3       8      1
4       8      1
5       9      1
6       2      1
7       3      1
8       8      1
9       9      1
10     13      1
11     18      1
12     25      2  <= split here
13     27      2
14     18      1  <= split here
15      8      1
16      6      1
17      8      1
18     12      1  <= split here
19     20      2
20     27      2
21     34      2
22     36      2  <= split here
23     41      3
24     44      3  <= split here
25     54      4  <= split here
26     61      5
27     60      5
28     61      5  <= split here
29     40      3  <= split here
30     17      1  <= split here
31     12      1
32     15      1
33     24      2  <= split here

The desired grouping should return something like:

   class  count   mean
0    1      12     7.50
1    2       2    26.00
2    1       5    10.40
3    2       4    29.25
4    3       2    42.50
5    4       1    54.00
6    5       3    60.66
7    3       1    40.00
8    1       3    14.66
9    2       1    24.00

Is there any way to do this without iterating?

Upvotes: 2

Views: 491

Answers (2)

sacuL

Reputation: 51425

You can groupby the cumsum of a boolean marking where the class column differs from the value above it:

df.groupby(df["class"].diff().ne(0).cumsum()).speed.agg(['size', 'mean'])

       size       mean
class
1        12   7.500000
2         2  26.000000
3         5  10.400000
4         4  29.250000
5         2  42.500000
6         1  54.000000
7         3  60.666667
8         1  40.000000
9         3  14.666667
10        1  24.000000
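The grouping key here is just a run identifier. A minimal sketch of how diff().ne(0).cumsum() builds it, using a short made-up class series for illustration:

```python
import pandas as pd

cls = pd.Series([1, 1, 2, 2, 1, 3, 3])

# diff() is NaN on the first row and non-zero wherever the value changes;
# ne(0) marks those positions True (NaN != 0 is True, so row 0 opens run 1);
# cumsum() then numbers each consecutive run of equal values.
run_id = cls.diff().ne(0).cumsum()
print(run_id.tolist())  # [1, 1, 2, 2, 3, 4, 4]
```

Note that repeated class values later in the series (the second run of 1) get a fresh run number, which is exactly what separates this from a plain groupby on class.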

Update: I hadn't seen how you wanted the class column: what you can do is group by the original class column as well as the cumsum above, and do a bit of index-sorting and resetting (but at this point this answer just converges with @jezrael's answer :P)

result = (
    df.groupby(["class", df["class"].diff().ne(0).cumsum()])
    .speed.agg(["size", "mean"])
    .sort_index(level=1)
    .reset_index(level=0)
    .reset_index(drop=True)
)

   class  size       mean
0      1    12   7.500000
1      2     2  26.000000
2      1     5  10.400000
3      2     4  29.250000
4      3     2  42.500000
5      4     1  54.000000
6      5     3  60.666667
7      3     1  40.000000
8      1     3  14.666667
9      2     1  24.000000

Upvotes: 3

jezrael

Reputation: 863701

Compare the class column with its shifted values using Series.ne, take the Series.cumsum, and aggregate by GroupBy.agg:

g = df["class"].ne(df["class"].shift()).cumsum()
df = (df.groupby(['class', g], sort=False)['speed'].agg(['count','mean'])
        .reset_index(level=1, drop=True)
        .reset_index())
print(df)
   class  count       mean
0      1     12   7.500000
1      2      2  26.000000
2      1      5  10.400000
3      2      4  29.250000
4      3      2  42.500000
5      4      1  54.000000
6      5      3  60.666667
7      3      1  40.000000
8      1      3  14.666667
9      2      1  24.000000
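As a sanity check (not part of either answer), the same consecutive-run segmentation can be reproduced with plain itertools.groupby on the question's data, since it also starts a new group whenever the key changes:

```python
from itertools import groupby

speed = [2, 4, 6, 8, 8, 9, 2, 3, 8, 9, 13, 18, 25, 27, 18, 8, 6, 8,
         12, 20, 27, 34, 36, 41, 44, 54, 61, 60, 61, 40, 17, 12, 15, 24]
klass = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1,
         1, 2, 2, 2, 2, 3, 3, 4, 5, 5, 5, 3, 1, 1, 1, 2]

# Group consecutive rows that share a class value; a new group starts
# every time the key (the class) changes.
runs = []
for c, grp in groupby(zip(klass, speed), key=lambda t: t[0]):
    vals = [s for _, s in grp]
    runs.append((c, len(vals), sum(vals) / len(vals)))

print(runs[0])  # (1, 12, 7.5)
```

This yields the same 10 (class, count, mean) rows as the pandas solutions above, confirming the run boundaries.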

Upvotes: 4
