Reputation: 3379
Say I have the following dataframe:
>>> import numpy as np
>>> import pandas as pd
>>> Data = pd.DataFrame()
>>> Data['Color'] = ['Green','Green','Green','Blue','Blue','Green','Green','Blue','Green','Yellow']
>>> Data['Count'] = np.random.randint(0, 100, 10)
>>> Data
    Color  Count
0   Green     75
1   Green     53
2   Green     89
3    Blue     66
4    Blue     45
5   Green     98
6   Green      7
7    Blue     28
8   Green     28
9  Yellow      7
How can I group by both the column "Color" and by groups formed by adjacent records sharing the same value in that column? For example, my desired output would be the following (the function applied to the 'Count' column is arbitrary; here I used sum):
              Value
Group Color
0     Blue      111
      Green     217
      Yellow      7
1     Blue       28
      Green     105
2     Green      28
The first 3 records all have the same colour and are therefore in group 0 for Green. The next 2 records share a colour and are therefore in group 0 for Blue. Then there are 2 more Green records, which puts them in group 1 for Green, and so on.
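For concreteness, here is a plain-Python sketch of the grouping I'm after, using `itertools.groupby` to find runs (the counts are hard-coded to match the table above):

```python
from itertools import groupby
from collections import defaultdict

colors = ['Green', 'Green', 'Green', 'Blue', 'Blue',
          'Green', 'Green', 'Blue', 'Green', 'Yellow']
counts = [75, 53, 89, 66, 45, 98, 7, 28, 28, 7]

seen = defaultdict(int)    # how many runs of each colour we've passed
result = defaultdict(int)  # (group, colour) -> summed count
i = 0
for color, run in groupby(colors):
    length = len(list(run))
    group = seen[color]        # 0 for the first run, 1 for the second, ...
    seen[color] += 1
    result[(group, color)] += sum(counts[i:i + length])
    i += length

print(dict(result))
# {(0, 'Green'): 217, (0, 'Blue'): 111, (1, 'Green'): 105,
#  (1, 'Blue'): 28, (2, 'Green'): 28, (0, 'Yellow'): 7}
```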
Upvotes: 3
Views: 183
Reputation: 353409
This one's a bit tricky. IIUC, you can get the result you want like this:
>>> df = pd.DataFrame({'Colour': ['Green', 'Green', 'Green', 'Blue', 'Blue',
...                               'Green', 'Green', 'Blue', 'Green', 'Yellow'],
...                    'Count': [75, 53, 89, 66, 45, 98, 7, 28, 28, 7]})
>>> cid = (df["Colour"] != df["Colour"].shift()).cumsum()
>>> df["Group"] = (cid.groupby(df["Colour"]).rank("dense") - 1).astype(int)
>>> df.groupby(["Group", "Colour"]).sum()
              Count
Group Colour
0     Blue      111
      Green     217
      Yellow      7
1     Blue       28
      Green     105
2     Green      28
This works because cid is a "consecutive-cluster" id built with the shift-compare-cumsum pattern: comparing each colour against the previous row marks the start of every run, and the cumulative sum of those marks assigns a distinct id to each run. Once we have cid, we group the cluster ids by colour and rank them densely (subtracting 1) to get zero-based group numbers within each colour.
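To make the intermediate steps concrete, here is a minimal walkthrough of the pattern on the same colours, printing cid and the per-colour group numbers:

```python
import pandas as pd

colours = pd.Series(['Green', 'Green', 'Green', 'Blue', 'Blue',
                     'Green', 'Green', 'Blue', 'Green', 'Yellow'])

# True wherever the colour differs from the previous row (the first row
# compares against NaN, so it is always True); cumsum then gives every
# consecutive run of equal colours its own id.
boundaries = colours != colours.shift()
cid = boundaries.cumsum()
print(cid.tolist())    # [1, 1, 1, 2, 2, 3, 3, 4, 5, 6]

# Within each colour, dense-rank the run ids so the first run of a colour
# becomes 0, the second 1, and so on (rank returns floats, hence astype).
group = (cid.groupby(colours).rank("dense") - 1).astype(int)
print(group.tolist())  # [0, 0, 0, 0, 0, 1, 1, 1, 2, 0]
```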
Upvotes: 2