Reputation: 3379
Say I have the following dataframe:
>>> import numpy as np
>>> import pandas as pd
>>> Data = pd.DataFrame()
>>> Data['Color'] = ['Green','Green','Green','Blue','Blue','Green','Green','Blue','Green','Yellow']
>>> Data['Count'] = np.random.randint(0, 100, 10)
>>> Data
    Color  Count
0   Green     75
1   Green     53
2   Green     89
3    Blue     66
4    Blue     45
5   Green     98
6   Green      7
7    Blue     28
8   Green     28
9  Yellow      7
How can I group by both the column "Color" and by groups formed by adjacent records sharing the same value in that column? For example, my desired output would be the following (the function applied to the 'Count' column is arbitrary; here I used sum):
              Value
Group Color
0     Blue      111
      Green     217
      Yellow      7
1     Blue       28
      Green     105
2     Green      28
The first 3 records all have the same colour and are therefore in group 0 for Green. The next 2 records share a colour and are therefore in group 0 for Blue. Then there are 2 more Green records, which puts them in group 1 for Green, and so on.
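For concreteness, here is a plain-Python sketch of the grouping I'm after, using `itertools.groupby` to find runs (the counts are hard-coded to match the table above):

```python
from itertools import groupby
from collections import defaultdict

colors = ['Green', 'Green', 'Green', 'Blue', 'Blue',
          'Green', 'Green', 'Blue', 'Green', 'Yellow']
counts = [75, 53, 89, 66, 45, 98, 7, 28, 28, 7]

seen = defaultdict(int)    # how many runs of each colour we've passed
result = defaultdict(int)  # (group, colour) -> summed count
i = 0
for color, run in groupby(colors):
    length = len(list(run))
    group = seen[color]        # 0 for the first run, 1 for the second, ...
    seen[color] += 1
    result[(group, color)] += sum(counts[i:i + length])
    i += length

print(dict(result))
# {(0, 'Green'): 217, (0, 'Blue'): 111, (1, 'Green'): 105,
#  (1, 'Blue'): 28, (2, 'Green'): 28, (0, 'Yellow'): 7}
```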
Upvotes: 3
Views: 183
Reputation: 353409
This one's a bit tricky. IIUC, you can get the result you want like this:
>>> df = pd.DataFrame({'Colour': ['Green', 'Green', 'Green', 'Blue', 'Blue',
...                               'Green', 'Green', 'Blue', 'Green', 'Yellow'],
...                    'Count': [75, 53, 89, 66, 45, 98, 7, 28, 28, 7]})
>>> cid = (df["Colour"] != df["Colour"].shift()).cumsum()
>>> df["Group"] = (cid.groupby(df["Colour"]).rank("dense") - 1).astype(int)
>>> df.groupby(["Group", "Colour"]).sum()
              Count
Group Colour
0     Blue      111
      Green     217
      Yellow      7
1     Blue       28
      Green     105
2     Green      28
This works because cid is a "consecutive-cluster" id built with the shift-compare-cumsum pattern: comparing each colour against the previous row marks the start of every run, and the cumulative sum of those marks assigns a distinct id to each run. Once we have cid, we group the cluster ids by colour and rank them densely (subtracting 1) to get zero-based group numbers within each colour.
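To make the intermediate steps concrete, here is a minimal walkthrough of the pattern on the same colours, printing cid and the per-colour group numbers:

```python
import pandas as pd

colours = pd.Series(['Green', 'Green', 'Green', 'Blue', 'Blue',
                     'Green', 'Green', 'Blue', 'Green', 'Yellow'])

# True wherever the colour differs from the previous row (the first row
# compares against NaN, so it is always True); cumsum then gives every
# consecutive run of equal colours its own id.
boundaries = colours != colours.shift()
cid = boundaries.cumsum()
print(cid.tolist())    # [1, 1, 1, 2, 2, 3, 3, 4, 5, 6]

# Within each colour, dense-rank the run ids so the first run of a colour
# becomes 0, the second 1, and so on (rank returns floats, hence astype).
group = (cid.groupby(colours).rank("dense") - 1).astype(int)
print(group.tolist())  # [0, 0, 0, 0, 0, 1, 1, 1, 2, 0]
```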
Upvotes: 2