Pandas Dataframe.groupby() for separate groups of same value

Question

I have a dataframe of weather data and want to be able to group the data by season:

            yyyy  mm   rainfall  season
date                                                             
1941-01-01  1941   1   74.7      Winter
1941-02-01  1941   2   69.1      Winter
1941-03-01  1941   3   76.2      Spring
1941-04-01  1941   4   33.7      Spring
1941-05-01  1941   5   51.3      Spring
1941-06-01  1941   6   25.7      Summer
1941-07-01  1941   7   53.9      Summer
1941-08-01  1941   8   91.8      Summer
1941-09-01  1941   9   25.5      Autumn
1941-10-01  1941  10   106.2     Autumn
1941-11-01  1941  11   92.3      Autumn
1941-12-01  1941  12   86.5      Winter
1942-01-01  1942   1   114.0     Winter
1942-02-01  1942   2   13.8      Winter
1942-03-01  1942   3   58.0      Spring
1942-04-01  1942   4   42.5      Spring
1942-05-01  1942   5  101.1      Spring
1942-06-01  1942   6   2.3       Summer
1942-07-01  1942   7   70.2      Summer
1942-08-01  1942   8   78.5      Summer
1942-09-01  1942   9  146.8      Autumn
1942-10-01  1942  10  131.1      Autumn
1942-11-01  1942  11   19.8      Autumn
1942-12-01  1942  12  183.9      Winter

How would I use df.groupby() to get the following:

            season   rainfall  
date                                                             
1941-01-01  Winter   [74.7, 69.1]      
1941-03-01  Spring   [76.2,33.7,51.3]  
1941-06-01  Summer   [25.7,53.9,91.8]      
1941-09-01  Autumn   [25.5,106.2,92.3]      
1941-12-01  Winter   [86.5,114.0,13.8]      
1942-03-01  Spring   [58.0,42.5,101.1]      
1942-06-01  Summer   [2.3,70.2,78.5]       
1942-09-01  Autumn   [146.8,131.1,19.8]      
1942-12-01  Winter   [183.9 ]

I tried df.groupby(['season'])['rainfall'] but that combines every winter, whereas I only want to group each cluster of winter etc.

chrisb · Accepted Answer

This isn't directly supported in pandas at the moment (see issue here), but there's something of a trick for doing it.

The following is used mark contiguous groups, the comparison/shift is comparing each season to the prior value to mark the changes as 1, then the cumsum creates new groups for each change.

df['marker'] = (df['season'] != df['season'].shift()).cumsum()

From there, you can groupby and do whatever you want - for example for your case:

df.groupby('marker').agg({'date': 'first',  'season': 'first', 
                          'rainfall': lambda x: list(x)})
Out[14]: 
             date  season              rainfall
marker                                         
1      1941-01-01  Winter          [74.7, 69.1]
2      1941-03-01  Spring    [76.2, 33.7, 51.3]
3      1941-06-01  Summer    [25.7, 53.9, 91.8]
4      1941-09-01  Autumn   [25.5, 106.2, 92.3]
5      1941-12-01  Winter   [86.5, 114.0, 13.8]
6      1942-03-01  Spring   [58.0, 42.5, 101.1]
7      1942-06-01  Summer     [2.3, 70.2, 78.5]
8      1942-09-01  Autumn  [146.8, 131.1, 19.8]
9      1942-12-01  Winter               [183.9]

Pandas Dataframe.groupby() for separate groups of same value

Answers (1)

Related Questions