Reputation: 398

grouping time-series data based on starting and ending date

I have time-series data of a yearly sports tournament, with the date when each game was played. I want to group the games by the season(year) they were played in. Each season starts in August and ends the NEXT year in july.

How would I go about grouping the games by season, like - season(2016-2017), season(2017-2018), etc..

This Answer involving df.resample() may be related, but I'm not sure how I'd go about doing it.

This is what the date column looks like:

DATE
26/09/09
04/10/09
17/10/09
25/10/09
31/10/09
  ...   
29/09/18
07/10/18
28/10/18
03/11/18

I want to group by seasons so that I can perform visualization operations over the aggregated data.

UPDATE: For the time being my solution is to split up the dataframe into groups of 32 as I know each season has 32 games. This is the code I've used:

split_df = np.array_split(df, np.arange(0, len(df),32))

But I'd rather prefer something more elegant and more inclusive of time-series data so I'll keep the question open.

Upvotes: 3

Answers (4)

Valdi_Bo

Reputation: 30971

The key to success is proper grouping, in your case pd.Grouper(key='DATA', freq='AS-AUG').

Note that freq='AS-AUG' states that your groups should start from the start of August each year.

Look at the following script:

import pandas as pd

# Source columns
dates = [ '01/04/09', '31/07/09', '01/08/09', '26/09/09', '04/10/09', '17/12/09', 
    '25/01/10', '20/04/10', '31/07/10', '01/08/10', '28/10/10', '03/11/10',
    '25/12/10', '20/04/11', '31/07/11' ]
scores_x = np.random.randint(0, 20, len(dates))
scores_y = np.random.randint(0, 20, len(dates))
# Source DataFrame
df = pd.DataFrame({'DATA': dates, 'SCORE_X': scores_x, 'SCORE_Y': scores_y})
# Convert string date to datetime
df.DATA = pd.to_datetime(df.DATA, format='%d/%m/%y')
# Grouping
gr = df.groupby(pd.Grouper(key='DATA', freq='AS-AUG'))

If you print the results:

for name, group in gr:
    print()
    print(name)
    print(group)

you will get:

2008-08-01 00:00:00
        DATA  SCORE_X  SCORE_Y
0 2009-04-01       16       11
1 2009-07-31       10        7

2009-08-01 00:00:00
        DATA  SCORE_X  SCORE_Y
2 2009-08-01       19        6
3 2009-09-26       14        5
4 2009-10-04        8       11
5 2009-12-17       12       19
6 2010-01-25        0        0
7 2010-04-20       17        6
8 2010-07-31       18        2

2010-08-01 00:00:00
         DATA  SCORE_X  SCORE_Y
9  2010-08-01       15       18
10 2010-10-28        2        4
11 2010-11-03        8       16
12 2010-12-25       13        1
13 2011-04-20       19        7
14 2011-07-31        8        3

As you can see, each group starts just on 1-st of August and ends on 31-st of July.

They you can do with your groups whatever you want.

Upvotes: 2

Justinas Marozas

Reputation: 2684

You could build a season column and group by that. In below code, I used pandas.DateOffset() to move all dates 7 months back so a game that happened in August would look like it happened in January to align the season year with the calendar year. Building season string is fairly straightforward after that.

import pandas as pd
from datetime import date

dates = pd.date_range(date(2009, 8, 1), date(2018, 7, 30), freq='17d')
df = pd.DataFrame(dates, columns=['date'])

# copy the date column to a separate dataframe to do the work
df_tmp = df[['date']]
df_tmp['season_start_year'] = (df_tmp['date'] - pd.DateOffset(months=7)).dt.year
df_tmp['season_end_year'] = df_tmp['season_start_year'] + 1
df_tmp['season'] = df_tmp['season_start_year'].map(str) + '-' + df_tmp['season_end_year'].map(str)

# copy season column to the main dataframe
df['season'] = df_tmp['season']

df.groupby('season').count()

Upvotes: 0

gosuto

Reputation: 5741

Resampling using 'A-JUL' as an anchored offset alias should do the trick:

>>> df
            SAMPLE
DATE              
2009-01-30       1
2009-07-10       4
2009-11-20       3
2010-01-01       5
2010-05-13       1
2010-08-01       1
>>> df.resample('A-JUL').sum()
            SAMPLE
DATE              
2009-07-31       5
2010-07-31       9
2011-07-31       1

A indicates it is a yearly interval, -JUL indicates it ends in July.

Upvotes: 1

Vivek Kalyanarangan

Reputation: 9081

Use -

df.groupby(df['DATE'].dt.year).count()

Output

    DATE
DATE    
2009    5
2018    4

Custom Season Grouping

min_year = df['DATE'].dt.year.min()
max_year = df['DATE'].dt.year.max()
rng = pd.date_range(start='{}-07'.format(min_year), end='{}-08'.format(max_year), freq='12M').to_series()
df.groupby(pd.cut(df['DATE'], rng)).count()

Output

    DATE
DATE    
(2009-07-31, 2010-07-31]    3
(2010-07-31, 2011-07-31]    0
(2011-07-31, 2012-07-31]    0
(2012-07-31, 2013-07-31]    0
(2013-07-31, 2014-07-31]    0
(2014-07-31, 2015-07-31]    0
(2015-07-31, 2016-07-31]    0
(2016-07-31, 2017-07-31]    0
(2017-07-31, 2018-07-31]    1

Upvotes: 1

grouping time-series data based on starting and ending date

Answers (4)

Related Questions