Simon
Simon

Reputation: 333

How to group pandas dataframe rows per range of date

This is my first step with pandas and data manipulation. I got basketball games statistics from kaggle (https://www.kaggle.com/pablote/nba-enhanced-stats). the dataset contains a datetime columns going from oct 31 2012 to oct 31 2018.

I'd like to group my rows per date range (nba season to be more precise), that means I'd have: - 1st group from oct 31 2012 to july 31 2013 - 2nd group from oct 31 2013 to july 31 2014 and so on till - nth group from oct 21 2017 to july 31 2018.

I read some thread mentioning groupby, date range but I'm not sure if this would work in my case.

Below is the code (jupyter notebook) I'm using right now

import pandas as pd
df1 = pd.read_csv('2012-18_teamBoxScore.csv')
df1.shape, df1.keys()
    ((14758, 123),
     Index(['gmDate', 'gmTime', 'seasTyp', 'offLNm1', 'offFNm1', 'offLNm2',
    'offFNm2', 'offLNm3', 'offFNm3', 'teamAbbr',
    ...
    'opptFIC40', 'opptOrtg', 'opptDrtg', 'opptEDiff', 'opptPlay%', 'opptAR',
    'opptAST/TO', 'opptSTL/TO', 'poss', 'pace'],
    dtype='object', length=123))
df1['gmDate'] = pd.to_datetime(df1['gmDate'])  # convert dmDate from string to datetime
# get data from a specific team
gs_df = df1[['gmDate', 'gmTime', 'teamAbbr', 'teamLoc', 'teamRslt', 'opptAbbr', 'opptLoc','opptRslt']][(df1.teamAbbr == 'GS')]
gs_df.shape
    (492, 8)
gs_df.groupby(pd.Grouper(key='gmDate', freq='BA-SEP')).groups
    {Timestamp('2013-09-30 00:00:00', freq='BA-SEP'): 82,
     Timestamp('2014-09-30 00:00:00', freq='BA-SEP'): 164,
     Timestamp('2015-09-30 00:00:00', freq='BA-SEP'): 246,
     Timestamp('2016-09-30 00:00:00', freq='BA-SEP'): 328,
     Timestamp('2017-09-29 00:00:00', freq='BA-SEP'): 410,
     Timestamp('2018-09-28 00:00:00', freq='BA-SEP'): 492}

Here we can see the grouping is actually a cumulative grouping whereas I'd like each group to have 82 rows

What would be the best approach to do this split?

Thanks for your feedbacks

S,

Upvotes: 0

Views: 420

Answers (2)

run-out
run-out

Reputation: 3184

I think in this situation, I would like to make a column with the actual season in it, like season 2017 for example, and then make a multiindex using season, team, games. With this approach, you can slice and dice the data however you wish.

import pandas as pd
import numpy as np

df1 = pd.read_csv('2012-18_teamBoxScore.csv')

# convert dmDate from string to datetime
df1['gmDate'] = pd.to_datetime(df1['gmDate'])  

# list of start dates, and the end date to the last year
dates = [
    "2012-09-30",
    "2013-09-30",
    "2014-09-30",
    "2015-09-30",
    "2016-09-30",
    "2017-09-29",
    "2018-09-28",
    "2019-09-28",
]

# make dates datetime
dates = pd.to_datetime(dates)

seasons = ['2012', '2013', '2014', '2015', '2016', '2017', '2018']

# create a season column using cut
df1['season'] = pd.cut(df1['gmDate'], dates, labels=seasons)

# a multiindex is easier to work with
df1.set_index(['teamAbbr', 'season', 'gmDate'], inplace=True)

columns = ['gmTime', 'teamLoc', 'teamRslt', 'opptAbbr', 'opptLoc','opptRslt']

# use an [index slicer][1] to get whatever data you want.
idx = pd.IndexSlice
df1.loc[idx[['CLE', 'DAL'], ['2012', '2014'], slice(None)],columns].head(2)

For result:

                              gmTime  teamLoc  teamRslt opptAbbr opptLoc opptRslt
teamAbbr    season  gmDate                      
CLE          2012   2012-10-30  19:00   Home    Win       WAS     Away    Loss
DAL          2012   2012-10-30  22:30   Away    Win       LAL     Home    Loss


# if you want to see all seasons or all teams, just replace with slice(None)
df1.loc[idx[['CLE'], slice(None), slice(None)],columns]

                      gmTime teamLoc teamRslt opptAbbr opptLoc opptRslt
teamAbbr season gmDate                      
CLE 2012    2012-10-30  19:00   Home    Win     WAS     Away    Loss
            2012-11-02  19:30   Home    Loss    CHI     Away    Win
            2012-11-03  20:30   Away    Loss    MIL     Home    Win

Upvotes: 1

Elad Shtilerman
Elad Shtilerman

Reputation: 23

You could use the pandas cut method as explained here Pandas Groupby Range of Values . You might need to convert the dates to a numeric timestamp though prior to using the cut function.

Upvotes: 1

Related Questions