Reputation: 333
This is my first step with pandas and data manipulation. I got basketball games statistics from kaggle (https://www.kaggle.com/pablote/nba-enhanced-stats). the dataset contains a datetime columns going from oct 31 2012 to oct 31 2018.
I'd like to group my rows per date range (nba season to be more precise), that means I'd have: - 1st group from oct 31 2012 to july 31 2013 - 2nd group from oct 31 2013 to july 31 2014 and so on till - nth group from oct 21 2017 to july 31 2018.
I read some thread mentioning groupby, date range but I'm not sure if this would work in my case.
Below is the code (jupyter notebook) I'm using right now
import pandas as pd
df1 = pd.read_csv('2012-18_teamBoxScore.csv')
df1.shape, df1.keys()
((14758, 123),
Index(['gmDate', 'gmTime', 'seasTyp', 'offLNm1', 'offFNm1', 'offLNm2',
'offFNm2', 'offLNm3', 'offFNm3', 'teamAbbr',
...
'opptFIC40', 'opptOrtg', 'opptDrtg', 'opptEDiff', 'opptPlay%', 'opptAR',
'opptAST/TO', 'opptSTL/TO', 'poss', 'pace'],
dtype='object', length=123))
df1['gmDate'] = pd.to_datetime(df1['gmDate']) # convert dmDate from string to datetime
# get data from a specific team
gs_df = df1[['gmDate', 'gmTime', 'teamAbbr', 'teamLoc', 'teamRslt', 'opptAbbr', 'opptLoc','opptRslt']][(df1.teamAbbr == 'GS')]
gs_df.shape
(492, 8)
gs_df.groupby(pd.Grouper(key='gmDate', freq='BA-SEP')).groups
{Timestamp('2013-09-30 00:00:00', freq='BA-SEP'): 82,
Timestamp('2014-09-30 00:00:00', freq='BA-SEP'): 164,
Timestamp('2015-09-30 00:00:00', freq='BA-SEP'): 246,
Timestamp('2016-09-30 00:00:00', freq='BA-SEP'): 328,
Timestamp('2017-09-29 00:00:00', freq='BA-SEP'): 410,
Timestamp('2018-09-28 00:00:00', freq='BA-SEP'): 492}
Here we can see the grouping is actually a cumulative grouping whereas I'd like each group to have 82 rows
What would be the best approach to do this split?
Thanks for your feedbacks
S,
Upvotes: 0
Views: 420
Reputation: 3184
I think in this situation, I would like to make a column with the actual season in it, like season 2017 for example, and then make a multiindex using season, team, games. With this approach, you can slice and dice the data however you wish.
import pandas as pd
import numpy as np
df1 = pd.read_csv('2012-18_teamBoxScore.csv')
# convert dmDate from string to datetime
df1['gmDate'] = pd.to_datetime(df1['gmDate'])
# list of start dates, and the end date to the last year
dates = [
"2012-09-30",
"2013-09-30",
"2014-09-30",
"2015-09-30",
"2016-09-30",
"2017-09-29",
"2018-09-28",
"2019-09-28",
]
# make dates datetime
dates = pd.to_datetime(dates)
seasons = ['2012', '2013', '2014', '2015', '2016', '2017', '2018']
# create a season column using cut
df1['season'] = pd.cut(df1['gmDate'], dates, labels=seasons)
# a multiindex is easier to work with
df1.set_index(['teamAbbr', 'season', 'gmDate'], inplace=True)
columns = ['gmTime', 'teamLoc', 'teamRslt', 'opptAbbr', 'opptLoc','opptRslt']
# use an [index slicer][1] to get whatever data you want.
idx = pd.IndexSlice
df1.loc[idx[['CLE', 'DAL'], ['2012', '2014'], slice(None)],columns].head(2)
For result:
gmTime teamLoc teamRslt opptAbbr opptLoc opptRslt
teamAbbr season gmDate
CLE 2012 2012-10-30 19:00 Home Win WAS Away Loss
DAL 2012 2012-10-30 22:30 Away Win LAL Home Loss
# if you want to see all seasons or all teams, just replace with slice(None)
df1.loc[idx[['CLE'], slice(None), slice(None)],columns]
gmTime teamLoc teamRslt opptAbbr opptLoc opptRslt
teamAbbr season gmDate
CLE 2012 2012-10-30 19:00 Home Win WAS Away Loss
2012-11-02 19:30 Home Loss CHI Away Win
2012-11-03 20:30 Away Loss MIL Home Win
Upvotes: 1
Reputation: 23
You could use the pandas cut method as explained here Pandas Groupby Range of Values . You might need to convert the dates to a numeric timestamp though prior to using the cut function.
Upvotes: 1