Reputation: 534
I have some data that looks like this:
Season Team TEAM_ID start end
0 1984-85 CHI 1610612741 1984 1985
1 1985-86 CHI 1610612741 1985 1986
2 1986-87 CHI 1610612741 1986 1987
3 1987-88 CHI 1610612741 1987 1988
4 1988-89 CHI 1610612741 1988 1989
5 1989-90 CHI 1610612741 1989 1990
6 1990-91 CHI 1610612741 1990 1991
7 1991-92 CHI 1610612741 1991 1992
8 1992-93 CHI 1610612741 1992 1993
9 1994-95 CHI 1610612741 1994 1995
10 1995-96 CHI 1610612741 1995 1996
11 1996-97 CHI 1610612741 1996 1997
12 1997-98 CHI 1610612741 1997 1998
13 2001-02 WAS 1610612764 2001 2002
14 2002-03 WAS 1610612764 2002 2003
I'm looking for a way to group the team and team id columns together and get the minimum start value and maximum end column. For the above data, it would be
Team TEAM_ID Years
CHI 1610612741 1984-93
CHI 1610612741 1994-98
WAS 1610612764 2001-03
For someone that has multiple teams in one year,
Season Team TEAM_ID start end
0 2003-04 MIA 1610612748 2003 2004
1 2004-05 MIA 1610612748 2004 2005
2 2005-06 MIA 1610612748 2005 2006
3 2006-07 MIA 1610612748 2006 2007
4 2007-08 MIA 1610612748 2007 2008
5 2008-09 MIA 1610612748 2008 2009
6 2009-10 MIA 1610612748 2009 2010
7 2010-11 MIA 1610612748 2010 2011
8 2011-12 MIA 1610612748 2011 2012
9 2012-13 MIA 1610612748 2012 2013
10 2013-14 MIA 1610612748 2013 2014
11 2014-15 MIA 1610612748 2014 2015
12 2015-16 MIA 1610612748 2015 2016
13 2016-17 CHI 1610612741 2016 2017
14 2017-18 CLE 1610612739 2017 2018
15 2017-18 MIA 1610612748 2017 2018
17 2018-19 MIA 1610612748 2018 2019
I'd like it to look like this:
Team TEAM_ID Years
MIA 1610612748 2003-16
CHI 1610612741 2016-17
CLE 1610612739 2017-17
MIA 1610612748 2017-19
Does anyone know how to do that? I tried using pandas.group_by
but it would group the same teams together as one and I'd like to keep them separate
Upvotes: 3
Views: 132
Reputation: 28644
Another solution to ur question : It uses a combination of Pandas' shift method to find differences between rows and the groupby function
def grouping(df):
#condition checks if row - previous row is not equal to 1 (end column)
#or row not equal to previous row for the Team column
cond = df.end.sub(df.end.shift()).ne(1) | (df.Team.ne(df.Team.shift()))
#get rows where the end year does not change
no_year_end_change = df.end.shift(-1).sub(df.end).eq(0)
#create a new column to get values from the start column based on the condition
df['change'] = df.loc[cond,'start']
#create a new column to get values from the end column based on the condition
df['end_edit'] = np.where(no_year_end_change,df.start,df.end)
#integer conversion... gets rids of the float 0s
df['change'] = df.change.ffill().astype('Int64')
#groupby, get the max of the end column
df = df.groupby(['Team','TEAM_ID','change']).end_edit.max().reset_index()
#combine change and end columns using Pandas' str cat function
df['Years'] = df.change.astype(str).str.cat(df.end_edit.astype(str),sep='-')
df = df.drop(['change','end_edit'],axis = 1)
return df
First dataframe:
df.pipe(grouping)
Team TEAM_ID Years
0 CHI 1610612741 1984-1993
1 CHI 1610612741 1994-1998
2 WAS 1610612764 2001-2003
Second dataframe:
df1.pipe(grouping)
Team TEAM_ID Years
0 CHI 1610612741 2016-2017
1 CLE 1610612739 2017-2017
2 MIA 1610612748 2003-2016
3 MIA 1610612748 2017-2019
Upvotes: 2
Reputation: 14738
One way is to use nested groupby
to identify consecutive seasons within a team:
def func(df):
# indicator of consecutive seasons
g = (df['start'] > df['end'].shift(1)).cumsum()
res = df.groupby(g).apply(
lambda x: str(x['start'].min()) + '-' + str(x['end'].max())[-2:],
)
res.name = 'Years'
return res
df.groupby(['Team', 'TEAM_ID']).apply(func).reset_index()[['Team', 'TEAM_ID', 'Years']]
Output:
Team TEAM_ID Years
0 CHI 1610612741 2016-17
1 CLE 1610612739 2017-18
2 MIA 1610612748 2003-16
3 MIA 1610612748 2017-19
Upvotes: 3