Reputation: 39
Let's say I have a df, memories
Year | Number_of_photos |
---|---|
1990 | 27 |
1991 | 21 |
1992 | 39 |
1993 | 22 |
1994 | 10 |
1995 | 12 |
1996 | 22 |
1997 | 22 |
1998 | 40 |
1999 | 27 |
2000 | 18 |
2001 | 20 |
2002 | 18 |
2003 | 14 |
2004 | 33 |
2005 | 10 |
2006 | 24 |
2015 | 17 |
And I want to create a function that will spit out a data frame, where the user will input the start year and an integer, and that integer will determine the timeframes (i.e. if start year=1998 and int=4, then the timeframes will be 1998-2001, 2002-2005, 2006-2009,2010-2013, and 2014-2017). And within those timeframes, the function will generate the max Number_of_photos (i.e. 1998-2001 will generate 40, 2002-2005 will generate 33, 2006-2009 will generate 24, 2010-2013 will generate 'None', and lastly 2014-2017 will generate 17).
Any ideas on how I to define a function that will generate a data frame?
[In] def max_photos(df,start,integer):
[Out] |Timeframe|Max_Year|Max_Num_Photos|
|---------|--------|--------------|
|1998-2001| 1998 | 40 |
|2002-2005| 2004 | 33 |
|2006-2009| 2006 | 24 |
|2010-2013| None | None |
|2014-2017| 2015 | 17 |
Upvotes: 2
Views: 492
Reputation: 863301
Use cut
with aggregate max
and Series.idxmax
, last for add missing categories add DataFrame.reindex
:
def max_photos(df,start,integer):
df = df[df['Year'] >= start].copy()
bins = range(start, df.Year.max() + integer, integer)
labels = [f'{i}-{j-1}' for i, j in zip(bins[:-1], bins[1:])]
df['g'] = pd.cut(df['Year'],bins=bins,labels=labels,include_lowest=True,right=False)
return (df.set_index('Year')
.groupby('g', observed=True)
.agg(Max_Year=('Number_of_photos','idxmax'),
Max_Num_Photos=('Number_of_photos','max'))
.reindex(df['g'].cat.categories.unique())
.rename_axis('Timeframe')
.reset_index())
print (max_photos(df, 1998, 4))
Timeframe Max_Year Max_Num_Photos
0 1998-2001 1998.0 40.0
1 2002-2005 2004.0 33.0
2 2006-2009 2006.0 24.0
3 2010-2013 NaN NaN
4 2014-2017 2015.0 17.0
EDIT: Here is solution for append new column Location
per maximal Max_Num_Photos
- first added to index in DataFrame.set_index
and last create new column from tuples by DataFrame.assign
:
#sample data
df['Location'] = 'loc' + df.index.astype(str)
print (df)
Year Number_of_photos Location
0 1990 27 loc0
1 1991 21 loc1
2 1992 39 loc2
3 1993 22 loc3
4 1994 10 loc4
5 1995 12 loc5
6 1996 22 loc6
7 1997 22 loc7
8 1998 40 loc8
9 1999 27 loc9
10 2000 18 loc10
11 2001 20 loc11
12 2002 18 loc12
13 2003 14 loc13
14 2004 33 loc14
15 2005 10 loc15
16 2006 24 loc16
17 2015 17 loc17
def max_photos(df,start,integer):
df = df[df['Year'] >= start].copy()
bins = range(start, df.Year.max() + integer, integer)
labels = [f'{i}-{j-1}' for i, j in zip(bins[:-1], bins[1:])]
df['g'] = pd.cut(df['Year'],bins=bins,labels=labels,include_lowest=True,right=False)
return (df.set_index(['Year','Location'])
.groupby('g', observed=True)
.agg(Max_Year=('Number_of_photos','idxmax'),
Max_Num_Photos=('Number_of_photos','max'))
.reindex(df['g'].cat.categories.unique())
.rename_axis('Timeframe')
.reset_index()
.assign(Loc=lambda x: x.Max_Year.str[1],
Max_Year=lambda x: x.Max_Year.str[0] ))
df = max_photos(df, 1998, 4)
print (df)
Timeframe Max_Year Max_Num_Photos Loc
0 1998-2001 1998.0 40.0 loc8
1 2002-2005 2004.0 33.0 loc14
2 2006-2009 2006.0 24.0 loc16
3 2010-2013 NaN NaN NaN
4 2014-2017 2015.0 17.0 loc17
Upvotes: 2
Reputation: 53
You could create a new column that indicates to which "cohort" each year belongs to: df['cohort'] = (df['year'] - start_year) // interval
(where interval
is what you called int
).
Then df.groupby('cohort')
gives your groups by cohorts of interval
years. From there you can get the maximum and the index that corresponds to the maximum (which you can transform to a year once again).
Edit: @jezrael's answer, using cut
, sounds like it does the same as this, but more elegantly :)
Upvotes: -1
Reputation: 29742
One way using pandas.DataFrame.resample
:
def idxmax(x):
if len(x):
return x.idxmax().strftime("%Y")
else:
return np.nan
def max_photos(df, start, integer):
df["Year"] = pd.to_datetime(df["Year"], format="%Y")
df = df.set_index("Year")
df = df[f"{start}":].resample(f"{integer}AS").agg([idxmax, max])
df.columns = ["Max_Year", "Max_Num_Photos"]
return df
max_photos(df, 1998, 4)
Output:
Max_Year Max_Num_Photos
Year
1998-01-01 1998 40.0
2002-01-01 2004 33.0
2006-01-01 2006 24.0
2010-01-01 NaN NaN
2014-01-01 2015 17.0
Upvotes: 0
Reputation: 1003
Try this:
def split_df(df, year, interval):
new_df = pd.DataFrame()
for i in range(year, df.Year.max(), interval):
temp_df = df.loc[df['Year'] >=i & df['Year'] <= i+interval-1]
index_of_largest = df['Number_of_photos'].idxmax()
max = df.iloc[index_of_largest]
temp_df['TimeFrame'] = str(Year)+"-"+str(Year+interval-1)
new_df = pd.concat([new_df,temp_df])
return new_df
Upvotes: -1