BeginnersLuck
BeginnersLuck

Reputation: 39

Create a function that will result in a dataframe, using time intervals

Let's say I have a df, memories

Year Number_of_photos
1990 27
1991 21
1992 39
1993 22
1994 10
1995 12
1996 22
1997 22
1998 40
1999 27
2000 18
2001 20
2002 18
2003 14
2004 33
2005 10
2006 24
2015 17

And I want to create a function that will spit out a data frame, where the user will input the start year and an integer, and that integer will determine the timeframes (i.e. if start year=1998 and int=4, then the timeframes will be 1998-2001, 2002-2005, 2006-2009,2010-2013, and 2014-2017). And within those timeframes, the function will generate the max Number_of_photos (i.e. 1998-2001 will generate 40, 2002-2005 will generate 33, 2006-2009 will generate 24, 2010-2013 will generate 'None', and lastly 2014-2017 will generate 17).

Any ideas on how I to define a function that will generate a data frame?

[In]  def max_photos(df,start,integer):



[Out] |Timeframe|Max_Year|Max_Num_Photos| 
      |---------|--------|--------------|
      |1998-2001|  1998  |      40      | 
      |2002-2005|  2004  |      33      | 
      |2006-2009|  2006  |      24      | 
      |2010-2013|  None  |     None     | 
      |2014-2017|  2015  |      17      | 

Upvotes: 2

Views: 492

Answers (4)

jezrael
jezrael

Reputation: 863301

Use cut with aggregate max and Series.idxmax, last for add missing categories add DataFrame.reindex:

def max_photos(df,start,integer):
    df = df[df['Year'] >= start].copy()
    bins = range(start, df.Year.max() + integer, integer)
    labels = [f'{i}-{j-1}' for i, j in zip(bins[:-1], bins[1:])]
    df['g'] = pd.cut(df['Year'],bins=bins,labels=labels,include_lowest=True,right=False) 


    return (df.set_index('Year')
              .groupby('g', observed=True)
              .agg(Max_Year=('Number_of_photos','idxmax'),
                   Max_Num_Photos=('Number_of_photos','max'))
              .reindex(df['g'].cat.categories.unique())
              .rename_axis('Timeframe')
              .reset_index())

print (max_photos(df, 1998, 4))
   Timeframe  Max_Year  Max_Num_Photos
0  1998-2001    1998.0            40.0
1  2002-2005    2004.0            33.0
2  2006-2009    2006.0            24.0
3  2010-2013       NaN             NaN
4  2014-2017    2015.0            17.0

EDIT: Here is solution for append new column Location per maximal Max_Num_Photos - first added to index in DataFrame.set_index and last create new column from tuples by DataFrame.assign:

#sample data
df['Location'] = 'loc' + df.index.astype(str)
print (df)
    Year  Number_of_photos Location
0   1990                27     loc0
1   1991                21     loc1
2   1992                39     loc2
3   1993                22     loc3
4   1994                10     loc4
5   1995                12     loc5
6   1996                22     loc6
7   1997                22     loc7
8   1998                40     loc8
9   1999                27     loc9
10  2000                18    loc10
11  2001                20    loc11
12  2002                18    loc12
13  2003                14    loc13
14  2004                33    loc14
15  2005                10    loc15
16  2006                24    loc16
17  2015                17    loc17

def max_photos(df,start,integer):
    df = df[df['Year'] >= start].copy()
    bins = range(start, df.Year.max() + integer, integer)
    labels = [f'{i}-{j-1}' for i, j in zip(bins[:-1], bins[1:])]
    df['g'] = pd.cut(df['Year'],bins=bins,labels=labels,include_lowest=True,right=False) 


    return (df.set_index(['Year','Location'])
              .groupby('g', observed=True)
              .agg(Max_Year=('Number_of_photos','idxmax'),
                   Max_Num_Photos=('Number_of_photos','max'))
               .reindex(df['g'].cat.categories.unique())
              .rename_axis('Timeframe')
              .reset_index()
              .assign(Loc=lambda x: x.Max_Year.str[1],
                      Max_Year=lambda x: x.Max_Year.str[0] ))

df = max_photos(df, 1998, 4)
print (df)
   Timeframe  Max_Year  Max_Num_Photos    Loc
0  1998-2001    1998.0            40.0   loc8
1  2002-2005    2004.0            33.0  loc14
2  2006-2009    2006.0            24.0  loc16
3  2010-2013       NaN             NaN    NaN
4  2014-2017    2015.0            17.0  loc17

Upvotes: 2

operte
operte

Reputation: 53

You could create a new column that indicates to which "cohort" each year belongs to: df['cohort'] = (df['year'] - start_year) // interval (where interval is what you called int). Then df.groupby('cohort') gives your groups by cohorts of interval years. From there you can get the maximum and the index that corresponds to the maximum (which you can transform to a year once again).

Edit: @jezrael's answer, using cut, sounds like it does the same as this, but more elegantly :)

Upvotes: -1

Chris
Chris

Reputation: 29742

One way using pandas.DataFrame.resample:

def idxmax(x):
    if len(x):
        return x.idxmax().strftime("%Y")
    else:
        return np.nan

def max_photos(df, start, integer):
    df["Year"] = pd.to_datetime(df["Year"], format="%Y")
    df = df.set_index("Year")
    df = df[f"{start}":].resample(f"{integer}AS").agg([idxmax, max])
    df.columns = ["Max_Year", "Max_Num_Photos"]
    return df

max_photos(df, 1998, 4)

Output:

           Max_Year  Max_Num_Photos
Year                               
1998-01-01     1998            40.0
2002-01-01     2004            33.0
2006-01-01     2006            24.0
2010-01-01      NaN             NaN
2014-01-01     2015            17.0

Upvotes: 0

Ofek Glick
Ofek Glick

Reputation: 1003

Try this:

def split_df(df, year, interval):
   new_df = pd.DataFrame()
   for i in range(year, df.Year.max(), interval):
       temp_df = df.loc[df['Year'] >=i & df['Year'] <= i+interval-1]
       index_of_largest = df['Number_of_photos'].idxmax()
       max = df.iloc[index_of_largest]
       temp_df['TimeFrame'] = str(Year)+"-"+str(Year+interval-1)
       new_df = pd.concat([new_df,temp_df])
   return new_df

Upvotes: -1

Related Questions