Reputation: 39

Create a function that will result in a dataframe, using time intervals

Let's say I have a df, memories

Year	Number_of_photos
1990	27
1991	21
1992	39
1993	22
1994	10
1995	12
1996	22
1997	22
1998	40
1999	27
2000	18
2001	20
2002	18
2003	14
2004	33
2005	10
2006	24
2015	17

And I want to create a function that will spit out a data frame, where the user will input the start year and an integer, and that integer will determine the timeframes (i.e. if start year=1998 and int=4, then the timeframes will be 1998-2001, 2002-2005, 2006-2009,2010-2013, and 2014-2017). And within those timeframes, the function will generate the max Number_of_photos (i.e. 1998-2001 will generate 40, 2002-2005 will generate 33, 2006-2009 will generate 24, 2010-2013 will generate 'None', and lastly 2014-2017 will generate 17).

Any ideas on how I to define a function that will generate a data frame?

[In]  def max_photos(df,start,integer):



[Out] |Timeframe|Max_Year|Max_Num_Photos| 
      |---------|--------|--------------|
      |1998-2001|  1998  |      40      | 
      |2002-2005|  2004  |      33      | 
      |2006-2009|  2006  |      24      | 
      |2010-2013|  None  |     None     | 
      |2014-2017|  2015  |      17      |

Upvotes: 2

Answers (4)

jezrael

Reputation: 863301

Use cut with aggregate max and Series.idxmax, last for add missing categories add DataFrame.reindex:

def max_photos(df,start,integer):
    df = df[df['Year'] >= start].copy()
    bins = range(start, df.Year.max() + integer, integer)
    labels = [f'{i}-{j-1}' for i, j in zip(bins[:-1], bins[1:])]
    df['g'] = pd.cut(df['Year'],bins=bins,labels=labels,include_lowest=True,right=False) 


    return (df.set_index('Year')
              .groupby('g', observed=True)
              .agg(Max_Year=('Number_of_photos','idxmax'),
                   Max_Num_Photos=('Number_of_photos','max'))
              .reindex(df['g'].cat.categories.unique())
              .rename_axis('Timeframe')
              .reset_index())

print (max_photos(df, 1998, 4))
   Timeframe  Max_Year  Max_Num_Photos
0  1998-2001    1998.0            40.0
1  2002-2005    2004.0            33.0
2  2006-2009    2006.0            24.0
3  2010-2013       NaN             NaN
4  2014-2017    2015.0            17.0

EDIT: Here is solution for append new column Location per maximal Max_Num_Photos - first added to index in DataFrame.set_index and last create new column from tuples by DataFrame.assign:

#sample data
df['Location'] = 'loc' + df.index.astype(str)
print (df)
    Year  Number_of_photos Location
0   1990                27     loc0
1   1991                21     loc1
2   1992                39     loc2
3   1993                22     loc3
4   1994                10     loc4
5   1995                12     loc5
6   1996                22     loc6
7   1997                22     loc7
8   1998                40     loc8
9   1999                27     loc9
10  2000                18    loc10
11  2001                20    loc11
12  2002                18    loc12
13  2003                14    loc13
14  2004                33    loc14
15  2005                10    loc15
16  2006                24    loc16
17  2015                17    loc17

def max_photos(df,start,integer):
    df = df[df['Year'] >= start].copy()
    bins = range(start, df.Year.max() + integer, integer)
    labels = [f'{i}-{j-1}' for i, j in zip(bins[:-1], bins[1:])]
    df['g'] = pd.cut(df['Year'],bins=bins,labels=labels,include_lowest=True,right=False) 


    return (df.set_index(['Year','Location'])
              .groupby('g', observed=True)
              .agg(Max_Year=('Number_of_photos','idxmax'),
                   Max_Num_Photos=('Number_of_photos','max'))
               .reindex(df['g'].cat.categories.unique())
              .rename_axis('Timeframe')
              .reset_index()
              .assign(Loc=lambda x: x.Max_Year.str[1],
                      Max_Year=lambda x: x.Max_Year.str[0] ))

df = max_photos(df, 1998, 4)
print (df)
   Timeframe  Max_Year  Max_Num_Photos    Loc
0  1998-2001    1998.0            40.0   loc8
1  2002-2005    2004.0            33.0  loc14
2  2006-2009    2006.0            24.0  loc16
3  2010-2013       NaN             NaN    NaN
4  2014-2017    2015.0            17.0  loc17

Upvotes: 2

operte

Reputation: 53

You could create a new column that indicates to which "cohort" each year belongs to: df['cohort'] = (df['year'] - start_year) // interval (where interval is what you called int). Then df.groupby('cohort') gives your groups by cohorts of interval years. From there you can get the maximum and the index that corresponds to the maximum (which you can transform to a year once again).

Edit: @jezrael's answer, using cut, sounds like it does the same as this, but more elegantly :)

Upvotes: -1

Chris

Reputation: 29742

One way using pandas.DataFrame.resample:

def idxmax(x):
    if len(x):
        return x.idxmax().strftime("%Y")
    else:
        return np.nan

def max_photos(df, start, integer):
    df["Year"] = pd.to_datetime(df["Year"], format="%Y")
    df = df.set_index("Year")
    df = df[f"{start}":].resample(f"{integer}AS").agg([idxmax, max])
    df.columns = ["Max_Year", "Max_Num_Photos"]
    return df

max_photos(df, 1998, 4)

Output:

           Max_Year  Max_Num_Photos
Year                               
1998-01-01     1998            40.0
2002-01-01     2004            33.0
2006-01-01     2006            24.0
2010-01-01      NaN             NaN
2014-01-01     2015            17.0

Upvotes: 0

Ofek Glick

Reputation: 1003

Try this:

def split_df(df, year, interval):
   new_df = pd.DataFrame()
   for i in range(year, df.Year.max(), interval):
       temp_df = df.loc[df['Year'] >=i & df['Year'] <= i+interval-1]
       index_of_largest = df['Number_of_photos'].idxmax()
       max = df.iloc[index_of_largest]
       temp_df['TimeFrame'] = str(Year)+"-"+str(Year+interval-1)
       new_df = pd.concat([new_df,temp_df])
   return new_df

Upvotes: -1

Create a function that will result in a dataframe, using time intervals

Answers (4)

Related Questions