Danish
Danish

Reputation: 2871

Cumulative count based on year column in pandas

I have a data frame as shown below

Unit_ID             Unit_Create_Year
1                   2011
2                   2011
3                   2012
4                   2014
5                   2012
6                   2015
7                   2017
8                   2017
9                   2019

From the above data frame I would like to prepare below data frame

Expected Output:

Year         Number_of_Unit_Since_Year       List_of_Units
2011         2                               [1,2]
2012         4                               [1,2,3,5]
2013         4                               [1,2,3,5]
2014         5                               [1,2,3,5,4]
2015         6                               [1,2,3,5,4,6]
2016         6                               [1,2,3,5,4,6]
2017         8                               [1,2,3,5,4,6,7,8]
2018         8                               [1,2,3,5,4,6,7,8]
2019         9                               [1,2,3,5,4,6,7,8,9]

if a unit is created in 2011 it should be counted for all next years.

Steps: In 2011 two unit created '1' and '2'. In 2012 two unit created '3' and '5'. So in in 2012 will have total 4 units including the units in 2011.

Upvotes: 0

Views: 129

Answers (3)

yganalyst
yganalyst

Reputation: 494

you can try this:

df_new = df.groupby(['Unit_Create_Year']).agg({'Unit_ID':['count','unique']}).reset_index()
df_new.columns = ['Year','Number_of_Unit_Since_Year','List_of_Units']
df_new['Number_of_Unit_Since_Year'] = df_new['Number_of_Unit_Since_Year'].cumsum()
df_new['List_of_Units'] = df_new['List_of_Units'].apply(lambda x : x.tolist()).cumsum()

df_new


   Year  Number_of_Unit_Since_Year                List_of_Units
0  2011                          2                       [1, 2]
1  2012                          4                 [1, 2, 3, 5]
2  2014                          5              [1, 2, 3, 5, 4]
3  2015                          6           [1, 2, 3, 5, 4, 6]
4  2017                          9  [1, 2, 3, 5, 4, 6, 7, 8, 9]

Upvotes: 2

Georgina Skibinski
Georgina Skibinski

Reputation: 13387

This should do the trick:

df2=pd.DataFrame(index=list(range(2011,2020)), columns=["Number_of_units_since_year"], data=[np.nan]*(2020-2011))

df=df.sort_values("Unit_Create_Year").set_index("Unit_Create_Year").expanding().count().reset_index().groupby("Unit_Create_Year").max()

df2.loc[df.index.values]=df

df2=df2.ffill().astype(int).reset_index().rename(columns={"index": "Year"})

Output:

   Year  Number_of_units_since_year
0  2011                           2
1  2012                           4
2  2013                           4
3  2014                           5
4  2015                           6
5  2016                           6
6  2017                           9
7  2018                           9
8  2019                           9

Upvotes: 1

Ivan  Dembicki
Ivan Dembicki

Reputation: 253

df = pd.DataFrame({
    'unit_id' : [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'activity_gur' : [2011,2011,2012,2014,2012,2015,2017,2017,2017]})

def fill_number_of_unit_since_year(year):
    return df[df['activity_gur'] == year]['unit_id'].nunique()

def fill_list_of_units(year):
    return df[df['activity_gur'] <= year]['unit_id'].unique()

final_df = pd.DataFrame({'year' : df['activity_gur'].unique()})
final_df['number_of_unit_since_year'] = final_df['year'].apply(fill_number_of_unit_since_year)
final_df['number_of_unit_since_year'] = final_df['number_of_unit_since_year'].cumsum()
final_df['list_of_units'] = final_df['year'].apply(fill_list_of_units)
final_df

enter image description here

Upvotes: 2

Related Questions