Reputation: 2871
I have a data frame as shown below
Unit_ID Unit_Create_Year
1 2011
2 2011
3 2012
4 2014
5 2012
6 2015
7 2017
8 2017
9 2019
From the above data frame I would like to prepare below data frame
Expected Output:
Year Number_of_Unit_Since_Year List_of_Units
2011 2 [1,2]
2012 4 [1,2,3,5]
2013 4 [1,2,3,5]
2014 5 [1,2,3,5,4]
2015 6 [1,2,3,5,4,6]
2016 6 [1,2,3,5,4,6]
2017 8 [1,2,3,5,4,6,7,8]
2018 8 [1,2,3,5,4,6,7,8]
2019 9 [1,2,3,5,4,6,7,8,9]
if a unit is created in 2011 it should be counted for all next years.
Steps: In 2011 two unit created '1' and '2'. In 2012 two unit created '3' and '5'. So in in 2012 will have total 4 units including the units in 2011.
Upvotes: 0
Views: 129
Reputation: 494
you can try this:
df_new = df.groupby(['Unit_Create_Year']).agg({'Unit_ID':['count','unique']}).reset_index()
df_new.columns = ['Year','Number_of_Unit_Since_Year','List_of_Units']
df_new['Number_of_Unit_Since_Year'] = df_new['Number_of_Unit_Since_Year'].cumsum()
df_new['List_of_Units'] = df_new['List_of_Units'].apply(lambda x : x.tolist()).cumsum()
df_new
Year Number_of_Unit_Since_Year List_of_Units
0 2011 2 [1, 2]
1 2012 4 [1, 2, 3, 5]
2 2014 5 [1, 2, 3, 5, 4]
3 2015 6 [1, 2, 3, 5, 4, 6]
4 2017 9 [1, 2, 3, 5, 4, 6, 7, 8, 9]
Upvotes: 2
Reputation: 13387
This should do the trick:
df2=pd.DataFrame(index=list(range(2011,2020)), columns=["Number_of_units_since_year"], data=[np.nan]*(2020-2011))
df=df.sort_values("Unit_Create_Year").set_index("Unit_Create_Year").expanding().count().reset_index().groupby("Unit_Create_Year").max()
df2.loc[df.index.values]=df
df2=df2.ffill().astype(int).reset_index().rename(columns={"index": "Year"})
Output:
Year Number_of_units_since_year
0 2011 2
1 2012 4
2 2013 4
3 2014 5
4 2015 6
5 2016 6
6 2017 9
7 2018 9
8 2019 9
Upvotes: 1
Reputation: 253
df = pd.DataFrame({
'unit_id' : [1, 2, 3, 4, 5, 6, 7, 8, 9],
'activity_gur' : [2011,2011,2012,2014,2012,2015,2017,2017,2017]})
def fill_number_of_unit_since_year(year):
return df[df['activity_gur'] == year]['unit_id'].nunique()
def fill_list_of_units(year):
return df[df['activity_gur'] <= year]['unit_id'].unique()
final_df = pd.DataFrame({'year' : df['activity_gur'].unique()})
final_df['number_of_unit_since_year'] = final_df['year'].apply(fill_number_of_unit_since_year)
final_df['number_of_unit_since_year'] = final_df['number_of_unit_since_year'].cumsum()
final_df['list_of_units'] = final_df['year'].apply(fill_list_of_units)
final_df
Upvotes: 2