Reputation: 179
I have a pandas DataFrame (df) that is grouped by an index value and contains repeated rows.
I would like to add a new column holding a date for each row, so that the rows of each group cover every day from df.start_date to df.end_date:
From this:
index start_date end_date number_of_days value
0 8/12/2019 10/12/2019 3 2091
0 8/12/2019 10/12/2019 3 2091
0 8/12/2019 10/12/2019 3 2091
1 11/12/2019 14/12/2019 4 2214
1 11/12/2019 14/12/2019 4 2214
1 11/12/2019 14/12/2019 4 2214
1 11/12/2019 14/12/2019 4 2214
2 15/12/2019 16/12/2019 2 8430
2 15/12/2019 16/12/2019 2 8430
To this:
index start_date end_date number_of_days value NEW_COLUMN
0 8/12/2019 10/12/2019 3 2091 8/12/2019
0 8/12/2019 10/12/2019 3 2091 9/12/2019
0 8/12/2019 10/12/2019 3 2091 10/12/2019
1 11/12/2019 14/12/2019 4 2214 11/12/2019
1 11/12/2019 14/12/2019 4 2214 12/12/2019
1 11/12/2019 14/12/2019 4 2214 13/12/2019
1 11/12/2019 14/12/2019 4 2214 14/12/2019
2 15/12/2019 16/12/2019 2 8430 15/12/2019
2 15/12/2019 16/12/2019 2 8430 16/12/2019
So far I have been able to create a date range with pd.date_range and then tried to add that range to the dataframe as a new column:
# generate a daily date range based on min and max values
date_range_to_add = pd.date_range(start=min(df['start_date']),
                                  end=max(df['end_date']), freq='D')
repeated_dw_data['dates']= date_range_to_add
However, the error I get is that the length of values does not match the length of the index. This is expected, because the generated date range is shorter than the full list of rows.
I think I will have to write a function that generates a date range for each group, adds it to that group, and then appends each group to a new dataframe. But before I go down this road, is there a simpler route that I'm missing? What is the best way to iterate over each group and apply this logic?
Upvotes: 1
Views: 1430
Reputation: 862431
If the number of rows in each index group is the same as the value in the number_of_days column, one solution is to create day timedeltas from a per-group counter with GroupBy.cumcount and to_timedelta, and add them to the start_date column:
repeated_dw_data['start_date']=pd.to_datetime(repeated_dw_data['start_date'], dayfirst=True)
td = pd.to_timedelta(repeated_dw_data.groupby(level=0).cumcount(), unit='d')
repeated_dw_data['new'] = repeated_dw_data['start_date'] + td
print (repeated_dw_data)
start_date end_date number_of_days value new
index
0 2019-12-08 10/12/2019 3 2091 2019-12-08
0 2019-12-08 10/12/2019 3 2091 2019-12-09
0 2019-12-08 10/12/2019 3 2091 2019-12-10
1 2019-12-11 14/12/2019 4 2214 2019-12-11
1 2019-12-11 14/12/2019 4 2214 2019-12-12
1 2019-12-11 14/12/2019 4 2214 2019-12-13
1 2019-12-11 14/12/2019 4 2214 2019-12-14
2 2019-12-15 16/12/2019 2 8430 2019-12-15
2 2019-12-15 16/12/2019 2 8430 2019-12-16
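For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. The DataFrame is rebuilt by hand from the sample in the question, and the names df, offset, sizes_match and last_matches_end are only illustrative. It also assumes end_date should be parsed day-first like start_date, and it adds two sanity checks that are not part of the original answer: that each group has as many rows as number_of_days, and that the last generated date in each group equals end_date.

```python
import pandas as pd

# Sample data rebuilt from the question; the index is the group key.
df = pd.DataFrame(
    {
        "start_date": ["8/12/2019"] * 3 + ["11/12/2019"] * 4 + ["15/12/2019"] * 2,
        "end_date": ["10/12/2019"] * 3 + ["14/12/2019"] * 4 + ["16/12/2019"] * 2,
        "number_of_days": [3] * 3 + [4] * 4 + [2] * 2,
        "value": [2091] * 3 + [2214] * 4 + [8430] * 2,
    },
    index=pd.Index([0] * 3 + [1] * 4 + [2] * 2, name="index"),
)

# Parse the day-first strings into real datetimes (assumed for both columns).
for col in ("start_date", "end_date"):
    df[col] = pd.to_datetime(df[col], dayfirst=True)

# Counter 0, 1, 2, ... within each index group, converted to a day offset.
offset = pd.to_timedelta(df.groupby(level=0).cumcount(), unit="D")
df["new"] = df["start_date"] + offset

# Sanity checks (an addition, not part of the original answer):
# each group should have number_of_days rows and end exactly on end_date.
grouped_new = df.groupby(level=0)["new"]
sizes_match = bool((grouped_new.transform("count") == df["number_of_days"]).all())
last_matches_end = bool((grouped_new.transform("max") == df["end_date"]).all())

print(df)
```

If either check is False, the row counts do not line up with the date spans, and the cumcount approach would produce dates that stop short of, or run past, end_date.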
Upvotes: 1