Reputation: 513
I'm trying to calculate the sum of quantity for every day for each combinaton of Profile-GeographicalZone-Town with the following sample df:
df = pd.DataFrame({
'Profile': {0: 'P014', 1: 'P014', 2: 'P012', 3: 'P012', 4: 'P012', 5: 'P012', 6: 'P012', 7: 'P012', 8: 'P012', 9: 'P012'},
'GeogaphicalZone': {0: 'NORTH', 1: 'NORTH', 2: 'NORTH', 3: 'SOUTH', 4: 'SOUTH', 5: 'SOUTH', 6: 'NORTH', 7: 'NORTH', 8: 'NORTH', 9: 'NORTH'},
'Town': {0: 'LONDON', 1: 'LONDON', 2: 'MANCHESTER', 3: 'MANCHESTER', 4: 'MANCHESTER', 5: 'MANCHESTER', 6: 'LIVERPOOL', 7: 'LIVERPOOL', 8: 'LIVERPOOL', 9: 'LONDON'},
'Quantity': {0: 8.202, 1: 8.202, 2: 8.202, 3: 60.645, 4: 60.645, 5: 60.645, 6: 90.925, 7: 162.373, 8: 45.095, 9: 78.832},
'StartDate': {0: '01/02/2019', 1: '01/01/2019', 2: '01/12/2018', 3: '01/11/2018', 4: '01/10/2018', 5: '01/09/2018', 6: '01/08/2018', 7: '01/07/2018', 8: '01/06/2018', 9: '01/05/2018'},
'EndDate': {0: '01/04/2020', 1: '01/05/2020', 2: '01/06/2020', 3: '01/07/2020', 4: '01/08/2020', 5: '01/09/2020', 6: '01/10/2020', 7: '01/11/2020', 8: '01/12/2020', 9: '01/01/2021'}
}
The Quantity is assumed to be the same during every day between Start and End Date
Now my desired output is to have the sum of Quantity for every Profile-GeographicalZone-Town between the min(StartDate) and the max(EndDate) for each combination.
for instance, for the combination P014-NORTH-LONDON, if I only show the days of Jan/Feb 2019, I expect to have something like this:
I think I should use a pivot table with pandas but I'm not sure how I should do with my Start/EndDate calculation.
I could do a first operation which would create a separate df with the quantity duplicated for all days between Start/EndDate and then apply a pivot table but I don't think this is very pythonic nor efficient. I feel there is something much suitable with pandas.
Is that possible ?
Thanks
Upvotes: 1
Views: 1343
Reputation: 93181
Exploding it into daily make a very long data frame but here's how you do it:
df = pd.DataFrame({
'Profile': {0: 'P014', 1: 'P014', 2: 'P012', 3: 'P012', 4: 'P012', 5: 'P012', 6: 'P012', 7: 'P012', 8: 'P012', 9: 'P012'},
'GeogaphicalZone': {0: 'NORTH', 1: 'NORTH', 2: 'NORTH', 3: 'SOUTH', 4: 'SOUTH', 5: 'SOUTH', 6: 'NORTH', 7: 'NORTH', 8: 'NORTH', 9: 'NORTH'},
'Town': {0: 'LONDON', 1: 'LONDON', 2: 'MANCHESTER', 3: 'MANCHESTER', 4: 'MANCHESTER', 5: 'MANCHESTER', 6: 'LIVERPOOL', 7: 'LIVERPOOL', 8: 'LIVERPOOL', 9: 'LONDON'},
'Quantity': {0: 8.202, 1: 8.202, 2: 8.202, 3: 60.645, 4: 60.645, 5: 60.645, 6: 90.925, 7: 162.373, 8: 45.095, 9: 78.832},
'StartDate': {0: '01/02/2019', 1: '01/01/2019', 2: '01/12/2018', 3: '01/11/2018', 4: '01/10/2018', 5: '01/09/2018', 6: '01/08/2018', 7: '01/07/2018', 8: '01/06/2018', 9: '01/05/2018'},
'EndDate': {0: '01/04/2020', 1: '01/05/2020', 2: '01/06/2020', 3: '01/07/2020', 4: '01/08/2020', 5: '01/09/2020', 6: '01/10/2020', 7: '01/11/2020', 8: '01/12/2020', 9: '01/01/2021'}
})
df['StartDate'] = pd.to_datetime(df['StartDate'])
df['EndDate'] = pd.to_datetime(df['EndDate'])
dates = df.apply(lambda row: pd.date_range(row['StartDate'], row['EndDate']).to_series(), axis=1) \
.stack() \
.droplevel(-1)
dates.name = 'Date'
df = df.join(dates)
Upvotes: 1