Ron
Ron

Reputation: 379

Summing a years worth of data that spans two years pandas

I have a DataFrame that contains data similar to this:

Name    Date        A   B   C
John    19/04/2018  10  11  8
John    20/04/2018  9   7   9
John    21/04/2018  22  15  22
…   …       …   …   …
John    16/04/2019  8   8   9
John    17/04/2019  10  11  18
John    18/04/2019  8   9   11
Rich    19/04/2018  18  7   6
…   …       …   …   …
Rich    18/04/2019  19  11  17

The data can start on any day and contains at least 365 days of data, sometimes more. What I want to end up with is a DataFrame like this:

Name    Date    Sum
John    April   356
John    May     276
John    June    209
Rich    April   452

I need to sum up all of the months to get a year’s worth of data (April - March) but I need to be able to handle taking part of April’s total (in this example) from 2018 and part from 2019. What I would also like to do is shift the days so they are consecutive and follow on in sequence so rather than:

John    16/04/2019  8   8   9   Tuesday
John    17/04/2019  10  11  18  Wednesday
John    18/04/2019  8   9   11  Thursday
John    19/04/2019  10  11  8   Thursday (was 19/04/2018)
John    20/04/2019  9   7   9   Friday (was 20/04/2018)

It becomes

John    16/04/2019  8   8   9   Tuesday
John    17/04/2019  10  11  18  Wednesday
John    18/04/2019  8   9   11  Thursday
John    19/04/2019  9   7   9   Friday (was 20/04/2018)

Prior to summing to get the final DataFrame. Is this possible?

Additional information requested in comments

Here is a link to the initial data set https://github.com/stottp/exampledata/blob/master/SOExample.csv and the required output would be:

Name Month Total 
John March  11634 
John April  11470 
John May    11757 
John June   10968 
John July   11682 
John August 11631 
John September 11085 
John October 11924 
John November 11593 
John December 11714 
John January 11320 
John February 10167 
Rich March 11594 
Rich April 12383 
Rich May 12506 
Rich June 11112 
Rich July 11636 
Rich August 11303 
Rich September 10667 
Rich October 10992 
Rich November 11721 
Rich December 11627 
Rich January 11669 
Rich February 10335

Upvotes: 0

Views: 76

Answers (1)

ottobricks
ottobricks

Reputation: 343

Let's see if I understood correctly. If you want to sum, I suppose you mean sum the values of columns ['A', 'B', 'C'] for each day and get the total value monthly.

If that's right, the first thing to to is set the ['Date'] column as the index so that the data frame is easier to work with:

df.set_index(df['Date'], inplace=True, drop=True)
del df['Date']

Next, you will want to add the new column ['Sum'] by re-sampling your data frame (from days to months) whilst summing the values of ['A', 'B', 'C']:

df['Sum'] = df['A'].resample('M').sum() + df['B'].resample('M').sum() + df['C'].resample('M').sum()
df['Sum'].head()

Out[37]: 
Date
2012-11-30    1956265
2012-12-31    2972076
2013-01-31    2972565
2013-02-28    2696121
2013-03-31    2970687
Freq: M, dtype: int64

The last part about squashing February of 2018 and 2019 together as if they were a single month might yield from:

df['2019-02'].merge(df['2018-02'], how='outer', on=['Date', 'A', 'B', 'C'])

Test this last step and see if it works for you. Cheers

Upvotes: 1

Related Questions