MMM
MMM

Reputation: 111

How to ungroup a sum on a time series forward

I need to check if a participant had at least one token per day during a period of time, the data comes grouped by day and sum and every day a participant consumes one token.

I need to find if even when taking into account stacked tokens (sum) the participant had tokens for all the period min(date) to max(date) + sum.

A participant can only fill the gap (meaning he has no tokens for that day) if a token is missing if he stacked in the past, e.g if I had two tokens yesterday I was covered for yesterday and today.

If I have a dataframe like this

date sum
2021-01-01 2
2021-01-02 4
2021-01-05 1

How can I unstack the dates to represent the range of date + sum days, and create as many rows as possible extending from first date to last date? I start with:

|    date   | sum |
|  -------- | --- |
| 2021-01-01| 1   | #this comes from the first row
| 2021-01-02| 2   | #this comes from the second and first row (overlap)
| 2021-01-03| 1   | #this comes from the second row
| 2021-01-04| 1   | #this comes from the second row
| 2021-01-05| 2   | #this comes from the second row and third row(overlap)

Now because I have still two stacked with sum > 1 I keep moving forward to fill 0 (if any) or extend the series(this is the case):

|    date   | sum |
|  -------- | --- |
| 2021-01-01| 1   | 
| 2021-01-02| 1   | 
| 2021-01-03| 1   | 
| 2021-01-04| 1   | 
| 2021-01-05| 1   | 
| 2021-01-06| 1   |  #this is added from the previously stacked on 01-02.
| 2021-01-07| 1   |  #this is added from previously stacked on 01-05.

Note that it can be that there are not enough stacked dates on sum as per fill all days, then it has to be 0 or a NA:

date sum
2021-01-01 3
2021-01-06 1

it becomes

|    date   | sum |
|  -------- | --- |
| 2021-01-01| 1   | #This comes from the first row
| 2021-01-02| 1   | #This comes from the first row 
| 2021-01-03| 1   | #This comes from the first row
| 2021-01-04| 0   | #There is not enough sum to fill until here (3 stacked)
| 2021-01-05| 0   | #There is not enough sum to fill until here (3 stacked)
| 2021-01-06| 1   | #This comes from the second row.

I have tried doing something like pd.date_range(date,date+sum) and explode, but not sure.

@Scott Your solution will not work in this case

|    | date                |   sum |
|---:|:--------------------|------:|
|  0 | 2021-01-01 00:00:00 |     2 |
|  1 | 2021-01-02 00:00:00 |     2 |
|  2 | 2021-01-03 00:00:00 |     2 |
|  3 | 2021-01-06 00:00:00 |     1 |
|  4 | 2021-01-09 00:00:00 |     2 |
|  5 | 2021-01-10 00:00:00 |     1 |

The gap from 4-5 should be filled by the exceeding sum from the 1th to the 3rd, which adds up to 3. After taking out the 4th and 5th tokens we still have one left which should fill the 7th. For the gap of the 8th we do not have any past token left. The expected output should be all the comment rows with # this should be filled to be ones, but it doesn't work.

 | date                |   sum |   left_after_consume |
 |:--------------------|------:|---------------------:|
 | 2021-01-01 00:00:00 |     2 |                    1 |
 | 2021-01-02 00:00:00 |     2 |                    1 |
 | 2021-01-03 00:00:00 |     2 |                    1 |
 | 2021-01-04 00:00:00 |   nan |                  nan | #this can be filled
 | 2021-01-05 00:00:00 |   nan |                  nan | #this can be filled
 | 2021-01-06 00:00:00 |     1 |                    0 |
 | 2021-01-07 00:00:00 |   nan |                  nan |  #this can be filled
 | 2021-01-08 00:00:00 |   nan |                  nan | #the three stacked tokens are out
 | 2021-01-09 00:00:00 |     2 |                    1 | # this will carry and create a new day at the end.
 | 2021-01-10 00:00:00 |     1 |                    0 |

This is the dataset used:

data = pd.DataFrame({'date':['2021-01-01','2021-01-02','2022-01-03','2021-01-06','2021-01-09','2021-01-10'],'sum':[2,2,2,1,2,1]})

Another example breaking is using the Problem 1 but adding a few days at the end (from 5th to 10th):

df = pd.DataFrame({'date':['2021-01-01', '2021-01-02', '2021-01-10'], 
              'sum':[2,4,1]})

df['date'] = pd.to_datetime(df['date'])    
df['date'] = [pd.date_range(s, periods=n) for s, n in zip(df['date'], df['sum'])]    
df_out = df.explode('date').groupby('date').count()

n = df_out['sum'].sum() - len(df_out)
df_out.reindex(df_out.index.union(pd.date_range(df_out.index[-1], periods=n+1)), fill_value=1)\
      .assign(sum=1)\
      .resample('D')\
      .asfreq()\
      .fillna(0)

THe person doesnt use the first row supplies until the 2nd day and then has 4 days to go 3,4,5,6 and from there no more refills until the 10th, which is burned the same day so it would be 7 days in total with tokens but it doesnt add the toke on the 6th and add its the 11th:

'|                     |   sum |
 |:--------------------|------:|
 | 2021-01-01 00:00:00 |     1 |
 | 2021-01-02 00:00:00 |     1 |
 | 2021-01-03 00:00:00 |     1 |
 | 2021-01-04 00:00:00 |     1 |
 | 2021-01-05 00:00:00 |     1 |
 | 2021-01-06 00:00:00 |     0 |
 | 2021-01-07 00:00:00 |     0 |
 | 2021-01-08 00:00:00 |     0 |
 | 2021-01-09 00:00:00 |     0 |
 | 2021-01-10 00:00:00 |     1 |
 | 2021-01-11 00:00:00 |     1 |'

Upvotes: 2

Views: 71

Answers (1)

Scott Boston
Scott Boston

Reputation: 153500

Let's try this:

Problem 1:

df = pd.DataFrame({'date':['2021-01-01', '2021-01-02', '2021-01-05'], 
                  'sum':[2,4,1]})

df['date'] = pd.to_datetime(df['date'])    
df['date'] = [pd.date_range(s, periods=n) for s, n in zip(df['date'], df['sum'])]    
df_out = df.explode('date').groupby('date').count()

n = df_out['sum'].sum() - len(df_out)
df_out.reindex(df_out.index.union(pd.date_range(df_out.index[-1], periods=n+1)), fill_value=1)\
      .assign(sum=1)\
      .resample('D')\
      .asfreq()\
      .fillna(0)

Output:

            sum
2021-01-01    1
2021-01-02    1
2021-01-03    1
2021-01-04    1
2021-01-05    1
2021-01-06    1
2021-01-07    1

Problem 2:

df1 = pd.DataFrame({'date':['2021-01-01', '2021-01-06'], 
                   'sum':[3,1]})

df1['date'] = pd.to_datetime(df1['date'])
df1['date'] = [pd.date_range(s, periods=n) for s, n in zip(df1['date'], df1['sum'])]
df1_out = df1.explode('date').groupby('date').count()


n = df1_out['sum'].sum() - len(df1_out)
df1_out.reindex(df1_out.index.union(pd.date_range(df1_out.index[-1], periods=n+1)), fill_value=1)\
       .assign(sum=1)\
       .resample('D')\
       .asfreq()\
       .fillna(0)

Output:

            sum
2021-01-01  1.0
2021-01-02  1.0
2021-01-03  1.0
2021-01-04  0.0
2021-01-05  0.0
2021-01-06  1.0

Upvotes: 1

Related Questions