Alex
Alex

Reputation: 159

How to calculate shift and rolling sum over missing dates without adding them to data frame in Pandas?

I have a data set with dates, customers and income:

    Date    CustomerIncome
0   1/1/2018    A   53
1   2/1/2018    A   36
2   3/1/2018    A   53
3   5/1/2018    A   89
4   6/1/2018    A   84
5   8/1/2018    A   84
6   9/1/2018    A   54
7   10/1/2018   A   19
8   11/1/2018   A   44
9   12/1/2018   A   80
10  1/1/2018    B   24
11  2/1/2018    B   100
12  9/1/2018    B   40
13  10/1/2018   B   47
14  12/1/2018   B   10
15  2/1/2019    B   5

For both customers there are missing dates as they purchased nothing at some months. I want to add per each customer what was the income of last month and also the rolling sum of income for the last year. Meaning, if there's a missing month, I'll see '0' at the shift(1) column of the following month that has income. And I'll see rolling sum of 12 months even if there weren't 12 observations.

This is the expected result:

  Date  CustomerIncome  S(1)R(12)
0   1/1/2018    A   53  0   53
1   2/1/2018    A   36  53  89
2   3/1/2018    A   53  36  142
3   5/1/2018    A   89  0   231
4   6/1/2018    A   84  89  315
5   8/1/2018    A   84  0   399
6   9/1/2018    A   54  84  453
7   10/1/2018   A   19  54  472
8   11/1/2018   A   44  19  516
9   12/1/2018   A   80  44  596
10  1/1/2018    B   24  0   24
11  2/1/2018    B   100 24  124
12  9/1/2018    B   40  0   164
13  10/1/2018   B   47  40  211
14  12/1/2018   B   10  0   221
15  2/1/2019    B   5   0   102

So far, I've added the rows with missing dates with stack and unstack, but with multiple dates and customers, it explodes the data to millions of rows, crashing kernel with most rows are 0's.

Upvotes: 2

Views: 1063

Answers (1)

David Erickson
David Erickson

Reputation: 16683

  1. You can use .shift but have logic that if the gap is > 31 days, then make (S1) = 0
  2. The rolling 12 calculation requires figuring out the "Rolling Date" and doing some complicated list comprehension to decide whether or not to return a value. Then, take a sum of each list per row.

df['Date'] = pd.to_datetime(df['Date']).dt.date
df['S(1)'] = df.groupby('Customer')['Income'].transform('shift').fillna(0)
s = (df['Date'] - df['Date'].shift())/np.timedelta64(1, '31D') <= 1
df['S(1)'] = df['S(1)'].where(s,0).astype(int)
df['Rolling Date'] = (df['Date'] - pd.Timedelta('1Y'))
df['R(12)'] = df.apply(lambda d: sum([z for x,y,z in 
                                      zip(df['Customer'], df['Date'], df['Income']) 
                                      if y > d['Rolling Date'] 
                                      if y <= d['Date'] 
                                      if x == d['Customer']]), axis=1)
df = df.drop('Rolling Date', axis=1)
df
Out[1]: 
          Date Customer  Income  S(1)  R(12)
0   2018-01-01        A      53     0     53
1   2018-02-01        A      36    53     89
2   2018-03-01        A      53    36    142
3   2018-05-01        A      89     0    231
4   2018-06-01        A      84    89    315
5   2018-08-01        A      84     0    399
6   2018-09-01        A      54    84    453
7   2018-10-01        A      19    54    472
8   2018-11-01        A      44    19    516
9   2018-12-01        A      80    44    596
10  2018-01-01        B      24     0     24
11  2018-02-01        B     100    24    124
12  2018-09-01        B      40     0    164
13  2018-10-01        B      47    40    211
14  2018-12-01        B      10     0    221
15  2019-02-01        B       5     0    102

Upvotes: 1

Related Questions