Tingiskhan
Tingiskhan

Reputation: 1345

Complicated transformation pandas

Assume that I have the following data set

table = [[datetime.datetime(2015, 1, 1), 1, 0.5],
         [datetime.datetime(2015, 1, 27), 1, 0.5],
         [datetime.datetime(2015, 1, 31), 1, 0.5],
         [datetime.datetime(2015, 2, 1), 1, 2],
         [datetime.datetime(2015, 2, 3), 1, 2],
         [datetime.datetime(2015, 2, 15), 1, 2], 
         [datetime.datetime(2015, 2, 28), 1, 2],
         [datetime.datetime(2015, 3, 1), 1, 3],
         [datetime.datetime(2015, 3, 17), 1, 3],
         [datetime.datetime(2015, 3, 31), 1, 3]]

df = pd.DataFrame(table, columns=['Date', 'Id', 'Value'])

Now, I'd like to find the last value of each month, move it monthwise to the values of the next month, and finally take the cumulative product of these values. Doing this procedure for the above data should result in (performing each step):

  1. Finding the last entry of each month and moving them monthwise would result in

                Date       Id   Value  Temp
             0 2015-01-01   1    0.5    NaN
             1 2015-01-27   1    0.5    NaN
             2 2015-01-31   1    0.5    NaN
             3 2015-02-01   1    2.0    0.5
             4 2015-02-03   1    2.0    0.5
             5 2015-02-15   1    2.0    0.5
             6 2015-02-28   1    2.0    0.5
             7 2015-03-01   1    3.0    2.0
             8 2015-03-17   1    3.0    2.0
             9 2015-03-31   1    3.0    2.0
    
  2. Filling the NaN's with 1, taking the cumulative product, and dropping temp would result in

                 Date      Id   Value  Result
             0 2015-01-01   1    0.5     1
             1 2015-01-27   1    0.5     1
             2 2015-01-31   1    0.5     1
             3 2015-02-01   1    2.0    0.5
             4 2015-02-03   1    2.0    0.5
             5 2015-02-15   1    2.0    0.5
             6 2015-02-28   1    2.0    0.5
             7 2015-03-01   1    3.0    1.0
             8 2015-03-17   1    3.0    1.0
             9 2015-03-31   1    3.0    1.0
    

I hope that this is clear enough. And if anyone wonders why on earth I'd like to do this is because I've got MTD-data, and it needs to be resampled. Thanks, Tingis.

edit The number of entries per month are "random", as in they can either be as long as the month or shorter (business data...)

Upvotes: 2

Views: 54

Answers (1)

Jianxun Li
Jianxun Li

Reputation: 24742

The following code does not assume that you have only two rows for each month. The idea is that do the group-wise calculation first, and then populate some NaN using .reindex() and fill those NaNs using backward fill as we've got the value for the very last entry of each month.

# your data
# ==================================
import pandas as pd
import datetime

table = [[datetime.datetime(2015, 1, 1), 1, 0.5],
         [datetime.datetime(2015, 1, 31), 1, 0.5],
         [datetime.datetime(2015, 2, 1), 1, 2], 
         [datetime.datetime(2015, 2, 28), 1, 2],
         [datetime.datetime(2015, 3, 1), 1, 3],
         [datetime.datetime(2015, 3, 31), 1, 3]]

df = pd.DataFrame(table, columns=['Date', 'Id', 'Value'])

# better to set Date column to index
df = df.set_index('Date')
print(df)

            Id  Value
Date                 
2015-01-01   1    0.5
2015-01-31   1    0.5
2015-02-01   1    2.0
2015-02-28   1    2.0
2015-03-01   1    3.0
2015-03-31   1    3.0

# processing
# =================================================
# get last entry from each month
df_temp = df.groupby(lambda idx: idx.month).tail(1)
# do the cumprod, reindex to have the same index as original df, backward fill
df['Result'] = df_temp['Value'].shift(1).fillna(1).cumprod().reindex(df.index).fillna(method='bfill')

print(df)

            Id  Value  Result
Date                         
2015-01-01   1    0.5     1.0
2015-01-31   1    0.5     1.0
2015-02-01   1    2.0     0.5
2015-02-28   1    2.0     0.5
2015-03-01   1    3.0     1.0
2015-03-31   1    3.0     1.0

For the follow-ups question:

# your data
# ==================================
import pandas as pd
import datetime

table = [[datetime.datetime(2015, 1, 1), 1, 0.5],
         [datetime.datetime(2015, 1, 27), 1, 0.5],
         [datetime.datetime(2015, 1, 31), 1, 0.5],
         [datetime.datetime(2015, 2, 1), 1, 2],
         [datetime.datetime(2015, 2, 3), 1, 2],
         [datetime.datetime(2015, 2, 15), 1, 2], 
         [datetime.datetime(2015, 2, 28), 1, 2],
         [datetime.datetime(2015, 3, 1), 1, 3],
         [datetime.datetime(2015, 3, 17), 1, 3],
         [datetime.datetime(2015, 3, 31), 1, 3]]

df1 = pd.DataFrame(table, columns=['Date', 'Id', 'Value'])
df2 = df1.copy()
df2.Id = 2

df = df1.append(df2)
# better to set Date column to index
df = df.set_index('Date')

print(df)


            Id  Value
Date                 
2015-01-01   1    0.5
2015-01-27   1    0.5
2015-01-31   1    0.5
2015-02-01   1    2.0
2015-02-03   1    2.0
2015-02-15   1    2.0
2015-02-28   1    2.0
2015-03-01   1    3.0
2015-03-17   1    3.0
2015-03-31   1    3.0
2015-01-01   2    0.5
2015-01-27   2    0.5
2015-01-31   2    0.5
2015-02-01   2    2.0
2015-02-03   2    2.0
2015-02-15   2    2.0
2015-02-28   2    2.0
2015-03-01   2    3.0
2015-03-17   2    3.0
2015-03-31   2    3.0

def my_func(group):
    # get last entry from each month
    df_temp = group.groupby(lambda idx: idx.month).tail(1)
    # do the cumprod, reindex to have the same index as original df
    group['Result'] = df_temp['Value'].shift(1).fillna(1).cumprod().reindex(group.index).fillna(method='bfill')
    return group

df.groupby('Id').apply(my_func)



            Id  Value  Result
Date                         
2015-01-01   1    0.5     1.0
2015-01-27   1    0.5     1.0
2015-01-31   1    0.5     1.0
2015-02-01   1    2.0     0.5
2015-02-03   1    2.0     0.5
2015-02-15   1    2.0     0.5
2015-02-28   1    2.0     0.5
2015-03-01   1    3.0     1.0
2015-03-17   1    3.0     1.0
2015-03-31   1    3.0     1.0
2015-01-01   2    0.5     1.0
2015-01-27   2    0.5     1.0
2015-01-31   2    0.5     1.0
2015-02-01   2    2.0     0.5
2015-02-03   2    2.0     0.5
2015-02-15   2    2.0     0.5
2015-02-28   2    2.0     0.5
2015-03-01   2    3.0     1.0
2015-03-17   2    3.0     1.0
2015-03-31   2    3.0     1.0

Upvotes: 1

Related Questions