Swasti
Swasti

Reputation: 287

Add a new dataframe column which counts the values in certain column less than the date prior to the time

EDITED

I want to add a new column called prev_message_left which counts the no. of messages_left per ID less than the date prior the given time. Basically I want to have a column which says how many times we had left message on call to that customer prior to the current time and date. This is how my data frame looks like

date       ID   call_time   message_left  
20191101    1        8:00              0
20191102    2        9:00              1
20191030    1       16:00              1
20191103    2       10:30              1
20191105    2       14:00              0
20191030    1       15:30              0

I want to add an additional column called prev_message_left_count

date       ID   call_time   message_left  prev_message_left_count
20191101    1        8:00              0                        1
20191102    2        9:00              1                        0
20191030    1       16:00              1                        0
20191103    2       10:30              1                        1
20191105    2       14:00              0                        2
20191030    1       15:30              0                        0

My dataframe has 15 columns and 90k rows.
I have various other columns in this dataframe and there are columns like 'No Message Left', 'Responded' for which I will have to compute additional columns called 'Previous_no_message_left' and 'prev_responded' similar to 'prev_message_left'

Upvotes: 2

Views: 64

Answers (1)

ansev
ansev

Reputation: 30920

Use DataFrame.sort_values to get the cumulative sum in the correct order by groups. You can create groups using DataFrame.groupby:

df['prev_message_left_count']= (df.sort_values(['date','call_time'])
                                  .groupby('ID')['message_left']
                                  .apply(lambda x: x.shift(fill_value=0)
                                                  .cumsum()) )

print(df)


       date  ID call_time  message_left  prev_message_left_count
0  20191101   1      8:00             0                        1
1  20191102   2      9:00             1                        0
2  20191030   1     16:00             1                        0
3  20191103   2     10:30             1                        1
4  20191105   2     14:00             0                        2
5  20191030   1     15:30             0                        0

sometimes GroupBy.apply is slow so it may be advisable

df['prev_message_left_count']=( df.sort_values(['date','call_time'])
                                  .groupby('ID')
                                  .shift(fill_value=0)
                                  .groupby(df['ID'])['message_left']
                                  .cumsum()

Upvotes: 1

Related Questions