Reputation: 75
I'm trying to figure out the best way to populate a column in a DataFrame based on the values in a combination of the remaining columns.
I want to create a column v2 such that, every time a 1 is encountered in v1, the date at which the 1 was encountered and the previous 3 dates are filled with 1s in the new column, for the same ID only.
This is what the DataFrame would look like originally:
date id v1
0 2017-05-29 5206 0
1 2017-05-30 5206 0
2 2017-05-31 5206 0
3 2017-06-01 5206 0
4 2017-06-02 5206 0
5 2017-06-03 5206 0
6 2017-06-04 5206 1
7 2017-05-29 5207 0
8 2017-05-30 5207 1
9 2017-05-31 5207 0
10 2017-06-01 5207 1
11 2017-06-02 5207 0
12 2017-06-03 5207 0
13 2017-06-04 5207 0
14 2017-05-29 5208 0
15 2017-05-30 5208 1
16 2017-05-31 5208 0
17 2017-06-01 5208 0
18 2017-06-02 5208 0
19 2017-06-03 5208 0
20 2017-06-04 5208 1
And what I want is:
date id v1 v2
0 2017-05-29 5206 0 0
1 2017-05-30 5206 0 0
2 2017-05-31 5206 0 0
3 2017-06-01 5206 0 1
4 2017-06-02 5206 0 1
5 2017-06-03 5206 0 1
6 2017-06-04 5206 1 1
7 2017-05-29 5207 0 1
8 2017-05-30 5207 1 1
9 2017-05-31 5207 0 1
10 2017-06-01 5207 1 1
11 2017-06-02 5207 0 0
12 2017-06-03 5207 0 0
13 2017-06-04 5207 0 0
14 2017-05-29 5208 0 1
15 2017-05-30 5208 1 1
16 2017-05-31 5208 0 0
17 2017-06-01 5208 0 1
18 2017-06-02 5208 0 1
19 2017-06-03 5208 0 1
20 2017-06-04 5208 1 1
The code to recreate the original sample DataFrame in Python is below:
import pandas as pd
import datetime
base = datetime.datetime(2017, 5, 29)  # start date of the sample shown above
df = pd.DataFrame()
df['date'] = 3 * [base + datetime.timedelta(days=x) for x in range(0, 7)]
df['date'] = df['date'].dt.floor('d')
df['id'] = sorted(7 * list(range(5206, 5209)))
df['v1'] = [0, 0, 0, 0, 0, 0, 1,
            0, 1, 0, 1, 0, 0, 0,
            0, 1, 0, 0, 0, 0, 1]
Please also note that I don't mind if this is done using R, as I have no preference between R and Python for this task.
Upvotes: 1
Views: 99
Reputation: 66819
With R and data.table:
library(data.table)
setDT(DF)
DF[, v := do.call(pmax, shift(v1, 0:3, type="lead", fill=0L)), by=id]
date id v1 v2 v
1: 2017-05-29 5206 0 0 0
2: 2017-05-30 5206 0 0 0
3: 2017-05-31 5206 0 0 0
4: 2017-06-01 5206 0 1 1
5: 2017-06-02 5206 0 1 1
6: 2017-06-03 5206 0 1 1
7: 2017-06-04 5206 1 1 1
8: 2017-05-29 5207 0 1 1
9: 2017-05-30 5207 1 1 1
10: 2017-05-31 5207 0 1 1
11: 2017-06-01 5207 1 1 1
12: 2017-06-02 5207 0 0 0
13: 2017-06-03 5207 0 0 0
14: 2017-06-04 5207 0 0 0
15: 2017-05-29 5208 0 1 1
16: 2017-05-30 5208 1 1 1
17: 2017-05-31 5208 0 0 0
18: 2017-06-01 5208 0 1 1
19: 2017-06-02 5208 0 1 1
20: 2017-06-03 5208 0 1 1
21: 2017-06-04 5208 1 1 1
date id v1 v2 v
How it works: shift with type "lead" looks ahead, in this case at distances of 0, 1, 2 or 3 (with undefined values replaced with zero). pmax takes the maximum value across these vectors, elementwise.
Similarly, from @RyanD's comment:
DF[order(date), v :=
do.call(pmax, shift(v1, 0:3, type="lead", fill=0L))
, by=id]
This has the advantage that it works even if the data is not sorted by date. It sorts the data temporarily, just while constructing the column.
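Since the question is open to either language, the same look-ahead-and-take-the-elementwise-max idea can also be sketched in pandas (a rough sketch only, assuming the df built by the question's code above):

import pandas as pd

# look ahead 0-3 rows within each id (a "lead"), treat values that run off
# the end of a group as 0, then take the elementwise maximum across the shifts
leads = [df.groupby('id')['v1'].shift(-k).fillna(0) for k in range(4)]
df['v2'] = pd.concat(leads, axis=1).max(axis=1).astype(int)

Like the data.table version, this works on row positions, so it assumes one row per consecutive date within each id.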
Alternatively, do a rolling join:
DF[, date := as.IDate(date)] # format
DF[, v := DF[v1 == 1][.SD, on=.(id, date), roll=-3, .N, by=.EACHI]$N]
This has the advantage that it works even if the enumeration of dates is incomplete. It looks up each row of DF in DF[v1 == 1], counting any match 0-3 days in the future.
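A loosely analogous pandas approach can key on the dates themselves with merge_asof, matching each row to the nearest v1 == 1 row 0-3 days ahead for the same id (again only a sketch; the events frame and the hit_date column are names made up here for illustration, and date is assumed to be a datetime column as in the question's code):

import pandas as pd

# rows where the event occurs, keyed by id and the event date
events = (df.loc[df.v1 == 1, ['id', 'date']]
            .rename(columns={'date': 'hit_date'})
            .sort_values('hit_date'))

# merge_asof needs the 'on' keys sorted; keep the original index so the
# result can be put back in the original row order
left = df.sort_values('date').reset_index()
merged = pd.merge_asof(left, events,
                       left_on='date', right_on='hit_date', by='id',
                       direction='forward', tolerance=pd.Timedelta(days=3))

# a row gets a 1 if some event for the same id falls 0-3 days ahead
df['v2'] = merged.set_index('index')['hit_date'].notna().astype(int)

Like the rolling join, this matches on actual dates rather than row positions, so gaps in the date sequence are handled.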
Data:
DF = structure(list(date = c("2017-05-29", "2017-05-30", "2017-05-31",
"2017-06-01", "2017-06-02", "2017-06-03", "2017-06-04", "2017-05-29",
"2017-05-30", "2017-05-31", "2017-06-01", "2017-06-02", "2017-06-03",
"2017-06-04", "2017-05-29", "2017-05-30", "2017-05-31", "2017-06-01",
"2017-06-02", "2017-06-03", "2017-06-04"), id = c(5206L, 5206L,
5206L, 5206L, 5206L, 5206L, 5206L, 5207L, 5207L, 5207L, 5207L,
5207L, 5207L, 5207L, 5208L, 5208L, 5208L, 5208L, 5208L, 5208L,
5208L), v1 = c(0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L,
0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L), v2 = c(0L, 0L, 0L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L
)), .Names = c("date", "id", "v1", "v2"), row.names = c(NA, -21L
), class = "data.frame")
Upvotes: 2
Reputation: 323316
Solution from pandas, with bfill and limit:
df.v1.where(df.v1==1).groupby(df['id']).bfill(3).fillna(0)
Out[223]:
0 0.0
1 0.0
2 0.0
3 1.0
4 1.0
5 1.0
6 1.0
7 1.0
8 1.0
9 1.0
10 1.0
11 0.0
12 0.0
13 0.0
14 1.0
15 1.0
16 0.0
17 1.0
18 1.0
19 1.0
20 1.0
Name: v1, dtype: float64
#df['v2']=df.v1.where(df.v1==1).groupby(df['id']).bfill(3).fillna(0)
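The values come out as floats because where replaces the zeros with NaN before the backfill; if v2 should be integer like the expected output, a cast at the end does it (a small variation on the commented line above):

df['v2'] = df.v1.where(df.v1 == 1).groupby(df['id']).bfill(limit=3).fillna(0).astype(int)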
Upvotes: 1