Reputation: 75
I'm trying to figure out the best way to populate a column in a DataFrame based on the values in a combination of the remaining columns.
I want to create a column v2 such that, every time a 1 is encountered in v1, the date at which the 1 was encountered and the previous 3 dates are filled with 1s in the new column, for the same ID only.
This is what the DataFrame would look like originally:
date id v1
0 2017-05-29 5206 0
1 2017-05-30 5206 0
2 2017-05-31 5206 0
3 2017-06-01 5206 0
4 2017-06-02 5206 0
5 2017-06-03 5206 0
6 2017-06-04 5206 1
7 2017-05-29 5207 0
8 2017-05-30 5207 1
9 2017-05-31 5207 0
10 2017-06-01 5207 1
11 2017-06-02 5207 0
12 2017-06-03 5207 0
13 2017-06-04 5207 0
14 2017-05-29 5208 0
15 2017-05-30 5208 1
16 2017-05-31 5208 0
17 2017-06-01 5208 0
18 2017-06-02 5208 0
19 2017-06-03 5208 0
20 2017-06-04 5208 1
And what I want is:
date id v1 v2
0 2017-05-29 5206 0 0
1 2017-05-30 5206 0 0
2 2017-05-31 5206 0 0
3 2017-06-01 5206 0 1
4 2017-06-02 5206 0 1
5 2017-06-03 5206 0 1
6 2017-06-04 5206 1 1
7 2017-05-29 5207 0 1
8 2017-05-30 5207 1 1
9 2017-05-31 5207 0 1
10 2017-06-01 5207 1 1
11 2017-06-02 5207 0 0
12 2017-06-03 5207 0 0
13 2017-06-04 5207 0 0
14 2017-05-29 5208 0 1
15 2017-05-30 5208 1 1
16 2017-05-31 5208 0 0
17 2017-06-01 5208 0 1
18 2017-06-02 5208 0 1
19 2017-06-03 5208 0 1
20 2017-06-04 5208 1 1
The code to recreate the original sample DataFrame in Python is below:
import pandas as pd
import datetime
base = datetime.datetime(2017, 5, 29)  # start date of the sample shown above
df = pd.DataFrame()
df['date'] = 3 * [base + datetime.timedelta(days=x) for x in range(0, 7)]
df['date'] = df['date'].dt.floor('d')
df['id'] = sorted(7 * list(range(5206, 5209)))
df['v1'] = [0, 0, 0, 0, 0, 0, 1,
            0, 1, 0, 1, 0, 0, 0,
            0, 1, 0, 0, 0, 0, 1]
Please also note that I don't mind if this is done using R, as I have no preference between R and Python for this task.
Upvotes: 1
Views: 99
Reputation: 66819
With R and data.table:
library(data.table)
setDT(DF)
DF[, v := do.call(pmax, shift(v1, 0:3, type="lead", fill=0L)), by=id]
date id v1 v2 v
1: 2017-05-29 5206 0 0 0
2: 2017-05-30 5206 0 0 0
3: 2017-05-31 5206 0 0 0
4: 2017-06-01 5206 0 1 1
5: 2017-06-02 5206 0 1 1
6: 2017-06-03 5206 0 1 1
7: 2017-06-04 5206 1 1 1
8: 2017-05-29 5207 0 1 1
9: 2017-05-30 5207 1 1 1
10: 2017-05-31 5207 0 1 1
11: 2017-06-01 5207 1 1 1
12: 2017-06-02 5207 0 0 0
13: 2017-06-03 5207 0 0 0
14: 2017-06-04 5207 0 0 0
15: 2017-05-29 5208 0 1 1
16: 2017-05-30 5208 1 1 1
17: 2017-05-31 5208 0 0 0
18: 2017-06-01 5208 0 1 1
19: 2017-06-02 5208 0 1 1
20: 2017-06-03 5208 0 1 1
21: 2017-06-04 5208 1 1 1
date id v1 v2 v
How it works: shift with type "lead" looks ahead, in this case at distances of 0, 1, 2 or 3 (with undefined values replaced with zero). pmax takes the maximum value across these vectors, elementwise.
Similarly, from @RyanD's comment:
DF[order(date), v :=
do.call(pmax, shift(v1, 0:3, type="lead", fill=0L))
, by=id]
This has the advantage that it works even if the data is not sorted by date. It sorts the data temporarily, just while constructing the column.
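Since the question is open to either language, the same look-ahead-and-take-the-elementwise-max idea can also be sketched in pandas (a rough sketch only, assuming the df built by the question's code above):

import pandas as pd

# look ahead 0-3 rows within each id (a "lead"), treat values that run off
# the end of a group as 0, then take the elementwise maximum across the shifts
leads = [df.groupby('id')['v1'].shift(-k).fillna(0) for k in range(4)]
df['v2'] = pd.concat(leads, axis=1).max(axis=1).astype(int)

Like the data.table version, this works on row positions, so it assumes one row per consecutive date within each id.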
Alternatively, do a rolling join:
DF[, date := as.IDate(date)] # format
DF[, v := DF[v1 == 1][.SD, on=.(id, date), roll=-3, .N, by=.EACHI]$N]
This has the advantage that it works even if the enumeration of dates is incomplete. It looks up each row of DF in DF[v1 == 1], counting any match 0-3 days in the future.
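A loosely analogous pandas approach can key on the dates themselves with merge_asof, matching each row to the nearest v1 == 1 row 0-3 days ahead for the same id (again only a sketch; the events frame and the hit_date column are names made up here for illustration, and date is assumed to be a datetime column as in the question's code):

import pandas as pd

# rows where the event occurs, keyed by id and the event date
events = (df.loc[df.v1 == 1, ['id', 'date']]
            .rename(columns={'date': 'hit_date'})
            .sort_values('hit_date'))

# merge_asof needs the 'on' keys sorted; keep the original index so the
# result can be put back in the original row order
left = df.sort_values('date').reset_index()
merged = pd.merge_asof(left, events,
                       left_on='date', right_on='hit_date', by='id',
                       direction='forward', tolerance=pd.Timedelta(days=3))

# a row gets a 1 if some event for the same id falls 0-3 days ahead
df['v2'] = merged.set_index('index')['hit_date'].notna().astype(int)

Like the rolling join, this matches on actual dates rather than row positions, so gaps in the date sequence are handled.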
Data:
DF = structure(list(date = c("2017-05-29", "2017-05-30", "2017-05-31",
"2017-06-01", "2017-06-02", "2017-06-03", "2017-06-04", "2017-05-29",
"2017-05-30", "2017-05-31", "2017-06-01", "2017-06-02", "2017-06-03",
"2017-06-04", "2017-05-29", "2017-05-30", "2017-05-31", "2017-06-01",
"2017-06-02", "2017-06-03", "2017-06-04"), id = c(5206L, 5206L,
5206L, 5206L, 5206L, 5206L, 5206L, 5207L, 5207L, 5207L, 5207L,
5207L, 5207L, 5207L, 5208L, 5208L, 5208L, 5208L, 5208L, 5208L,
5208L), v1 = c(0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L,
0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L), v2 = c(0L, 0L, 0L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L
)), .Names = c("date", "id", "v1", "v2"), row.names = c(NA, -21L
), class = "data.frame")
Upvotes: 2
Reputation: 323316
Solution from pandas, with bfill and limit:
df.v1.where(df.v1==1).groupby(df['id']).bfill(3).fillna(0)
Out[223]:
0 0.0
1 0.0
2 0.0
3 1.0
4 1.0
5 1.0
6 1.0
7 1.0
8 1.0
9 1.0
10 1.0
11 0.0
12 0.0
13 0.0
14 1.0
15 1.0
16 0.0
17 1.0
18 1.0
19 1.0
20 1.0
Name: v1, dtype: float64
#df['v2']=df.v1.where(df.v1==1).groupby(df['id']).bfill(3).fillna(0)
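The values come out as floats because where replaces the zeros with NaN before the backfill; if v2 should be integer like the expected output, a cast at the end does it (a small variation on the commented line above):

df['v2'] = df.v1.where(df.v1 == 1).groupby(df['id']).bfill(limit=3).fillna(0).astype(int)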
Upvotes: 1