Reputation: 43
I have the following data structure in the data.table format:
ID Cycle Cycle_Day Cycle_Date Positive_Test_Date
1 1 1 3/28/2019 NA
1 1 2 3/29/2019 NA
1 1 3 3/30/2019 NA
1 1 NA NA 3/29/2019
1 2 1 4/23/2019 NA
1 2 2 4/24/2019 NA
1 2 3 4/25/2019 NA
1 2 NA NA 4/25/2019
2 1 1 3/18/2019 NA
2 1 2 3/19/2019 NA
2 1 3 3/20/2019 NA
2 1 NA NA 3/18/2019
2 2 1 4/23/2019 NA
2 2 2 4/24/2019 NA
2 2 3 4/25/2019 NA
2 2 NA NA 4/24/2019
I would like to create a new column "LH_Date" which will, for every ID and every cycle, copy the date in the event Cycle_Date and Positive_Test_Date match. Otherwise the value is NA. This is how it should look:
ID Cycle Cycle_Day Cycle_Date Positive_Test_Date LH_Date
1 1 1 3/28/2019 NA NA
1 1 2 3/29/2019 NA 3/29/2019
1 1 3 3/30/2019 NA NA
1 1 NA NA 3/29/2019 NA
1 2 1 4/23/2019 NA NA
1 2 2 4/24/2019 NA NA
1 2 3 4/25/2019 NA 4/25/2019
1 2 NA NA 4/25/2019 NA
2 1 1 3/18/2019 NA 3/18/2019
2 1 2 3/19/2019 NA NA
2 1 3 3/20/2019 NA NA
2 1 NA NA 3/18/2019 NA
2 2 1 4/23/2019 NA NA
2 2 2 4/24/2019 NA 4/24/2019
2 2 3 4/25/2019 NA NA
2 2 NA NA 4/24/2019 NA
Upvotes: 4
Views: 2742
Reputation: 25225
Another option is to use indexing to find the rows that fits the condition and update those rows only:
#for each group of ID and Cycle,
#find the row indices where Cycle_Date equals the last Positive_Test_Date
idxDT <- DT[, .I[Cycle_Date==Positive_Test_Date[.N]], .(ID, Cycle)]
#for those row indices, set the LH_Date to be Cycle_Date
#(NA rows or excluded rows defaults to NA by design in data.table)
DT[idxDT$V1, LH_Date := Cycle_Date]
idxDT
looks like this and idxDT$V1
extracts the column V1
:
ID Cycle V1
1: 1 1 2
2: 1 1 NA
3: 1 2 7
4: 1 2 NA
5: 2 1 9
6: 2 1 NA
7: 2 2 14
8: 2 2 NA
.I
contains the row index within a data.table. From ?.I
:
.I is an integer vector equal to seq_len(nrow(x)). While grouping, it holds for each item in the group, its row location in x. This is useful to subset in j; e.g. DT[, .I[which.max(somecol)], by=grp].
output:
ID Cycle Cycle_Day Cycle_Date Positive_Test_Date LH_Date
1: 1 1 1 3/28/2019 <NA> <NA>
2: 1 1 2 3/29/2019 <NA> 3/29/2019
3: 1 1 3 3/30/2019 <NA> <NA>
4: 1 1 NA <NA> 3/29/2019 <NA>
5: 1 2 1 4/23/2019 <NA> <NA>
6: 1 2 2 4/24/2019 <NA> <NA>
7: 1 2 3 4/25/2019 <NA> 4/25/2019
8: 1 2 NA <NA> 4/25/2019 <NA>
9: 2 1 1 3/18/2019 <NA> 3/18/2019
10: 2 1 2 3/19/2019 <NA> <NA>
11: 2 1 3 3/20/2019 <NA> <NA>
12: 2 1 NA <NA> 3/18/2019 <NA>
13: 2 2 1 4/23/2019 <NA> <NA>
14: 2 2 2 4/24/2019 <NA> 4/24/2019
15: 2 2 3 4/25/2019 <NA> <NA>
16: 2 2 NA <NA> 4/24/2019 <NA>
data:
library(data.table)
DT <- fread("ID Cycle Cycle_Day Cycle_Date Positive_Test_Date
1 1 1 3/28/2019 NA
1 1 2 3/29/2019 NA
1 1 3 3/30/2019 NA
1 1 NA NA 3/29/2019
1 2 1 4/23/2019 NA
1 2 2 4/24/2019 NA
1 2 3 4/25/2019 NA
1 2 NA NA 4/25/2019
2 1 1 3/18/2019 NA
2 1 2 3/19/2019 NA
2 1 3 3/20/2019 NA
2 1 NA NA 3/18/2019
2 2 1 4/23/2019 NA
2 2 2 4/24/2019 NA
2 2 3 4/25/2019 NA
2 2 NA NA 4/24/2019")
Upvotes: 1
Reputation: 1312
Assuming your data is in a data.table
named DT
:
First filter DT
where the Positive_Test_Date
is not NA
(since they're not useful for creating the new column)
ptd_notna <- DT[!is.na(Positive_Test_Date)]
Then join your original table with this one on ID
and Cycle
and create a new column idx
which is a boolean indicating whether the Cycle_Date
of the table X is equal to the Positive_Test_Date
of the table i for each group. Think of the join as X[Y]
, so you refer to the columns of X through x.colname
and the columns of Y through i.colname
(since Y is in the i
part of X).
DT[ptd_notna,
idx := x.Cycle_Date == i.Positive_Test_Date,
on = .(ID, Cycle)]
Now you can use this new column idx
to filter your table and assign the new column LH_Date
equal to the Cycle_Date
. Note how to use a boolean column in i
you use paretheses.
DT[(idx), LH_Date := Cycle_Date]
Then remove the idx
if you don't think you'll need it.
DT[, idx := NULL]
Upvotes: 1