user12239114
user12239114

Reputation: 43

How to create a new column in data.table based on values of other columns

I have the following data structure in the data.table format:

ID  Cycle  Cycle_Day Cycle_Date  Positive_Test_Date
1   1      1         3/28/2019   NA
1   1      2         3/29/2019   NA
1   1      3         3/30/2019   NA
1   1      NA        NA          3/29/2019
1   2      1         4/23/2019   NA 
1   2      2         4/24/2019   NA
1   2      3         4/25/2019   NA
1   2      NA        NA          4/25/2019
2   1      1         3/18/2019   NA
2   1      2         3/19/2019   NA
2   1      3         3/20/2019   NA
2   1      NA        NA          3/18/2019
2   2      1         4/23/2019   NA 
2   2      2         4/24/2019   NA
2   2      3         4/25/2019   NA
2   2      NA        NA          4/24/2019

I would like to create a new column "LH_Date" which will, for every ID and every cycle, copy the date in the event Cycle_Date and Positive_Test_Date match. Otherwise the value is NA. This is how it should look:

ID  Cycle  Cycle_Day Cycle_Date  Positive_Test_Date LH_Date
1   1      1         3/28/2019   NA                 NA 
1   1      2         3/29/2019   NA                 3/29/2019
1   1      3         3/30/2019   NA                 NA
1   1      NA        NA          3/29/2019          NA
1   2      1         4/23/2019   NA                 NA
1   2      2         4/24/2019   NA                 NA
1   2      3         4/25/2019   NA                 4/25/2019
1   2      NA        NA          4/25/2019          NA
2   1      1         3/18/2019   NA                 3/18/2019
2   1      2         3/19/2019   NA                 NA
2   1      3         3/20/2019   NA                 NA 
2   1      NA        NA          3/18/2019          NA
2   2      1         4/23/2019   NA                 NA
2   2      2         4/24/2019   NA                 4/24/2019
2   2      3         4/25/2019   NA                 NA
2   2      NA        NA          4/24/2019          NA

Upvotes: 4

Views: 2742

Answers (2)

chinsoon12
chinsoon12

Reputation: 25225

Another option is to use indexing to find the rows that fits the condition and update those rows only:

#for each group of ID and Cycle, 
#find the row indices where Cycle_Date equals the last Positive_Test_Date 
idxDT <- DT[, .I[Cycle_Date==Positive_Test_Date[.N]], .(ID, Cycle)]

#for those row indices, set the LH_Date to be Cycle_Date 
#(NA rows or excluded rows defaults to NA by design in data.table)
DT[idxDT$V1, LH_Date := Cycle_Date]

idxDT looks like this and idxDT$V1 extracts the column V1:

   ID Cycle V1
1:  1     1  2
2:  1     1 NA
3:  1     2  7
4:  1     2 NA
5:  2     1  9
6:  2     1 NA
7:  2     2 14
8:  2     2 NA

.I contains the row index within a data.table. From ?.I:

.I is an integer vector equal to seq_len(nrow(x)). While grouping, it holds for each item in the group, its row location in x. This is useful to subset in j; e.g. DT[, .I[which.max(somecol)], by=grp].

output:

    ID Cycle Cycle_Day Cycle_Date Positive_Test_Date   LH_Date
 1:  1     1         1  3/28/2019               <NA>      <NA>
 2:  1     1         2  3/29/2019               <NA> 3/29/2019
 3:  1     1         3  3/30/2019               <NA>      <NA>
 4:  1     1        NA       <NA>          3/29/2019      <NA>
 5:  1     2         1  4/23/2019               <NA>      <NA>
 6:  1     2         2  4/24/2019               <NA>      <NA>
 7:  1     2         3  4/25/2019               <NA> 4/25/2019
 8:  1     2        NA       <NA>          4/25/2019      <NA>
 9:  2     1         1  3/18/2019               <NA> 3/18/2019
10:  2     1         2  3/19/2019               <NA>      <NA>
11:  2     1         3  3/20/2019               <NA>      <NA>
12:  2     1        NA       <NA>          3/18/2019      <NA>
13:  2     2         1  4/23/2019               <NA>      <NA>
14:  2     2         2  4/24/2019               <NA> 4/24/2019
15:  2     2         3  4/25/2019               <NA>      <NA>
16:  2     2        NA       <NA>          4/24/2019      <NA>

data:

library(data.table)
DT <- fread("ID  Cycle  Cycle_Day Cycle_Date  Positive_Test_Date
1   1      1         3/28/2019   NA
1   1      2         3/29/2019   NA
1   1      3         3/30/2019   NA
1   1      NA        NA          3/29/2019
1   2      1         4/23/2019   NA 
1   2      2         4/24/2019   NA
1   2      3         4/25/2019   NA
1   2      NA        NA          4/25/2019
2   1      1         3/18/2019   NA
2   1      2         3/19/2019   NA
2   1      3         3/20/2019   NA
2   1      NA        NA          3/18/2019
2   2      1         4/23/2019   NA 
2   2      2         4/24/2019   NA
2   2      3         4/25/2019   NA
2   2      NA        NA          4/24/2019")

Upvotes: 1

josemz
josemz

Reputation: 1312

Assuming your data is in a data.table named DT:

First filter DT where the Positive_Test_Date is not NA (since they're not useful for creating the new column)

ptd_notna <- DT[!is.na(Positive_Test_Date)]

Then join your original table with this one on ID and Cycle and create a new column idx which is a boolean indicating whether the Cycle_Date of the table X is equal to the Positive_Test_Date of the table i for each group. Think of the join as X[Y], so you refer to the columns of X through x.colname and the columns of Y through i.colname (since Y is in the i part of X).

DT[ptd_notna, 
   idx := x.Cycle_Date == i.Positive_Test_Date, 
   on = .(ID, Cycle)]

Now you can use this new column idx to filter your table and assign the new column LH_Date equal to the Cycle_Date. Note how to use a boolean column in i you use paretheses.

DT[(idx), LH_Date := Cycle_Date]

Then remove the idx if you don't think you'll need it.

DT[, idx := NULL]

Upvotes: 1

Related Questions