Jantje Houten
Jantje Houten

Reputation: 151

How to select value from a column in the last non-NA row within group and add it to another column to create a new column

I want to use the last non-NA value from the DISPENSED_DURATION column and add that to the DISPENSED_DATE column to get the final LAST_DATE column within each ID group.

I am currently looking at something like copy[,.SD[.N],ID] to get the last row, but can't figure out how to skip those NA's then add it back to DISPENSED_DATE.

Here is the sample code:

dt = data.table(
  ID = c(1,1,1,1,1,2,2,2,2,2),
  DATE = c("2020-01-01","2020-01-02","2020-01-03","2020-01-04","2020-01-05", "2020-01-06","2020-01-07","2020-01-08","2020-01-09","2020-01-10"),
  PRESCRIBED_DATE = c("2020-01-01","2020-01-02","2020-01-03","2020-01-04","2020-01-05", "2020-01-06","2020-01-07","2020-01-08", NA,"2020-01-10"),
  DISPENSED_DATE = c("2020-01-01","2020-01-02","2020-01-03","2020-01-04","2020-01-05", "2020-01-06","2020-01-07","2020-01-08", "2020-01-09","NA"),
  DISPENSED_DURATION = c(5,5,5,5,5,6,6,6,6,NA)
)

    ID PRESCRIBED_DATE DISPENSED_DATE DISPENSED_DURATION
 1:  1      2020-01-01     2020-01-01                  5
 2:  1      2020-01-02     2020-01-02                  5
 3:  1      2020-01-03     2020-01-03                  5
 4:  1      2020-01-04     2020-01-04                  5
 5:  1      2020-01-05     2020-01-05                  5
 6:  2      2020-01-06     2020-01-06                  6
 7:  2      2020-01-07     2020-01-07                  6
 8:  2      2020-01-08     2020-01-08                  6
 9:  2      2020-01-09     2020-01-09                  6
10:  2      2020-01-10           <NA>                 NA

Expected result:

   ID PRESCRIBED_DATE DISPENSED_DATE DISPENSED_DURATION  LAST_DATE
 1:  1      2020-01-01     2020-01-01                  5       <NA>
 2:  1      2020-01-02     2020-01-02                  5       <NA>
 3:  1      2020-01-03     2020-01-03                  5       <NA>
 4:  1      2020-01-04     2020-01-04                  5       <NA>
 5:  1      2020-01-05     2020-01-05                  5 2020-01-10
 6:  2      2020-01-06     2020-01-06                  6       <NA>
 7:  2      2020-01-07     2020-01-07                  6       <NA>
 8:  2      2020-01-08     2020-01-08                  6       <NA>
 9:  2      2020-01-09     2020-01-09                  6 2020-01-15
10:  2      2020-01-10           <NA>                 NA       <NA>

Thank you!

Upvotes: 0

Views: 43

Answers (3)

Peace Wang
Peace Wang

Reputation: 2419

I get one compact solution.

  1. filter out the NA of DISPENSED_DURATION
  2. add LAST_DATE with fcase only for last row of each group of ID
dt[!is.na(DISPENSED_DURATION),
   LAST_DATE:=fcase(rleidv(DATE) == .N, 
                    as.Date(DISPENSED_DATE) + DISPENSED_DURATION),
   by = ID]

Result:

#    ID       DATE PRESCRIBED_DATE DISPENSED_DATE DISPENSED_DURATION  LAST_DATE
# 1:  1 2020-01-01      2020-01-01     2020-01-01                  5       <NA>
# 2:  1 2020-01-02      2020-01-02     2020-01-02                  5       <NA>
# 3:  1 2020-01-03      2020-01-03     2020-01-03                  5       <NA>
# 4:  1 2020-01-04      2020-01-04     2020-01-04                  5       <NA>
# 5:  1 2020-01-05      2020-01-05     2020-01-05                  5 2020-01-10
# 6:  2 2020-01-06      2020-01-06     2020-01-06                  6       <NA>
# 7:  2 2020-01-07      2020-01-07     2020-01-07                  6       <NA>
# 8:  2 2020-01-08      2020-01-08     2020-01-08                  6       <NA>
# 9:  2 2020-01-09            <NA>     2020-01-09                  6 2020-01-15
#10:  2 2020-01-10      2020-01-10             NA                 NA       <NA>

ps: The "NA" in DISPENSED_DATE should be NA.

Upvotes: 2

s_baldur
s_baldur

Reputation: 33603

# First make sure your data is properly defined, "NA" is not same as NA
# and you can't add characters
dt[, DISPENSED_DATE := as.Date(DISPENSED_DATE)]

# Now select the relevant rows and add the two columns:
dt[dt[, last(.I[!is.na(DISPENSED_DATE)]), by = ID]$V1,
   LAST_DATE := DISPENSED_DATE + DISPENSED_DURATION]

#     ID       DATE PRESCRIBED_DATE DISPENSED_DATE DISPENSED_DURATION  LAST_DATE
#  1:  1 2020-01-01      2020-01-01     2020-01-01                  5       <NA>
#  2:  1 2020-01-02      2020-01-02     2020-01-02                  5       <NA>
#  3:  1 2020-01-03      2020-01-03     2020-01-03                  5       <NA>
#  4:  1 2020-01-04      2020-01-04     2020-01-04                  5       <NA>
#  5:  1 2020-01-05      2020-01-05     2020-01-05                  5 2020-01-10
#  6:  2 2020-01-06      2020-01-06     2020-01-06                  6       <NA>
#  7:  2 2020-01-07      2020-01-07     2020-01-07                  6       <NA>
#  8:  2 2020-01-08      2020-01-08     2020-01-08                  6       <NA>
#  9:  2 2020-01-09            <NA>     2020-01-09                  6 2020-01-15
# 10:  2 2020-01-10      2020-01-10           <NA>                 NA       <NA>

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 389235

Convert the dates to date class first, create an empty Date column (LAST_DATE). Add last non NA DISPENSED_DATE to corresponding DISPENSED_DURATION for each ID.

library(data.table)

dt[, (2:4) := lapply(.SD, as.Date), .SDcols = 2:4]
dt[, LAST_DATE := as.Date(NA)]
dt[, LAST_DATE := {
  inds = max(which(!is.na(DISPENSED_DATE)))
  LAST_DATE[inds] =DISPENSED_DATE[inds] + DISPENSED_DURATION[inds]
  LAST_DATE
}, ID]

dt
#    ID       DATE PRESCRIBED_DATE DISPENSED_DATE DISPENSED_DURATION  LAST_DATE
# 1:  1 2020-01-01      2020-01-01     2020-01-01                  5       <NA>
# 2:  1 2020-01-02      2020-01-02     2020-01-02                  5       <NA>
# 3:  1 2020-01-03      2020-01-03     2020-01-03                  5       <NA>
# 4:  1 2020-01-04      2020-01-04     2020-01-04                  5       <NA>
# 5:  1 2020-01-05      2020-01-05     2020-01-05                  5 2020-01-10
# 6:  2 2020-01-06      2020-01-06     2020-01-06                  6       <NA>
# 7:  2 2020-01-07      2020-01-07     2020-01-07                  6       <NA>
# 8:  2 2020-01-08      2020-01-08     2020-01-08                  6       <NA>
# 9:  2 2020-01-09            <NA>     2020-01-09                  6 2020-01-15
#10:  2 2020-01-10      2020-01-10           <NA>                 NA       <NA>

Upvotes: 1

Related Questions