jos0909
jos0909

Reputation: 77

Conditional sum in R w NA values?

I am trying to calculate the number of assays a given patient has within pre-specified time periods. E.g., between 14 and 45 days after a patient receives a third dose of vaccine, how many assays were taken? However, I don't want to include any assays taken after the patient receives a fourth dose of vaccine.

My dataset is in long format and contains a variable indicating each date that an assay was completed, and variables for the date of each vaccination. Below is a contrived example of my data frame.

     id      assay_date   dose_3_date   dose_4_date
1   1001      20mar2021   22feb2021     17aug2021
2   1001      06jun2021   22feb2021     17aug2021      
3   1001      24sep2021   22feb2021     17aug2021
4   1001      19nov2021   22feb2021     17aug2021
5   1006      29apr2021   02apr2021     .
6   1006      23may2021   02apr2021     .
7   1006      15jun2021   02apr2021     .

I'm unsure how I can sum the cases where the date of the assay falls in my pre-specified date range, while at the same time ensuring that I'm not including assays taken after a fourth vaccine dose. The challenge is that most of the patients in my dataset have not received a fourth dose and therefore have a missing value for dose_4_date.

My first thought was to use case_when to make a flag for the cases in which the assay_date is between 14 and 45 days after the dose_3_date, but not after the dose_4_date, and then sum the flags somehow. Below is what I've written so far:

df %>% mutate(post = case_when(assay_date >= dose_3_date+14 & assay_date <= dose_3_date+45 
                               & assay_date <= dose_4_date & !is.na(dose_4_date) ~ 1),
              
              post3 = case_when(assay_date >= dose_3_date+60 & assay_date <= dose_3_date+120
                                & assay_date <= dose_4_date & !is.na(dose_4_date) ~ 1),
              
              post6 = case_when(assay_date >= dose_3_date+135 & assay_date <= dose_3_date+210
                                & assay_date <= dose_4_date & !is.na(dose_4_date) ~ 1))

The above code works well for patients with a dose_4_date, but results in NA values for those with a "missing" dose_4_date. I'm unsure how I can ignore the NAs for patients with a missing dose_4_date.

I'm also unsure how to sum the flags afterward.

Any advice would be greatly appreciated!

Upvotes: 1

Views: 51

Answers (2)

crestor
crestor

Reputation: 1466

library(tidyverse)

df <- tibble::tribble(
  ~id, ~assay_date, ~dose_3_date, ~dose_4_date,
  1001L, "20mar2021",  "22feb2021",  "17aug2021",
  1001L, "06jun2021",  "22feb2021",  "17aug2021",
  1001L, "24sep2021",  "22feb2021",  "17aug2021",
  1001L, "19nov2021",  "22feb2021",  "17aug2021",
  1006L, "29apr2021",  "02apr2021",           NA,
  1006L, "23may2021",  "02apr2021",           NA,
  1006L, "15jun2021",  "02apr2021",           NA
)

df |>
  mutate(across(-id, lubridate::dmy)) |>
  filter(between(as.integer(assay_date - dose_3_date), 14, 45)
         & (assay_date <= dose_4_date | is.na(dose_4_date))) |>
  count(id)
#> # A tibble: 2 × 2
#>      id     n
#>   <int> <int>
#> 1  1001     1
#> 2  1006     1

Upvotes: 1

Sweepy Dodo
Sweepy Dodo

Reputation: 1863

library(data.table)

# dummy data
df <- data.table(id = rep(c(1,2), times=c(4,3))
                 , assay_date = c('20mar2021', '06jun2021', '24sep2021', '19nov2021', '29apr2021', '23may2021', '15jun2021')
                 , dose_3_date = rep(c('22feb2021', '02apr2021'), times=c(4,3))
                 , dose_4_date = c(rep(c('17aug2021', NA), times=c(4,3)))
                 ); df

# set as data.table if yours isn't one already
setDT(df)

# as.Date
x <- c("assay_date", "dose_3_date", "dose_4_date")
df[, (x) := lapply(.SD, \(i) as.Date(i, format="%d%b%Y")), .SDcols=x
   ][, date_diff := assay_date - dose_3_date   # calculate date diff
     ]

# flag rows which fit criteria
df[date_diff %between% c(14, 45)
   & (assay_date <= dose_4_date
      | is.na(dose_4_date)
      )
   , fits_criteria := 1
   ]

# count per patient
df[, .(assays_in_period = sum(fits_criteria, na.rm=T)), id]


   id assays_in_period
1:  1                1
2:  2                1

Upvotes: 1

Related Questions