bob
bob

Reputation: 629

Mutate new column based on moving window of fixed date interval size, in R

I have data for one patient in R, which shows dates when they tested positive for a certain condition. The data looks like this:

      date  positive
2005-02-22      yes
2005-04-26       no
2005-08-02      yes
2005-10-04       no
2005-12-06      yes
2006-03-14       no
2006-06-06       no
2006-09-12      yes
2006-12-19      yes
2007-03-27      yes

Now I introduce a new definition. The patient's condition is defined as "chronic positive" if "current test is positive, and >=50% of tests in the prior 365 days were positive". So I want to create an output dataset that tells me whether the patient was chronically positive at each date. For example, the output should look like this (e.g. on 2006-09-12, they are "positive" but not "chronic positive" because 3 out of 4 visits in the previous 365 days were negative):

      date  positive  chronic
2005-02-22      yes        no
2005-04-26       no        no
2005-08-02      yes       yes
2005-10-04       no        no
2005-12-06      yes       yes
2006-03-14       no        no
2006-06-06       no        no
2006-09-12      yes        no
2006-12-19      yes        no
2007-03-27      yes       yes

How can I do this? At each row of interest, I need to be able to look at previous rows (within the last 365 days) and assess what proportion of them were positive. I think I could use a combination of the lead/lag functions and dplyr, but I would appreciate an example of how this can be done.

The original data can be reproduced with:

dat <- structure(list(date = structure(c(12836, 12899, 12997, 13060, 13123, 13221, 13305, 13403, 13501, 13599), class = "Date"), 
                      positive = c("yes", "no", "yes", "no", "yes", "no", "no", "yes", "yes", "yes")), 
                 row.names = c(NA, 10L), class = "data.frame")

Upvotes: 5

Views: 458

Answers (4)

Anoushiravan R
Anoushiravan R

Reputation: 21938

You can also use this solution in case you don't want to use rolling functions:

library(dplyr)
library(purrr)
library(lubridate)

map(df %>% 
      filter(positive == "yes") %>% 
      pull(date), ~ df %>% filter(date %within% interval(.x - days(365), .x))) %>% 
  map_dfr(~ .x %>% 
        summarise(date = last(date),
                  chronic = (sum(positive == "yes")-1)/ (n()-1) >= 0.5)) %>%
  right_join(df, by = "date") %>%
  arrange(date) %>%
  mutate(chronic = if_else(is.na(chronic) | !chronic, "no", "yes"))

# A tibble: 10 x 3
   date       chronic positive
   <chr>      <chr>   <chr>   
 1 2005-02-22 no      yes     
 2 2005-04-26 no      no      
 3 2005-08-02 yes     yes     
 4 2005-10-04 no      no      
 5 2005-12-06 yes     yes     
 6 2006-03-14 no      no      
 7 2006-06-06 no      no      
 8 2006-09-12 no      yes     
 9 2006-12-19 no      yes     
10 2007-03-27 yes     yes 

Upvotes: 3

chinsoon12
chinsoon12

Reputation: 25225

Another option using non-equi join in data.table:

library(data.table)
setDT(dat)[, yrago := date - 365L]
dat[, chronic := fifelse(
    .SD[.SD, on=.(date>=yrago, date<date), 
        by=.EACHI, .N>0 & i.positive=="yes" & sum(x.positive=="yes")/.N >= 0.5]$V1,
    "yes", "no")
]
dat[, yrago := NULL][]

output:

          date positive chronic
 1: 2005-02-22      yes      no
 2: 2005-04-26       no      no
 3: 2005-08-02      yes     yes
 4: 2005-10-04       no      no
 5: 2005-12-06      yes     yes
 6: 2006-03-14       no      no
 7: 2006-06-06       no      no
 8: 2006-09-12      yes      no
 9: 2006-12-19      yes      no
10: 2007-03-27      yes     yes

Upvotes: 4

AnilGoyal
AnilGoyal

Reputation: 26238

You can make use slider library for such rolling computation. Syntax explanation -

  • slide_index_lgl works on a vector .x and an index .i simultaneously and produces a logical vector output.
  • .x is used as positive vector
  • .i is used as date vector
  • .before and .after are self explanatory (previous 365 days included and current day excluded)
  • .f is simple where test positivity in previous 365 days is checked
  • this output is combined with another condition i.e. positive == 'yes' I used this formula (sum(.x == 'yes') / length(.x)) >= 0.5
  • 1 is added to this logical output, giving us 1 for FALSE and 2 for TRUE
  • This complete output is used as an index for output vector c('No', 'Yes') so that you'll get YesforTRUEandNoforFALSE`
library(tidyverse)

df <- read.table(header = TRUE, text = 'date  positive
2005-02-22      yes
2005-04-26       no
2005-08-02      yes
2005-10-04       no
2005-12-06      yes
2006-03-14       no
2006-06-06       no
2006-09-12      yes
2006-12-19      yes
2007-03-27      yes')

df$date <- as.Date(df$date)

library(slider)
library(lubridate)

df %>%
  mutate(chronic = c('No', "Yes")[1 + (positive == 'yes' &  slide_index_lgl(positive, date, 
                              ~ (sum(.x == 'yes') / length(.x)) >= 0.5  , 
                              .before = days(365), 
                              .after = days(-1)))])

#>          date positive chronic
#> 1  2005-02-22      yes    <NA>
#> 2  2005-04-26       no      No
#> 3  2005-08-02      yes     Yes
#> 4  2005-10-04       no      No
#> 5  2005-12-06      yes     Yes
#> 6  2006-03-14       no      No
#> 7  2006-06-06       no      No
#> 8  2006-09-12      yes      No
#> 9  2006-12-19      yes      No
#> 10 2007-03-27      yes     Yes

Alternative strategy using runner::runner() in baseR

dat <- structure(list(date = structure(c(12836, 12899, 12997, 13060, 13123, 13221, 13305, 13403, 13501, 13599), class = "Date"), 
                      positive = c("yes", "no", "yes", "no", "yes", "no", "no", "yes", "yes", "yes")), 
                 row.names = c(NA, 10L), class = "data.frame")

library(runner)

dat$chronic <- ifelse(runner(dat$positive, idx = dat$date, lag = '1 day',
                             k = '365 days',
                             f = \(.x) (sum(.x == 'yes')/length(.x)) >= 0.5) & dat$positive == 'yes', 'yes', 'no')
dat
#>          date positive chronic
#> 1  2005-02-22      yes    <NA>
#> 2  2005-04-26       no      no
#> 3  2005-08-02      yes     yes
#> 4  2005-10-04       no      no
#> 5  2005-12-06      yes     yes
#> 6  2006-03-14       no      no
#> 7  2006-06-06       no      no
#> 8  2006-09-12      yes      no
#> 9  2006-12-19      yes      no
#> 10 2007-03-27      yes     yes

Upvotes: 5

Ronak Shah
Ronak Shah

Reputation: 389175

Here is one way -

library(dplyr)
library(purrr)

dat %>%
  mutate(chronic = map_chr(row_number(), ~{
    inds <- between(date, date[.x] - 365, date[.x] - 1)
    if(positive[.x] == "yes" && any(inds) && mean(positive[inds] == 'yes') >= 0.5) 'yes' else 'no'
    }))

#         date positive chronic
#1  2005-02-22      yes      no
#2  2005-04-26       no      no
#3  2005-08-02      yes     yes
#4  2005-10-04       no      no
#5  2005-12-06      yes     yes
#6  2006-03-14       no      no
#7  2006-06-06       no      no
#8  2006-09-12      yes      no
#9  2006-12-19      yes      no
#10 2007-03-27      yes     yes

Upvotes: 2

Related Questions