Tom O
Tom O

Reputation: 1597

Dplyr conditional logic count number of rows

Hi there I'm trying to calculate the instances of an occurrence in my dataset. It's a very large dataset.

See below for the example:

     visitid   procedureid  collectiondatetime source status
     100       100.644      2016-12-03 17:20:00      N   COMP
     100       100.644      2017-09-21 12:00:00     RS   COMP
     100       100.644      2017-08-25 15:48:00      N   COMP
     100       100.644      2017-09-01 12:00:00     RS   COMP
     100       100.644      2017-08-23 10:31:00      N    CAN
     200       100.644      2017-09-01 14:00:00      N   COMP

I would like to determine if during a visit a procedure was cancelled (status= CAN) if the same procedure from same source is repeated at a later date and completed (status = COMP).
I would just like to summarize the number of times this occurs to determine if these cancelled occurrences are eventually corrected.

I've been away from R for a while and can't seem to figure this out.

Upvotes: 1

Views: 409

Answers (1)

Hallie Swan
Hallie Swan

Reputation: 2774

Updated Answer

Maybe this will work better. I made a new dataset that has the following cases:

  1. Multiple CAN before a CAN/COMP pair (visitid = 100)
  2. Multiple COMP before a CAN/COMP pair (visitid = 200)
  3. COMP without a CAN (visitid = 300)
  4. CAN without a COMP (visitid = 400)

I'm assuming that regardless the number of CAN before a CAN/COMP pair, we will only consider it "corrected" once. So, in this dataset, we'd expect 2 corrections.

## read in data
text <- "visitid procedureid  collectiondatetime source status
100     100.644 2016-06-01 17:20:00      N    CAN
100     100.644 2016-12-03 17:20:00      N    CAN
100     100.644 2017-08-23 10:31:00      N    CAN
100     100.644 2017-08-25 15:48:00      N   COMP
200     100.644 2017-09-01 12:00:00     RS   COMP
200     100.644 2017-09-21 12:00:00     RS   COMP
200     100.644 2017-09-01 14:00:00     RS   COMP
200     100.644 2017-10-01 14:00:00     RS    CAN
200     100.644 2017-11-01 14:00:00     RS   COMP
300     100.644 2017-11-02 14:00:00     RS   COMP
400     100.644 2017-12-01 14:00:00     RS   CAN"
file <- textConnection(text, encoding = "UTF-8")
coln <- readLines(file, n = 1)
coln <- strsplit(coln, " ")[[1]]
coln <- coln[coln != ""]
on.exit(close(file))
df <- read.fwf(file = file, 
               widths = c(3, 12, 20, 7, 7),
               strip.white = TRUE,
               stringsAsFactors = FALSE)
colnames(df) <- coln
rm(coln, file, text)

We can then match each CAN with the next status. If the next status (by date/time) is COMP, then it is considered "corrected".

library(tidyr)
library(dplyr)
test <- df %>%
        arrange(visitid, procedureid, source, collectiondatetime) %>%
        group_by(visitid, procedureid, source) %>%
        mutate(corrected = ifelse(status == "COMP", NA, 
                                 ifelse(lead(status) == "COMP", TRUE, NA))) %>%
        ungroup() %>%
        summarise(n = sum(corrected, na.rm = TRUE))
test


Original Answer

Not sure if this is exactly what you want, but here's an option. You could create pairs of CAN/COMP for each visit, procedure, and source. Then you could spread proceduredatetime by status, so you can check for pairs where the procedure was completed after it was cancelled.

library(dplyr)
library(tidyr)
test <- df %>%
    mutate(collectiondatetime = as.POSIXct(collectiondatetime)) %>%
    arrange(visitid, procedureid, source, collectiondatetime) %>%
    group_by(visitid, procedureid, source, status) %>%
    mutate(pair = row_number()) %>%
    spread(status, collectiondatetime) %>%
    ungroup() %>%
    mutate(corrected = CAN < COMP)
as.data.frame(test)

# output
  visitid procedureid source pair                 CAN                COMP corrected
1     100     100.644      N    1 2017-08-23 10:31:00 2017-08-25 15:48:00      TRUE
2     100     100.644     RS    1                <NA> 2017-09-01 12:00:00        NA
3     100     100.644     RS    2                <NA> 2017-09-21 12:00:00        NA
4     200     100.644      N    1                <NA> 2017-09-01 14:00:00        NA

You can then get the sum of corrected to get the number of times the procedure was performed after cancelling:

sum(test$corrected, na.rm = TRUE)

Note: this assumes that for a given visit, procedure, and source, there is never a COMP procedure followed by a CAN and then a COMP. If this situation does occur, you could first remove COMP procedures performed before CAN procedures for each visit/procedure/source.

Upvotes: 1

Related Questions