Reputation: 1597
Hi there I'm trying to calculate the instances of an occurrence in my dataset. It's a very large dataset.
See below for the example:
visitid procedureid collectiondatetime source status
100 100.644 2016-12-03 17:20:00 N COMP
100 100.644 2017-09-21 12:00:00 RS COMP
100 100.644 2017-08-25 15:48:00 N COMP
100 100.644 2017-09-01 12:00:00 RS COMP
100 100.644 2017-08-23 10:31:00 N CAN
200 100.644 2017-09-01 14:00:00 N COMP
I would like to determine if during a visit a procedure was cancelled (status= CAN) if the same procedure from same source is repeated at a later date and completed (status = COMP).
I would just like to summarize the number of times this occurs to determine if these cancelled occurrences are eventually corrected.
I've been away from R for a while and can't seem to figure this out.
Upvotes: 1
Views: 409
Reputation: 2774
Maybe this will work better. I made a new dataset that has the following cases:
I'm assuming that regardless the number of CAN before a CAN/COMP pair, we will only consider it "corrected" once. So, in this dataset, we'd expect 2 corrections.
## read in data
text <- "visitid procedureid collectiondatetime source status
100 100.644 2016-06-01 17:20:00 N CAN
100 100.644 2016-12-03 17:20:00 N CAN
100 100.644 2017-08-23 10:31:00 N CAN
100 100.644 2017-08-25 15:48:00 N COMP
200 100.644 2017-09-01 12:00:00 RS COMP
200 100.644 2017-09-21 12:00:00 RS COMP
200 100.644 2017-09-01 14:00:00 RS COMP
200 100.644 2017-10-01 14:00:00 RS CAN
200 100.644 2017-11-01 14:00:00 RS COMP
300 100.644 2017-11-02 14:00:00 RS COMP
400 100.644 2017-12-01 14:00:00 RS CAN"
file <- textConnection(text, encoding = "UTF-8")
coln <- readLines(file, n = 1)
coln <- strsplit(coln, " ")[[1]]
coln <- coln[coln != ""]
on.exit(close(file))
df <- read.fwf(file = file,
widths = c(3, 12, 20, 7, 7),
strip.white = TRUE,
stringsAsFactors = FALSE)
colnames(df) <- coln
rm(coln, file, text)
We can then match each CAN with the next status. If the next status (by date/time) is COMP, then it is considered "corrected".
library(tidyr)
library(dplyr)
test <- df %>%
arrange(visitid, procedureid, source, collectiondatetime) %>%
group_by(visitid, procedureid, source) %>%
mutate(corrected = ifelse(status == "COMP", NA,
ifelse(lead(status) == "COMP", TRUE, NA))) %>%
ungroup() %>%
summarise(n = sum(corrected, na.rm = TRUE))
test
Not sure if this is exactly what you want, but here's an option. You could create pairs of CAN/COMP for each visit, procedure, and source. Then you could spread proceduredatetime
by status
, so you can check for pairs where the procedure was completed after it was cancelled.
library(dplyr)
library(tidyr)
test <- df %>%
mutate(collectiondatetime = as.POSIXct(collectiondatetime)) %>%
arrange(visitid, procedureid, source, collectiondatetime) %>%
group_by(visitid, procedureid, source, status) %>%
mutate(pair = row_number()) %>%
spread(status, collectiondatetime) %>%
ungroup() %>%
mutate(corrected = CAN < COMP)
as.data.frame(test)
# output
visitid procedureid source pair CAN COMP corrected
1 100 100.644 N 1 2017-08-23 10:31:00 2017-08-25 15:48:00 TRUE
2 100 100.644 RS 1 <NA> 2017-09-01 12:00:00 NA
3 100 100.644 RS 2 <NA> 2017-09-21 12:00:00 NA
4 200 100.644 N 1 <NA> 2017-09-01 14:00:00 NA
You can then get the sum of corrected
to get the number of times the procedure was performed after cancelling:
sum(test$corrected, na.rm = TRUE)
Note: this assumes that for a given visit, procedure, and source, there is never a COMP procedure followed by a CAN and then a COMP. If this situation does occur, you could first remove COMP procedures performed before CAN procedures for each visit/procedure/source.
Upvotes: 1