Reputation: 2675
I have a function that I use to calculate the difference between two dates. I have a dataset with over 400K records, and I am having difficulty with getting it to work at scale.
The function:
library(lubridate)
get_recency <- function(last_gift_date, refresh_date) {
last_gift_date <- as.Date(last_gift_date)
refresh_date <- as.Date(refresh_date)
case_when(
is.na(last_gift_date) ~ "ERROR",
last_gift_date > refresh_date ~ "ERROR",
last_gift_date %m+% months(12) >= refresh_date ~ "0-12",
last_gift_date %m+% months(24) >= refresh_date ~ "13-24",
last_gift_date %m+% months(36) >= refresh_date ~ "25-36",
last_gift_date %m+% months(48) >= refresh_date ~ "37-48",
last_gift_date %m+% months(60) >= refresh_date ~ "49-60",
last_gift_date %m+% months(72) >= refresh_date ~ "61-72",
last_gift_date %m+% months(84) >= refresh_date ~ "73-84",
TRUE ~ "85+")
}
If I pass a single date into the refresh_date
parameter it seems to execute fine, however when I pass an equivalent length vector in, it takes super long.
Any thoughts on how to improve this would be appreciated.
Example to run the code:
a<- c("2014-01-29", "2015-04-07", "2015-04-10")
b<- c(NA, "2014-01-29", "2015-04-07")
get_recency(b,a)
# OUTPUT
#[1] "ERROR" "13-24" "0-12"
UPDATE 2017-07-10
I took the advise of @Akrun and made use of the cut()
function. It has the benefit of being quicker and more succinct code. Results below.
get_recency <- function(last_gift_date, refresh_date) {
last_gift_date <- as.Date(last_gift_date)
refresh_date <- as.Date(refresh_date)
x <- (as.yearmon(refresh_date)-as.yearmon(last_gift_date))*12
x <- replace(x, is.na(x), -Inf)
cut(x, breaks = c(-Inf, -0.000001, 12, 24, 36, 48, 60, 72, 84, Inf),
labels = c("ERROR", "0-12", "13-24", "25-36", "37-48",
"49-60", "61-72", "73-84", "85+"),
include.lowest = T)
}
Upvotes: 1
Views: 83
Reputation: 11728
library(lubridate)
library(dplyr)
a <- c("2014-01-29", "2015-04-07", "2015-04-10", "2025-04-10")
b <- c(NA, "2014-01-29", "2015-04-07", "2015-04-07")
intervals <- 12 * 1:7
get_recency <- function(last_gift_date, refresh_date, intervals) {
last_gift_date <- as.Date(last_gift_date)
refresh_date <- as.Date(refresh_date)
intervals_chr <- c(
"ERROR",
paste(c(0, intervals[-length(intervals)] + 1), intervals, sep = "-"),
paste0(tail(intervals, 1) + 1, "+")
)
code <- sapply(c(0, intervals), function(n) {
last_gift_date %m+% months(n) < refresh_date
}) %>%
rowSums()
if_else(condition = is.na(code), true = "ERROR",
false = intervals_chr[code + 1])
}
get_recency(b, a, intervals)
[1] "ERROR" "13-24" "0-12" "85+"
Is this faster?
Upvotes: 3