Reputation: 21
I want to count the number of rows within a certain time range based on each row after grouping by id. For instance, let us say a 1-month window around each datetime entry in the column "cleaned_date".
head(data$cleaned_date)
[1] "2004-10-11 CDT" "2008-09-10 CDT" "2011-10-25 CDT" "2011-12-31 CST"
The dates are in POSIXct format.
For the first entry, I need to count the number of rows within the time from 2004-09-11 to 2004-11-11, for the second entry, count the number of rows within the time from 2008-08-10 to 2008-10-10, so on and so forth.
I used roughly the following code
data %>% group_by(id) %>% filter(cleaned_date %within% interval(cleaned_date - 24 * 60 * 60 * 30, cleaned_date + 24 * 60 * 60 * 30)) %>% mutate(counts = n())
But it does not seem to work and I got counts as an empty column. Any help would be appreciated, thanks!
A reproducible example can be the following:
The input is
cleaned_date id
1 2008-09-11 A
2 2008-09-10 B
3 2008-09-30 B
4 2011-10-25 A
5 2011-11-14 A
And I want the output to be
cleaned_date id counts
1 2008-09-11 A 1
2 2008-09-10 B 2
3 2008-09-30 B 2
4 2011-10-25 A 2
5 2011-11-14 A 2
For the first entry, I want to count the rows in the timeframe 2008-08-11 to 2008-10-11, the second entry seems to satisfy but we need to group by "id", so it does not count. For the second entry I want to count the rows in the timeframe 2008-08-10 to 2008-10-10, rows 2 and 3 satisfy, so the counts is 2. For the third entry I want to count the rows in the timeframe 2008-08-30 to 2008-10-30, rows 2 and 3 satisfy again, so on and so forth.
Note that the actual dataset I would like to operate on has millions of rows, so it might be more efficient to use tidyverse rather than base R.
Upvotes: 0
Views: 1347
Reputation: 288
Perhaps not the most elegant solution.
# input data. Dates as character vector
input = data.frame(
cleaned_date = c("2008-09-11", "2008-09-10", "2008-09-30", "2011-10-25", "2011-11-14"),
id = c("A", "B", "B", "A", "A")
)
# function to create a date window n months around specified date
window <- function(x, n = 1){
x <- rep(as.POSIXlt(x),2)
x[1]$mon <- x[1]$mon - n
x[2]$mon <- x[2]$mon + n
return(format(seq(from = x[1], to = x[2], by = "day"), format="%Y-%m-%d"))
}
# find counts for each row
input$counts <- unlist(lapply(1:nrow(input), function(x){
length(which((input$cleaned_date %in% window(input$cleaned_date[x])) & input$id == input$id[x]))
}))
input
cleaned_date id counts
1 2008-09-11 A 1
2 2008-09-10 B 2
3 2008-09-30 B 2
4 2011-10-25 A 2
5 2011-11-14 A 2
Edit for large datasets:
# dummy dataset with 1,000,000 rows
years <- c(2000:2020)
months <- c(1:12)
days <- c(1:20)
n <- 1000000
dates <- paste(sample(years, size = n, replace = T), sample(months, size = n, replace = T), sample(days, size = n, replace = T), sep = "-")
groups <- sample(c("A","B","C"), size = n, replace = T)
input <- data.frame(
cleaned_date = dates,
id = groups
)
input$cleaned_date <- format(as.POSIXlt(input$cleaned_date), format="%Y-%m-%d")
# optional, sort data by date for small boost in performance
input <- input[order(input$cleaned_date),]
counts <- NULL
#pb <- progress::progress_bar$new(total = length(unique(input$cleaned_date)))
t1 <- Sys.time()
# split up vectorization for each unique date.
for(date in unique(input$cleaned_date)){
#pb$tick()
w <- window(date)
tmp <- input[which(input$cleaned_date %in% w),]
tmp_counts <- unlist(lapply(which(tmp$cleaned_date == date), function(x){
length(which(tmp$id == tmp$id[x]))
}))
counts <- c(counts, tmp_counts)
}
# add counts to dataset
input$counts <- counts
# optional, re-order data to original format
input <- input[order(as.numeric(rownames(input))),]
print(Sys.time() - t1)
Time difference of 3.247204 mins
If you want to go faster, you can run the loop in parallel
library(foreach)
library(doParallel)
cores=detectCores()
cl <- makeCluster(cores[1]-1)
registerDoParallel(cl)
dates = unique(input$cleaned_date)
t1 <- Sys.time()
counts <- foreach(i=1:length(dates), .combine= "c") %dopar% {
w <- window(dates[i])
tmp <- input[which(input$cleaned_date %in% w),]
tmp_counts <- unlist(lapply(which(tmp$cleaned_date == dates[i]), function(x){
length(which(tmp$id == tmp$id[x]))
}))
tmp_counts
}
stopCluster(cl)
input$counts <- counts
input <- input[order(as.numeric(rownames(input))),]
print(Sys.time() - t1)
Time difference of 37.37211 secs
Note, I'm running this on a MacBook Pro with a 2.3 GHz Quad-Core Intel Core i7 and 16 GB of RAM.
Upvotes: 1
Reputation: 7385
It is still hard to determine exactly what you're trying to accomplish, but this will at least get you counts for a specified date range:
df %>%
group_by(id) %>%
filter(cleaned_date >= "2008-08-11" & cleaned_date <= "2008-10-11") %>%
mutate(counts = n())
Will give us:
cleaned_date id counts
<date> <chr> <int>
1 2008-09-11 A 1
2 2008-09-10 B 2
3 2008-09-30 B 2
Upvotes: 0