Reputation: 21

Count the number of rows within a certain time range based on each row in R (tidyverse)

I want to count the number of rows within a certain time range based on each row after grouping by id. For instance, let us say a 1-month window around each datetime entry in the column "cleaned_date".

head(data$cleaned_date)

[1] "2004-10-11 CDT" "2008-09-10 CDT" "2011-10-25 CDT" "2011-12-31 CST"

The dates are in POSIXct format.

For the first entry, I need to count the number of rows within the time from 2004-09-11 to 2004-11-11, for the second entry, count the number of rows within the time from 2008-08-10 to 2008-10-10, so on and so forth.

I used roughly the following code

data %>% group_by(id) %>% filter(cleaned_date %within% interval(cleaned_date - 24 * 60 * 60 * 30, cleaned_date + 24 * 60 * 60 * 30)) %>% mutate(counts = n())

But it does not seem to work and I got counts as an empty column. Any help would be appreciated, thanks!

A reproducible example can be the following:

The input is

  cleaned_date id
1   2008-09-11  A
2   2008-09-10  B
3   2008-09-30  B
4   2011-10-25  A
5   2011-11-14  A

And I want the output to be

  cleaned_date id counts
1   2008-09-11  A      1
2   2008-09-10  B      2
3   2008-09-30  B      2
4   2011-10-25  A      2
5   2011-11-14  A      2

For the first entry, I want to count the rows in the timeframe 2008-08-11 to 2008-10-11, the second entry seems to satisfy but we need to group by "id", so it does not count. For the second entry I want to count the rows in the timeframe 2008-08-10 to 2008-10-10, rows 2 and 3 satisfy, so the counts is 2. For the third entry I want to count the rows in the timeframe 2008-08-30 to 2008-10-30, rows 2 and 3 satisfy again, so on and so forth.

Note that the actual dataset I would like to operate on has millions of rows, so it might be more efficient to use tidyverse rather than base R.

Upvotes: 0

Answers (2)

Snipeskies

Reputation: 288

Perhaps not the most elegant solution.

# input data. Dates as character vector
input = data.frame(
    cleaned_date = c("2008-09-11", "2008-09-10", "2008-09-30", "2011-10-25", "2011-11-14"), 
    id = c("A", "B", "B", "A", "A")
    )

# function to create a date window n months around specified date
window <- function(x, n = 1){
    x <- rep(as.POSIXlt(x),2)
    x[1]$mon <- x[1]$mon - n
    x[2]$mon <- x[2]$mon + n
    return(format(seq(from = x[1], to = x[2], by = "day"), format="%Y-%m-%d"))
}

# find counts for each row
input$counts <- unlist(lapply(1:nrow(input), function(x){
    length(which((input$cleaned_date %in% window(input$cleaned_date[x])) & input$id == input$id[x]))
    }))

input

  cleaned_date id counts
1   2008-09-11  A      1
2   2008-09-10  B      2
3   2008-09-30  B      2
4   2011-10-25  A      2
5   2011-11-14  A      2

Edit for large datasets:

# dummy dataset with 1,000,000 rows
years <- c(2000:2020)
months <- c(1:12)
days <- c(1:20)
n <- 1000000
dates <- paste(sample(years, size = n, replace = T), sample(months, size = n, replace = T), sample(days, size = n, replace = T), sep = "-")
groups <- sample(c("A","B","C"), size = n, replace = T)
input <- data.frame(
    cleaned_date = dates,
    id = groups
)
input$cleaned_date <- format(as.POSIXlt(input$cleaned_date), format="%Y-%m-%d")

# optional, sort data by date for small boost in performance
input <- input[order(input$cleaned_date),]

counts <- NULL
#pb <- progress::progress_bar$new(total = length(unique(input$cleaned_date)))
t1 <- Sys.time()
# split up vectorization for each unique date.
for(date in unique(input$cleaned_date)){
    #pb$tick()
    w <- window(date)
    tmp <- input[which(input$cleaned_date %in% w),]
    tmp_counts <- unlist(lapply(which(tmp$cleaned_date == date), function(x){
        length(which(tmp$id == tmp$id[x]))
    }))
    counts <- c(counts, tmp_counts)
}
# add counts to dataset
input$counts <- counts 
# optional, re-order data to original format
input <- input[order(as.numeric(rownames(input))),]
print(Sys.time() - t1)

Time difference of 3.247204 mins

If you want to go faster, you can run the loop in parallel

library(foreach)
library(doParallel)

cores=detectCores()
cl <- makeCluster(cores[1]-1)
registerDoParallel(cl)

dates = unique(input$cleaned_date)
t1 <- Sys.time()
counts <- foreach(i=1:length(dates), .combine= "c") %dopar% {
    w <- window(dates[i])
    tmp <- input[which(input$cleaned_date %in% w),]
    tmp_counts <- unlist(lapply(which(tmp$cleaned_date == dates[i]), function(x){
        length(which(tmp$id == tmp$id[x]))
    }))
    tmp_counts
}
stopCluster(cl)
input$counts <- counts
input <- input[order(as.numeric(rownames(input))),]
print(Sys.time() - t1)

Time difference of 37.37211 secs

Note, I'm running this on a MacBook Pro with a 2.3 GHz Quad-Core Intel Core i7 and 16 GB of RAM.

Upvotes: 1

Matt

Reputation: 7385

It is still hard to determine exactly what you're trying to accomplish, but this will at least get you counts for a specified date range:

df %>% 
  group_by(id) %>% 
  filter(cleaned_date >= "2008-08-11" & cleaned_date <= "2008-10-11") %>% 
  mutate(counts = n())

Will give us:

  cleaned_date id    counts
  <date>       <chr>  <int>
1 2008-09-11   A          1
2 2008-09-10   B          2
3 2008-09-30   B          2

Upvotes: 0

Count the number of rows within a certain time range based on each row in R (tidyverse)

Answers (2)

Related Questions