Optimize performance for finding ranges in tibble/data.frame

Question

I have the following dataset (more than 30.000 rows in reality):

dt <- tibble::tibble(x=seq.Date(as.Date("2019-01-01"), as.Date("2019-01-10"), "days"), y=c(1,2,3,2,1,1,3,1,2,1))

A tibble: 10 x 2
   x              y
        
 1 2019-01-01     1
 2 2019-01-02     2
 3 2019-01-03     3
 4 2019-01-04     2
 5 2019-01-05     1
 6 2019-01-06     1
 7 2019-01-07     3
 8 2019-01-08     1
 9 2019-01-09     2
10 2019-01-10     1

I would like to identify date-ranges above a given threshold, for example y >= 2. The first date of appearance of the range should be saved in a separate tibble as "start", the last date of appearance as "end". Each "start"/"end"-combination should be saved in a separate line. The ranges are separated from each other by "y"-values that are below the threshold (e.g. y < 2). The result should look like the following table:

result <- tibble::tibble(start=as.Date(c("2019-01-02", "2019-01-07", "2019-01-09")), end=as.Date(c("2019-01-04", "2019-01-07", "2019-01-09")))

A tibble: 3 x 2
   start      end       
            
 1 2019-01-02 2019-01-04
 2 2019-01-07 2019-01-07
 3 2019-01-09 2019-01-09

My current solution is using for-loops. However, this leads to very slow execution.

Any ideas how to improve the performance and solve this issue more elegant?

Thank you for your ideas in advance.

Cole · Accepted Answer

This should work and is based on the idea that the diff of the dates minus 1 will have a cumulative sum equal to each other. That's why we can group by it.

dt%>%
  filter(y >= 2)%>%
  group_by(grouping = cumsum(c(0, diff.Date(x) - 1)))%>%
  summarize(start = min(x)
            , end = max(x))%>%
  select(-grouping)

# A tibble: 3 x 2
  start      end       
           
1 2019-01-02 2019-01-04
2 2019-01-07 2019-01-07
3 2019-01-09 2019-01-09

Optimize performance for finding ranges in tibble/data.frame

Answers (2)

Related Questions