Irrational
Irrational

Reputation: 1

Filter data.table between multiple date ranges in R

since a few days I'm working with R and now I have my first problem. I am a complete beginner in R.

I want to check the data of dt between the **start date **and the end date and then write the start date and end date each as a column in dt.

Then I want to order the data in descending order by datetime within the group (ID, start date, end date). For this I would then use the following:

dt[ , valRank:=rank(-datetime), by= list(id, start date, end date)]

My data and search parameters look like this:

library(data.table)

# data.table
dt <- data.table(ID = 1:5
                 , datetime = seq(from = as.Date("01.01.2020", format = "%d.%m.%Y"), by ="month", length.out = 20)
                 , var = rnorm(20))

# Search parameters
start <- c(as.Date("01.01.2020", format = "%d.%m.%Y"), as.Date("01.07.2020", format = "%d.%m.%Y"))
end <- c(as.Date("01.03.2020", format = "%d.%m.%Y"), as.Date("01.05.2021", format = "%d.%m.%Y"))
range <- data.table(start = start, end = end)

Now I have tried several filter versions and wonder why between() does not work but inrange() does?

# Version 0: Why it doesnt work with "between"
filtered0 <- dt[between(x= datetime, lower = range$start, upper = range$end, incbounds=TRUE)]
filtered0
ID   datetime        var
1:  1 2020-01-01 0.50298317
2:  3 2020-03-01 2.06969108
3:  3 2020-08-01 0.62906692
4:  5 2020-10-01 0.75532708
5:  2 2020-12-01 0.59471124
6:  4 2021-02-01 1.54694582
7:  1 2021-04-01 0.07542291
# Version 1: 
filtered1 <- setDT(dt)[datetime %inrange% range]
filtered1
# Version 2:
filtered2<- dt[datetime %inrange% range]
# Version 3:
filtered3 <- setDT(dt)[inrange(x = datetime, lower = range$start, upper = range$end, incbounds=TRUE)]

Now I tried for the first time to write the start and end date into the data.table and wonder why the column "start" now contains numbers although the format as.Date(..., , format = "%d.%m.%Y") does not work.

filtered4 <- setDT(dt)[, `:=` (start = ifelse(datetime %inrange% range, range$start[1], range$start[1]) 
                               , end = ifelse(datetime %inrange% range, range$end[1], range$end[1] ) )]

   ID   datetime         var start   end
 1:  1 2020-01-01  0.50298317 18262 18322
 2:  2 2020-02-01 -0.26148738 18262 18322
 3:  3 2020-03-01  2.06969108 18262 18322
 4:  4 2020-04-01  1.99797128 18262 18322
 5:  5 2020-05-01 -0.93218939 18262 18322
 6:  1 2020-06-01  0.35842662 18262 18322
 7:  2 2020-07-01 -1.07181184 18262 18322
 8:  3 2020-08-01  0.62906692 18262 18322
 9:  4 2020-09-01  1.34515866 18262 18322
10:  5 2020-10-01  0.75532708 18262 18322
11:  1 2020-11-01  0.89861745 18262 18322
12:  2 2020-12-01  0.59471124 18262 18322
13:  3 2021-01-01 -0.68267021 18262 18322
14:  4 2021-02-01  1.54694582 18262 18322
15:  5 2021-03-01 -0.05550008 18262 18322
16:  1 2021-04-01  0.07542291 18262 18322
17:  2 2021-05-01  0.34271740 18262 18322
18:  3 2021-06-01  1.04142837 18262 18322
19:  4 2021-07-01  0.75517388 18262 18322
20:  5 2021-08-01  0.50562581 18262 18322

The above code is of course not correct, that datetime is between start and end date, because I specify the first element of the search parameters hard. This gave me the expected result, but only for the first daterange. I wonder if this doesn't work with inrange() and I should use dplyr::between with rowwise() instead? But how can I do this, because the following does not work either.

filtered5 <- setDT(dt)[, `:=` (start = range[, start], ende = range[, end])]

The last step is to arrange them. For this I use the following. Is this the right wayo or should I use RANK(), DENSE_RANK(), ROW_NUMBER(). If all three functions are similar to SQL?

dt[ , valRank:=rank(-datetime), by= list(id, start date, end date)]

Any help would be very welcome. Sorry if it is written a bit jumbled, but these are my first steps in R and this is also my first question in this forum.

I wish you all a nice evening. Best regards

Upvotes: 0

Views: 632

Answers (2)

r2evans
r2evans

Reputation: 160952

This seems like a non-equi join.

library(data.table)
range[dt, on = .(start <= datetime, end >= datetime)]
#          start        end    ID          var
#         <Date>     <Date> <int>        <num>
#  1: 2020-01-01 2020-01-01     1 -0.535432498
#  2: 2020-02-01 2020-02-01     2 -0.713242072
#  3: 2020-03-01 2020-03-01     3 -0.246650715
#  4: 2020-04-01 2020-04-01     4  0.598074968
#  5: 2020-05-01 2020-05-01     5  1.321493645
#  6: 2020-06-01 2020-06-01     1  0.431644671
#  7: 2020-07-01 2020-07-01     2 -0.035380524
#  8: 2020-08-01 2020-08-01     3 -0.764943081
#  9: 2020-09-01 2020-09-01     4  1.123448021
# 10: 2020-10-01 2020-10-01     5  0.059585556
# 11: 2020-11-01 2020-11-01     1  0.987138106
# 12: 2020-12-01 2020-12-01     2 -0.229781596
# 13: 2021-01-01 2021-01-01     3 -1.446177021
# 14: 2021-02-01 2021-02-01     4 -0.364705735
# 15: 2021-03-01 2021-03-01     5  0.007079030
# 16: 2021-04-01 2021-04-01     1 -1.967512640
# 17: 2021-05-01 2021-05-01     2 -0.009452937
# 18: 2021-06-01 2021-06-01     3 -1.960379639
# 19: 2021-07-01 2021-07-01     4  0.704577974
# 20: 2021-08-01 2021-08-01     5  0.805717998
#          start        end    ID          var

The way that data.table preserves columns can be confusing or frustrating, noting that what appears to be range$start and end are more reminiscent of copies of the original dt$datetime. When doing things like this, for clarity I often copy columns into throw-away columns for the join.

out <- range[,c("s", "e") := .(start, end)
  ][dt, on = .(s <= datetime, e >= datetime)
  ][, valrank := rank(-xtfrm(s)), by = .(ID, start, end)
  ][, e := NULL ] |>
  setnames("s", "datetime")
out
#          start        end   datetime    ID          var valrank
#         <Date>     <Date>     <Date> <int>        <num>   <num>
#  1: 2020-01-01 2020-03-01 2020-01-01     1 -0.535432498       1
#  2: 2020-01-01 2020-03-01 2020-02-01     2 -0.713242072       1
#  3: 2020-01-01 2020-03-01 2020-03-01     3 -0.246650715       1
#  4:       <NA>       <NA> 2020-04-01     4  0.598074968       2
#  5:       <NA>       <NA> 2020-05-01     5  1.321493645       2
#  6:       <NA>       <NA> 2020-06-01     1  0.431644671       1
#  7: 2020-07-01 2021-05-01 2020-07-01     2 -0.035380524       3
#  8: 2020-07-01 2021-05-01 2020-08-01     3 -0.764943081       2
#  9: 2020-07-01 2021-05-01 2020-09-01     4  1.123448021       2
# 10: 2020-07-01 2021-05-01 2020-10-01     5  0.059585556       2
# 11: 2020-07-01 2021-05-01 2020-11-01     1  0.987138106       2
# 12: 2020-07-01 2021-05-01 2020-12-01     2 -0.229781596       2
# 13: 2020-07-01 2021-05-01 2021-01-01     3 -1.446177021       1
# 14: 2020-07-01 2021-05-01 2021-02-01     4 -0.364705735       1
# 15: 2020-07-01 2021-05-01 2021-03-01     5  0.007079030       1
# 16: 2020-07-01 2021-05-01 2021-04-01     1 -1.967512640       1
# 17: 2020-07-01 2021-05-01 2021-05-01     2 -0.009452937       1
# 18:       <NA>       <NA> 2021-06-01     3 -1.960379639       1
# 19:       <NA>       <NA> 2021-07-01     4  0.704577974       1
# 20:       <NA>       <NA> 2021-08-01     5  0.805717998       1
#          start        end   datetime    ID          var valrank

Upvotes: 0

DocBuckets
DocBuckets

Reputation: 261

Welcome. In general, it is useful to show us explicitly what your desired output is. Also, showing your inputs in expanded form is useful so highly skilled users don't have to run the code but rather simply type the answer here.

Here is a full working example of what I think you want:

Only required package:

library(dplyr)

Input dataset dt:

dt <-
  tibble(
    ID = rep(1:5, 4),
    datetime = seq(
      from = as.Date("01.01.2020", format = "%d.%m.%Y"),
      by = "month",
      length.out = 20
    )  ,
    var = rnorm(20)
  )

ranges dataset (bad idea to use "range" as a var name, as it is also a function)

ranges <-
  tibble(start = c(
    as.Date("01.01.2020", format = "%d.%m.%Y"),
    as.Date("01.07.2020", format = "%d.%m.%Y")
  ),
  end = c(
    as.Date("01.03.2020", format = "%d.%m.%Y"),
    as.Date("01.05.2021", format = "%d.%m.%Y")
  ))

The solution:

"Take dt, cross it exhastively with all rows of ranges, then filter to include only those rows where datetime is between start and end.

dt %>% 
  crossing(ranges) %>% 
  filter(datetime > start & datetime < end)

Result:

ID  datetime    var     start       end
1   2020-11-01  1.221   2020-07-01  2021-05-01
1   2021-04-01  0.021   2020-07-01  2021-05-01
2   2020-02-01  -0.18   2020-01-01  2020-03-01
2   2020-12-01  1.585   2020-07-01  2021-05-01
3   2020-08-01  -0.55   2020-07-01  2021-05-01
3   2021-01-01  0.554   2020-07-01  2021-05-01
4   2020-09-01  -0.66   2020-07-01  2021-05-01
4   2021-02-01  1.504   2020-07-01  2021-05-01
5   2020-10-01  0.659   2020-07-01  2021-05-01
5   2021-03-01  0.202   2020-07-01  2021-05-01

You can use arrange() as another piped line if you want to sort it another way e.g.:

dt %>% 
  crossing(range) %>% 
  filter(datetime > start & datetime < end) %>% 
  arrange(datetime, ID, start)

Upvotes: 0

Related Questions