Reputation: 1
since a few days I'm working with R and now I have my first problem. I am a complete beginner in R.
I want to check the data of dt between the **start date **and the end date and then write the start date and end date each as a column in dt.
Then I want to order the data in descending order by datetime within the group (ID, start date, end date). For this I would then use the following:
dt[ , valRank:=rank(-datetime), by= list(id, start date, end date)]
My data and search parameters look like this:
library(data.table)
# data.table
dt <- data.table(ID = 1:5
, datetime = seq(from = as.Date("01.01.2020", format = "%d.%m.%Y"), by ="month", length.out = 20)
, var = rnorm(20))
# Search parameters
start <- c(as.Date("01.01.2020", format = "%d.%m.%Y"), as.Date("01.07.2020", format = "%d.%m.%Y"))
end <- c(as.Date("01.03.2020", format = "%d.%m.%Y"), as.Date("01.05.2021", format = "%d.%m.%Y"))
range <- data.table(start = start, end = end)
Now I have tried several filter versions and wonder why between() does not work but inrange() does?
# Version 0: Why it doesnt work with "between"
filtered0 <- dt[between(x= datetime, lower = range$start, upper = range$end, incbounds=TRUE)]
filtered0
ID datetime var
1: 1 2020-01-01 0.50298317
2: 3 2020-03-01 2.06969108
3: 3 2020-08-01 0.62906692
4: 5 2020-10-01 0.75532708
5: 2 2020-12-01 0.59471124
6: 4 2021-02-01 1.54694582
7: 1 2021-04-01 0.07542291
# Version 1:
filtered1 <- setDT(dt)[datetime %inrange% range]
filtered1
# Version 2:
filtered2<- dt[datetime %inrange% range]
# Version 3:
filtered3 <- setDT(dt)[inrange(x = datetime, lower = range$start, upper = range$end, incbounds=TRUE)]
Now I tried for the first time to write the start and end date into the data.table and wonder why the column "start" now contains numbers although the format as.Date(..., , format = "%d.%m.%Y") does not work.
filtered4 <- setDT(dt)[, `:=` (start = ifelse(datetime %inrange% range, range$start[1], range$start[1])
, end = ifelse(datetime %inrange% range, range$end[1], range$end[1] ) )]
ID datetime var start end
1: 1 2020-01-01 0.50298317 18262 18322
2: 2 2020-02-01 -0.26148738 18262 18322
3: 3 2020-03-01 2.06969108 18262 18322
4: 4 2020-04-01 1.99797128 18262 18322
5: 5 2020-05-01 -0.93218939 18262 18322
6: 1 2020-06-01 0.35842662 18262 18322
7: 2 2020-07-01 -1.07181184 18262 18322
8: 3 2020-08-01 0.62906692 18262 18322
9: 4 2020-09-01 1.34515866 18262 18322
10: 5 2020-10-01 0.75532708 18262 18322
11: 1 2020-11-01 0.89861745 18262 18322
12: 2 2020-12-01 0.59471124 18262 18322
13: 3 2021-01-01 -0.68267021 18262 18322
14: 4 2021-02-01 1.54694582 18262 18322
15: 5 2021-03-01 -0.05550008 18262 18322
16: 1 2021-04-01 0.07542291 18262 18322
17: 2 2021-05-01 0.34271740 18262 18322
18: 3 2021-06-01 1.04142837 18262 18322
19: 4 2021-07-01 0.75517388 18262 18322
20: 5 2021-08-01 0.50562581 18262 18322
The above code is of course not correct, that datetime is between start and end date, because I specify the first element of the search parameters hard. This gave me the expected result, but only for the first daterange. I wonder if this doesn't work with inrange() and I should use dplyr::between with rowwise() instead? But how can I do this, because the following does not work either.
filtered5 <- setDT(dt)[, `:=` (start = range[, start], ende = range[, end])]
The last step is to arrange them. For this I use the following. Is this the right wayo or should I use RANK(), DENSE_RANK(), ROW_NUMBER(). If all three functions are similar to SQL?
dt[ , valRank:=rank(-datetime), by= list(id, start date, end date)]
Any help would be very welcome. Sorry if it is written a bit jumbled, but these are my first steps in R and this is also my first question in this forum.
I wish you all a nice evening. Best regards
Upvotes: 0
Views: 632
Reputation: 160952
This seems like a non-equi join.
library(data.table)
range[dt, on = .(start <= datetime, end >= datetime)]
# start end ID var
# <Date> <Date> <int> <num>
# 1: 2020-01-01 2020-01-01 1 -0.535432498
# 2: 2020-02-01 2020-02-01 2 -0.713242072
# 3: 2020-03-01 2020-03-01 3 -0.246650715
# 4: 2020-04-01 2020-04-01 4 0.598074968
# 5: 2020-05-01 2020-05-01 5 1.321493645
# 6: 2020-06-01 2020-06-01 1 0.431644671
# 7: 2020-07-01 2020-07-01 2 -0.035380524
# 8: 2020-08-01 2020-08-01 3 -0.764943081
# 9: 2020-09-01 2020-09-01 4 1.123448021
# 10: 2020-10-01 2020-10-01 5 0.059585556
# 11: 2020-11-01 2020-11-01 1 0.987138106
# 12: 2020-12-01 2020-12-01 2 -0.229781596
# 13: 2021-01-01 2021-01-01 3 -1.446177021
# 14: 2021-02-01 2021-02-01 4 -0.364705735
# 15: 2021-03-01 2021-03-01 5 0.007079030
# 16: 2021-04-01 2021-04-01 1 -1.967512640
# 17: 2021-05-01 2021-05-01 2 -0.009452937
# 18: 2021-06-01 2021-06-01 3 -1.960379639
# 19: 2021-07-01 2021-07-01 4 0.704577974
# 20: 2021-08-01 2021-08-01 5 0.805717998
# start end ID var
The way that data.table
preserves columns can be confusing or frustrating, noting that what appears to be range$start
and end are more reminiscent of copies of the original dt$datetime
. When doing things like this, for clarity I often copy columns into throw-away columns for the join.
out <- range[,c("s", "e") := .(start, end)
][dt, on = .(s <= datetime, e >= datetime)
][, valrank := rank(-xtfrm(s)), by = .(ID, start, end)
][, e := NULL ] |>
setnames("s", "datetime")
out
# start end datetime ID var valrank
# <Date> <Date> <Date> <int> <num> <num>
# 1: 2020-01-01 2020-03-01 2020-01-01 1 -0.535432498 1
# 2: 2020-01-01 2020-03-01 2020-02-01 2 -0.713242072 1
# 3: 2020-01-01 2020-03-01 2020-03-01 3 -0.246650715 1
# 4: <NA> <NA> 2020-04-01 4 0.598074968 2
# 5: <NA> <NA> 2020-05-01 5 1.321493645 2
# 6: <NA> <NA> 2020-06-01 1 0.431644671 1
# 7: 2020-07-01 2021-05-01 2020-07-01 2 -0.035380524 3
# 8: 2020-07-01 2021-05-01 2020-08-01 3 -0.764943081 2
# 9: 2020-07-01 2021-05-01 2020-09-01 4 1.123448021 2
# 10: 2020-07-01 2021-05-01 2020-10-01 5 0.059585556 2
# 11: 2020-07-01 2021-05-01 2020-11-01 1 0.987138106 2
# 12: 2020-07-01 2021-05-01 2020-12-01 2 -0.229781596 2
# 13: 2020-07-01 2021-05-01 2021-01-01 3 -1.446177021 1
# 14: 2020-07-01 2021-05-01 2021-02-01 4 -0.364705735 1
# 15: 2020-07-01 2021-05-01 2021-03-01 5 0.007079030 1
# 16: 2020-07-01 2021-05-01 2021-04-01 1 -1.967512640 1
# 17: 2020-07-01 2021-05-01 2021-05-01 2 -0.009452937 1
# 18: <NA> <NA> 2021-06-01 3 -1.960379639 1
# 19: <NA> <NA> 2021-07-01 4 0.704577974 1
# 20: <NA> <NA> 2021-08-01 5 0.805717998 1
# start end datetime ID var valrank
Upvotes: 0
Reputation: 261
Welcome. In general, it is useful to show us explicitly what your desired output is. Also, showing your inputs in expanded form is useful so highly skilled users don't have to run the code but rather simply type the answer here.
Here is a full working example of what I think you want:
Only required package:
library(dplyr)
Input dataset dt
:
dt <-
tibble(
ID = rep(1:5, 4),
datetime = seq(
from = as.Date("01.01.2020", format = "%d.%m.%Y"),
by = "month",
length.out = 20
) ,
var = rnorm(20)
)
ranges
dataset (bad idea to use "range" as a var name, as it is also a function)
ranges <-
tibble(start = c(
as.Date("01.01.2020", format = "%d.%m.%Y"),
as.Date("01.07.2020", format = "%d.%m.%Y")
),
end = c(
as.Date("01.03.2020", format = "%d.%m.%Y"),
as.Date("01.05.2021", format = "%d.%m.%Y")
))
The solution:
"Take dt
, cross it exhastively with all rows of ranges
, then filter to include only those rows where datetime
is between start
and end
.
dt %>%
crossing(ranges) %>%
filter(datetime > start & datetime < end)
Result:
ID datetime var start end
1 2020-11-01 1.221 2020-07-01 2021-05-01
1 2021-04-01 0.021 2020-07-01 2021-05-01
2 2020-02-01 -0.18 2020-01-01 2020-03-01
2 2020-12-01 1.585 2020-07-01 2021-05-01
3 2020-08-01 -0.55 2020-07-01 2021-05-01
3 2021-01-01 0.554 2020-07-01 2021-05-01
4 2020-09-01 -0.66 2020-07-01 2021-05-01
4 2021-02-01 1.504 2020-07-01 2021-05-01
5 2020-10-01 0.659 2020-07-01 2021-05-01
5 2021-03-01 0.202 2020-07-01 2021-05-01
You can use arrange()
as another piped line if you want to sort it another way e.g.:
dt %>%
crossing(range) %>%
filter(datetime > start & datetime < end) %>%
arrange(datetime, ID, start)
Upvotes: 0