Reputation: 333
I have a simple dataframe containing three columns: an id, a date and a value. Now, I want to calculate a new value, newValue, based on these three columns following this procedure: for each row, take all rows with the same id whose date lies in the window [date, date + 3), i.e. the current date and the next two, multiply their (value + 1) terms together and subtract 1.
The simple example below with made-up numbers does the computation:
df <- data.frame("id"    = rep(1:10, 5),
                 "date"  = c(rep(2000, 10), rep(2001, 10), rep(2002, 10), rep(2003, 10), rep(2004, 10)),
                 "value" = c(rep(1, 10), rep(2, 10), rep(3, 10), rep(4, 10), rep(5, 10)))

df$newValue <- 1  # initialize
for (idx in 1:dim(df)[1]) {
  id    <- df[idx, "id"]
  lower <- df[idx, "date"]
  upper <- lower + 3
  # product of (value + 1) over this id's rows with date in [lower, upper), minus 1
  df[idx, "newValue"] <- prod(df[(df$id == id) & (df$date >= lower) & (df$date < upper), ]$value + 1) - 1
}
This gives me the output (I have annotated it for simplicity):
id date value newValue
1 1 2000 1 23 (= (1+1) * (2+1) * (3+1) - 1 = 23)
2 2 2000 1 23 (= (1+1) * (2+1) * (3+1) - 1 = 23)
....
12 2 2001 2 59 (= (2+1) * (3+1) * (4+1) - 1 = 59)
....
22 2 2002 3 119 (= (3+1) * (4+1) * (5+1) - 1 = 119)
....
However, my final dataframe has over 1 million rows, so the code above is very time-consuming and inefficient.
Is there a way to speed it up, perhaps using data.table? Note that each id may have a different number of rows, which is why I explicitly subset.
Upvotes: 0
Views: 268
Reputation: 26238
I think this tidyverse solution can also do the job.
To address the missing year/date problem, I have deleted two rows from id == 2 (dates 2002 and 2003; a short snippet after the printed sample below shows how). Sample data used:
> dput(df)
structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L,
6L, 6L, 7L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L,
9L, 10L, 10L, 10L, 10L, 10L), date = c(2000, 2001, 2002, 2003,
2004, 2000, 2001, 2004, 2000, 2001, 2002, 2003, 2004, 2000, 2001,
2002, 2003, 2004, 2000, 2001, 2002, 2003, 2004, 2000, 2001, 2002,
2003, 2004, 2000, 2001, 2002, 2003, 2004, 2000, 2001, 2002, 2003,
2004, 2000, 2001, 2002, 2003, 2004, 2000, 2001, 2002, 2003, 2004
), value = c(1, 2, 3, 4, 5, 1, 2, 5, 1, 2, 3, 4, 5, 1, 2, 3,
4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4,
5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5)), class = "data.frame", row.names = c(NA,
-48L))
df
# A tibble: 48 x 3
id date value
<int> <dbl> <dbl>
1 1 2000 1
2 1 2001 2
3 1 2002 3
4 1 2003 4
5 1 2004 5
6 2 2000 1
7 2 2001 2
8 2 2004 5
9 3 2000 1
10 3 2001 2
# ... with 38 more rows
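For reference, a minimal sketch of how such a sample could be built from the question's original three-column df (the two dropped rows, id == 2 in 2002 and 2003, are read off the dput above):

# sort by id/date, drop two rows for id 2 so that some years are missing,
# and reset the row names to match the dput above
df <- df[order(df$id, df$date), c("id", "date", "value")]
df <- df[!(df$id == 2 & df$date %in% c(2002, 2003)), ]
row.names(df) <- NULL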
Now the tidyverse part. Missing years are filled in with value = 0, which contributes a neutral factor of (0 + 1) = 1 to the product, so the padding does not distort the result:
library(tidyverse)
df %>%
  arrange(id, date) %>%
  group_by(id) %>%
  # fill in missing years within each id with value = 0
  complete(date = min(date):max(date), fill = list(value = 0)) %>%
  # product of (value + 1) over the current year and the next two, minus 1
  mutate(new_val = (value + 1) * (lead(value, default = 0) + 1) *
                   (lead(value, n = 2, default = 0) + 1) - 1) %>%
  ungroup()
# A tibble: 50 x 4
id date value new_val
<int> <dbl> <dbl> <dbl>
1 1 2000 1 23
2 1 2001 2 59
3 1 2002 3 119
4 1 2003 4 29
5 1 2004 5 5
6 2 2000 1 5
7 2 2001 2 2
8 2 2002 0 5
9 2 2003 0 5
10 2 2004 5 5
# ... with 40 more rows
EDIT: Moreover, if the filled-in extra years have to be removed, a right_join back onto the original df keeps only the id/date/value combinations that were there to begin with:
df %>%
  arrange(id, date) %>%
  group_by(id) %>%
  complete(date = min(date):max(date), fill = list(value = 0)) %>%
  mutate(new_val = (value + 1) * (lead(value, default = 0) + 1) *
                   (lead(value, n = 2, default = 0) + 1) - 1) %>%
  ungroup() %>%
  right_join(df, by = c("id", "date", "value"))
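An alternative sketch for dropping the padded rows, if you prefer not to join on value as well, is a semi_join on id and date; it should return the same rows, though possibly in a different order:

df %>%
  arrange(id, date) %>%
  group_by(id) %>%
  complete(date = min(date):max(date), fill = list(value = 0)) %>%
  mutate(new_val = (value + 1) * (lead(value, default = 0) + 1) *
                   (lead(value, n = 2, default = 0) + 1) - 1) %>%
  ungroup() %>%
  # keep only id/date combinations that exist in the original df
  semi_join(df, by = c("id", "date"))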
Upvotes: 1
Reputation: 5232
library(data.table)
library(purrr)
# for each row, multiply (value + 1) over this id's rows whose date lies in
# [date, date + 2] (inclusive, i.e. date >= lower & date < lower + 3), then subtract 1
setDT(df)[, newValue := map_dbl(date, ~prod(value[between(date, .x, .x + 2)] + 1) - 1), by = id]
which gives (only showing id = 1):
id date value newValue
1: 1 2000 1 23
2: 1 2001 2 59
3: 1 2002 3 119
4: 1 2003 4 29
5: 1 2004 5 5
Update: because every date occurs at most once within each id, this should be more efficient:
df <- setDT(df)[order(id, date)]
df[,
   # in the outer formula .x is the current date and .y the indices of the
   # current row and the next (up to) two rows; between() drops any of those
   # rows whose date is more than two years ahead
   newValue := map2_dbl(
     date, map(seq_len(.N), ~.x:min(.x + 2, .N)),
     ~prod(value[.y][between(date[.y], .x, .x + 2)] + 1) - 1
   ),
   by = id
]
If you want some other number than 2, you can create a variable, say date_range, and replace the 2s with date_range, as sketched below.
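For concreteness, that parameterised version might look like this (a sketch; date_range is the variable name suggested above):

# window size: the current date plus the next date_range dates (here 2, as before)
date_range <- 2

df[,
   newValue := map2_dbl(
     date, map(seq_len(.N), ~.x:min(.x + date_range, .N)),
     ~prod(value[.y][between(date[.y], .x, .x + date_range)] + 1) - 1
   ),
   by = id
]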
Upvotes: 2