Reputation: 23
I have the folllowing data, where ID stands for an individual, Date for the date, and Purchased for whether somebody made a purchase (I made this last one so that I can count the the occurences):
ID Date Purchased
1 1 2017-01-01 1
2 1 2017-08-03 1
3 1 2017-09-02 1
4 2 2017-09-04 1
5 2 2018-07-12 1
6 2 2018-11-03 1
7 2 2018-12-05 1
8 2 2019-01-01 1
9 3 2018-02-03 1
10 3 2020-02-03 1
11 3 2020-03-01 1
I would like to create a variable called "Frequency" that calculates the number of times an individual has made a purchase in the past year by summing up all the "Purchased" before the specific Date you see in the data frame.
So for example, for row 3 this would lead to a "Frequency" of 2 since 2017-01-01
and 2017-08-03
are both within a one-year time period from 2017-09-02
(so within the interval of 2016-09-02
and 2017-09-01
).
See desired output:
ID Date Purchased Frequency
1 1 2017-01-01 1 0
2 1 2017-08-03 1 1
3 1 2017-09-02 1 2
4 2 2017-09-04 1 0
5 2 2018-07-12 1 1
6 2 2018-11-03 1 1
7 2 2018-12-05 1 2
8 2 2019-01-01 1 3
9 3 2018-02-03 1 0
10 3 2020-02-03 1 0
11 3 2020-03-01 1 1
To reproduce the dataframe:
df <- data.frame(ID = c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3), Date = as.Date(c('2017-01-01', '2017-08-03', '2017-09-02', '2017-09-04', '2018-07-12', '2018-11-03', '2018-12-05', '2019-01-01', '2018-02-03', '2020-02-03', '2020-03-01')), Purchased = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ))
I've searched on stackoverlow but haven't been able to find an answer yet that I'm able to apply to my situation and obtain the desired results. One of the things that I found and tried was this:
df$frequency <-
sapply(df$Date, function(x){
sum(df$Date < x & df$Date >= x - 365)
})
I believe this might give me the results I want if I can find a way to include that it groups by ID (so it sums per ID and not overall). Can't say for sure of course since I haven't been able to test it out. Any help is much appreciated.
Upvotes: 2
Views: 757
Reputation: 388862
Here's a tidyverse
solution :
library(dplyr)
library(purrr)
library(lubridate)
df %>%
group_by(ID) %>%
mutate(Frequency = map_dbl(Date,
~sum(Purchased[between(Date, .x - years(1), .x - 1)]))) %>%
ungroup
# ID Date Purchased Frequency
# <dbl> <date> <dbl> <dbl>
# 1 1 2017-01-01 1 0
# 2 1 2017-08-03 1 1
# 3 1 2017-09-02 1 2
# 4 2 2017-09-04 1 0
# 5 2 2018-07-12 1 1
# 6 2 2018-11-03 1 1
# 7 2 2018-12-05 1 2
# 8 2 2019-01-01 1 3
# 9 3 2018-02-03 1 0
#10 3 2020-02-03 1 0
#11 3 2020-03-01 1 1
The logic of the code is for every Date
in each ID
it sum
s the Purchased
value between current date - 1 year and current date - 1 day.
Upvotes: 2
Reputation: 41220
You could use non-equi joins with data.table
:
library(data.table)
setDT(df)
df[,c("Date","Before"):=.(as.Date(Date),as.Date(Date)-365)]
df[df,.(ID, Date),on=.(ID=ID, Date>=Before, Date<=Date)][,.N-1,by=.(ID,Date)]
ID Date V1
1: 1 2017-01-01 0
2: 1 2017-08-03 1
3: 1 2017-09-02 2
4: 2 2017-09-04 0
5: 2 2018-07-12 1
6: 2 2018-11-03 1
7: 2 2018-12-05 2
8: 2 2019-01-01 3
9: 3 2018-02-03 0
10: 3 2020-02-03 0
11: 3 2020-03-01 1
Upvotes: 1