Reputation: 159
I have a data frame "customers" build of customer id, month and total purchases that month. I'm trying to calculate a running slope for a window of 12 months using robust regression.
I have tried the following:
Coef <- function(x) {return(rlm(cbind(x)~cbind(1:length(x)))$coefficients[2])}
customer_slope = customers %>% mutate(slope = runner(x=total_purchases,k=12,f=Coef))
I get the following error:
x 'x' is singular: singular fits are not implemented in 'rlm'
If I run a single example, the function returns what I've expected:
Coef(c(4,11,7,15,5,14,8,9,14,17,14,13))
cbind(1:length(x)) 0.6888112
Upvotes: 1
Views: 247
Reputation: 2877
Consider data with two customers with data from 1000 days span. total_purchases
are cumulated by customer, and each purchase size is ~pois(5).
set.seed(1)
customers <- data.frame(
id = factor(rep(1:2, length.out = 100)),
date = seq(Sys.Date(), Sys.Date() + 1000, length.out = 100)
) %>%
group_by(id) %>%
mutate(
total_purchases = cumsum(rpois(n(), lambda = 5))
)
When using calculating regression in rolling window make sure that you handle errors which comming from insufficient degrees of freedom, singularity etc. - that is why I've put tryCatch
around rlm
call - if there is any error, function returns NA for failing window.
Data below is grouped by id
which means that model is calculated per customer. Yearly rolling regression should converge to the slope = 5
(+/- random error).
customers %>%
group_by(id) %>%
mutate(
slope = runner(
x = .,
f = function(x) {
tryCatch(
rlm(x$total_purchases ~ seq_len(nrow(x)))$coefficients[2],
error = function(e) NA
)
},
idx = "date",
k = "year"
)
)
Plotting slope in time for customers
ggplot(customers, aes(x = date, y = slope, color = id, group = id)) +
geom_line() +
geom_hline(yintercept = 5, color = "red")
Upvotes: 0
Reputation: 159
It look like that's the way to go, thanks! It seems like what caused the singularity was that I didn't change the default .complete from F to T. So, combined with your suggestion, this is how I made it work (took about two hours for 3M rows I did have however more complex group_by involved which is not shown below)
slope_rlm <- function(x) {
x=as.numeric(x)
prep = tibble(data=x)%>%mutate(t=1:n()%>%as.numeric())
return(rlm(data~t,data=prep)$coefficients[2])
}
customers_rlm = customers %>%
mutate(cust_rlm_12=slide_dbl(total_purchases,slope_rlm,.before=11,.complete=T))
Upvotes: 0
Reputation: 2368
So I ran into similar problems and finally came to the below solution using slider. This provides a 3 days rolling estimate (of course you can change as you see fit). This doesn't quite get to your answer (which you could probably get with loops), but most of the way there.
library(MASS)
library(dplyr)
library(slider)
dat <- tibble::tibble(customers = c(4,11,7,15,5,14,8,9,14,17,14,13)) %>%
mutate(t = 1:n() %>% as.numeric())
dat %>%
mutate(results = slide_dbl(.x = .,
.f = ~rlm(customers ~ t, k = 12, data = .x)$coefficients[2],
.before = 2,
.complete = T))
Upvotes: 1