Reputation: 150
In my data i have customer-ids, orderdates and an indicator if an order contained a type of product. I want to give an indicator to each customer, if his first order contained this type of product. But because my data is pretty big i cannot use group_by and case_when, because it is way too slow. I think i could speed things up a lot by using data.table.
Could you point me to a solution? I haven´t had any contact with data.table until now...
# generate data
id <- round(rnorm(3000, mean = 5000, 400),0)
date <- seq.Date(as.Date("2018-01-01"), as.Date("2018-12-31"), "day")
date <- sample(date, length(id), replace = TRUE)
indicator <- rbinom(length(id), 1, 0.5)
df <- data.frame(id, date, indicator)
df$id <- as.factor(df$id)
# Does the first Order contain X?
df <- df %>% group_by(id) %>% mutate(First_Order_contains_x = case_when(
date == min(date) & indicator == "1" ~ 1,
TRUE ~ 0
)) %>% ungroup()
# If first order > 1 ==> all orders get 1 //
df <- df %>% group_by(id) %>% mutate(Customer_type = case_when(
sum(First_Order_contains_x) > 0 ~ "Customer with X in first order",
TRUE ~ "Customer without x in first order"
)) %>% ungroup()
Upvotes: 1
Views: 541
Reputation: 25225
Another data.table
approach. Sort the data first so that the first date is the earliest date and we can then use the first indicator for testing the condition. Then, convert logical to an integer (FALSE
-> 1
and TRUE
-> 2
) and map into desired output using a character vector.
library(data.table)
setDT(df)
setorder(df, id, date)
map <- c("Customer without x in first order", "Customer with X in first order")
df[, idx := 1L+any(indicator[1L]==1L), by=.(id)][,
First_Order_contains_x := map[idx]]
If the original order is important, we can store the original order using df[, rn := .I]
then finally setorder(df, rn)
.
data:
set.seed(0L)
id <- round(rnorm(3000, mean = 5000, 5),0)
date <- seq.Date(as.Date("2018-01-01"), as.Date("2018-12-31"), "day")
date <- sample(date, length(id), replace = TRUE)
indicator <- rbinom(length(id), 1, 0.5)
df <- data.frame(id, date, indicator)
df$id <- as.factor(df$id)
Upvotes: 0
Reputation: 66819
Another way:
library(data.table)
DT = data.table(df[, 1:3])
lookupDT = DT[, .(date = min(date)), by=id]
lookupDT[, fx := DT[copy(.SD), on=.(id, date), max(indicator), by=.EACHI]$V1]
DT[, v := "Customer without x in first order"]
DT[lookupDT[fx == 1L], on=.(id), v := "Customer with X in first order"]
# check results
fsetequal(DT[, .(id, v)], data.table(id = df$id, v = df$Customer_type))
# [1] TRUE
If you want more speed improvements, maybe see ?IDate
.
The copy
on .SD
is needed due to an open issue.
Upvotes: 2
Reputation: 70266
Here's how you can improve your existing code using dplyr more efficiently:
lookup = data.frame(First_Order_contains_x = c(TRUE, FALSE),
Customer_Type = c("Customer with X in first order",
"Customer without x in first order"))
df %>%
group_by(id) %>%
mutate(First_Order_contains_x = any(as.integer(date == min(date) & indicator == 1))) %>%
ungroup() %>%
left_join(lookup, by = "First_Order_contains_x")
# A tibble: 3,000 x 5
id date indicator First_Order_contains_x Customer_Type
<fct> <date> <dbl> <lgl> <fct>
1 5056 2018-03-10 1 TRUE Customer with X in first order
2 5291 2018-12-28 0 FALSE Customer without x in first order
3 5173 2018-04-19 0 FALSE Customer without x in first order
4 5159 2018-11-13 0 TRUE Customer with X in first order
5 5252 2018-05-30 0 TRUE Customer with X in first order
6 5200 2018-01-20 0 FALSE Customer without x in first order
7 4578 2018-12-18 1 FALSE Customer without x in first order
8 5308 2018-03-24 1 FALSE Customer without x in first order
9 5234 2018-05-29 1 TRUE Customer with X in first order
10 5760 2018-06-12 1 TRUE Customer with X in first order
# … with 2,990 more rows
Upvotes: 0