Reputation: 150

data.table alternative for slow group_by() and case_when() function

In my data i have customer-ids, orderdates and an indicator if an order contained a type of product. I want to give an indicator to each customer, if his first order contained this type of product. But because my data is pretty big i cannot use group_by and case_when, because it is way too slow. I think i could speed things up a lot by using data.table.

Could you point me to a solution? I haven´t had any contact with data.table until now...

# generate data
id <- round(rnorm(3000, mean = 5000, 400),0)
date <- seq.Date(as.Date("2018-01-01"), as.Date("2018-12-31"), "day")
date <- sample(date, length(id), replace = TRUE)
indicator <-  rbinom(length(id), 1, 0.5)

df <- data.frame(id, date, indicator)
df$id <- as.factor(df$id)

# Does the first Order contain X?
df <- df %>% group_by(id) %>% mutate(First_Order_contains_x = case_when(
  date == min(date) & indicator == "1" ~ 1,
  TRUE ~ 0
)) %>% ungroup() 

# If first order > 1 ==> all orders get 1 // 
df <- df %>% group_by(id) %>% mutate(Customer_type = case_when(
  sum(First_Order_contains_x) > 0 ~ "Customer with X in first order",
  TRUE ~ "Customer without x in first order"
)) %>% ungroup()

Upvotes: 1

Answers (3)

chinsoon12

Reputation: 25225

Another data.table approach. Sort the data first so that the first date is the earliest date and we can then use the first indicator for testing the condition. Then, convert logical to an integer (FALSE -> 1 and TRUE -> 2) and map into desired output using a character vector.

library(data.table)
setDT(df)
setorder(df, id, date)
map <- c("Customer without x in first order", "Customer with X in first order")
df[, idx := 1L+any(indicator[1L]==1L), by=.(id)][,
    First_Order_contains_x := map[idx]]

If the original order is important, we can store the original order using df[, rn := .I] then finally setorder(df, rn).

data:

set.seed(0L)
id <- round(rnorm(3000, mean = 5000, 5),0)
date <- seq.Date(as.Date("2018-01-01"), as.Date("2018-12-31"), "day")
date <- sample(date, length(id), replace = TRUE)
indicator <-  rbinom(length(id), 1, 0.5)

df <- data.frame(id, date, indicator)
df$id <- as.factor(df$id)

Upvotes: 0

Frank

Reputation: 66819

Another way:

library(data.table)
DT = data.table(df[, 1:3])

lookupDT = DT[, .(date = min(date)), by=id]
lookupDT[, fx := DT[copy(.SD), on=.(id, date), max(indicator), by=.EACHI]$V1]

DT[, v := "Customer without x in first order"]
DT[lookupDT[fx == 1L], on=.(id), v := "Customer with X in first order"]

# check results
fsetequal(DT[, .(id, v)], data.table(id = df$id, v = df$Customer_type))
# [1] TRUE

If you want more speed improvements, maybe see ?IDate.

The copy on .SD is needed due to an open issue.

Upvotes: 2

talat

Reputation: 70266

Here's how you can improve your existing code using dplyr more efficiently:

lookup = data.frame(First_Order_contains_x = c(TRUE, FALSE), 
                    Customer_Type = c("Customer with X in first order", 
                                      "Customer without x in first order"))

df %>% 
  group_by(id) %>% 
  mutate(First_Order_contains_x = any(as.integer(date == min(date) & indicator == 1))) %>% 
  ungroup() %>% 
  left_join(lookup, by = "First_Order_contains_x")

# A tibble: 3,000 x 5
   id    date       indicator First_Order_contains_x Customer_Type                    
   <fct> <date>         <dbl> <lgl>                  <fct>                            
 1 5056  2018-03-10         1 TRUE                   Customer with X in first order   
 2 5291  2018-12-28         0 FALSE                  Customer without x in first order
 3 5173  2018-04-19         0 FALSE                  Customer without x in first order
 4 5159  2018-11-13         0 TRUE                   Customer with X in first order   
 5 5252  2018-05-30         0 TRUE                   Customer with X in first order   
 6 5200  2018-01-20         0 FALSE                  Customer without x in first order
 7 4578  2018-12-18         1 FALSE                  Customer without x in first order
 8 5308  2018-03-24         1 FALSE                  Customer without x in first order
 9 5234  2018-05-29         1 TRUE                   Customer with X in first order   
10 5760  2018-06-12         1 TRUE                   Customer with X in first order   
# … with 2,990 more rows

Upvotes: 0

data.table alternative for slow group_by() and case_when() function

Answers (3)

Related Questions