Jenny We
Jenny We

Reputation: 105

Look-up in R across data tables with "IF" condition

I have two data tables. A table with customer orders (it shows a customer ID, and the order date when a purchase was made) and a table with customer segmentation (it shows in which segment a customer was classified as in a certain time period).

I want to add the segment from data table 2) as a new variable in data table 1) but of course only the segment the customer was in at the time of the order.

Customer_Orders <- data.table(
 customer_ID = c("A", "A"),
 order_date = c("2017-06-30", "2019-07-30")
)
head(Customer_Orders)
  customer_ID order_date
1:           A 2017-06-30
2:           A 2018-07-30


Customer_Segmentation <- data.table(
 customer_ID = c("A", "A", "A"),
 segment = c("1", "2", "3"),
 valid_from = c("2017-01-01", "2018-01-01", "2019-01-01"),
 valid_until = c("2017-12-31", "2018-12-31", "2019-12-31")
)
head(Customer_Segmentation)
   customer_ID segment valid_from valid_until
1:           A       1  2017-01-01 2017-12-31
2:           A       2  2018-01-01 2018-12-31
3:           A       3  2019-01-01 2019-12-31

This here is the manually constructed result I´m looking for

Result <- data.table(
 customer_ID = c("A", "A"),
 order_date = c("2017-06-30", "2019-07-30"),
 segment = c(1, 3)
)
head(Result)
   customer_ID order_date segment
1:           A 2017-06-30       1
2:           A 2019-07-30       3

Currently, my solution consists of doing a right-join to basically add all possible segments to each line in the customer orders table, and then exclude all rows where the order date is not in between the period of the segment. However, as my dataset is huge, this is a really slow and cumbersome solution

Upvotes: 3

Views: 120

Answers (3)

Matt Summersgill
Matt Summersgill

Reputation: 4242

Here's how I would approach the problem:

Data Generation (defining as proper Date vectors)

Customer_Orders <- data.table(
  customer_ID = c("A", "A"),
  order_date = as.Date(c("2017-06-30", "2019-07-30"))
)


Customer_Segmentation <- data.table(
  customer_ID = c("A", "A", "A"),
  segment = c("1", "2", "3"),
  valid_from =  as.Date(c("2017-01-01", "2018-01-01", "2019-01-01")),
  valid_until =  as.Date(c("2017-12-31", "2018-12-31", "2019-12-31"))
)

Non-equi Update Join to Add Segment

When using the A[B] syntax supported by data.table, it's relatively simple to add a single column from the B table to the original A table by using the i. prefix to reference columns in B. The remaining portion is just the on statement, which can be defined as a list using the .() notation in data.table with any number of conditions.

Customer_Orders[Customer_Segmentation, segment := i.segment, on = .(customer_ID==customer_ID,
                                                                    order_date>=valid_from,
                                                                    order_date<valid_until)]

print(Customer_Orders)
#   customer_ID order_date segment
#1:           A 2017-06-30       1
#2:           A 2019-07-30       3

Upvotes: 0

M--
M--

Reputation: 29237

Probably the easiest method would be using package:

library(sqldf)
sqldf("select * from Customer_Orders
               left join Customer_Segmentation
               on order_date between valid_from and valid_until
               and Customer_Orders.ID = Customer_Segmentation.ID")


# customer_ID order_date customer_ID..3 segment valid_from valid_until
# 1           A 2017-06-31              A       1 2017-01-01  2017-12-31
# 2           A 2019-07-30              A       3 2019-01-01  2019-12-31

It simply joins the tables if the date falls between the period of time provided

But if you insist on using look below;

setkey(Customer_Segmentation,customer_ID,valid_from)
setkey(Customer_Orders,customer_ID,order_date)

 ans <- Customer_Segmentation[Customer_Orders,list(.valid_from=valid_from,
                                                    valid_until,order_date,segment),
                      by=.EACHI,roll=T][,`:=`(.valid_from=NULL)]

 ans


# customer_ID valid_from valid_until order_date segment
# 1:           A 2017-06-31  2017-12-31 2017-06-31       1
# 2:           A 2019-07-30  2019-12-31 2019-07-30       3

It is easy to get rid of extra columns if unwanted.

Upvotes: 3

william3031
william3031

Reputation: 1708

How's this?

Your data (fixed):

library(tidyverse)
library(lubridate)

Customer_Orders <- tibble(
  customer_ID = c("A", "A"),
  order_date = c("2017-06-30", "2019-07-30"))

Customer_Segmentation <- tibble(
  customer_ID = c("A", "A", "A"),
  segment = c("1", "2", "3"),
  valid_from = c("2017-01-01", "2018-01-01", "2019-01-01"),
  valid_until = c("2017-12-31", "2018-12-31", "2019-12-31"))

Code - the first two tables are just to create dates from the initial tables using lubridate. The next one joins everything.

Customer_Orders2 <- Customer_Orders %>% 
  mutate(order_date = ymd(order_date))

Customer_Segmentation2 <- Customer_Segmentation %>% 
  mutate(valid_from = ymd(valid_from)) %>% 
  mutate(valid_until = ymd(valid_until))

Customer_Orders_join <- full_join(Customer_Orders2, Customer_Segmentation2)

This picks out the segments based on the interval.

Customer_Orders3 <- Customer_Orders_join %>% 
  filter(order_date %within% interval(valid_from, valid_until))

This produces:

# A tibble: 2 x 5
  customer_ID order_date segment valid_from valid_until
  <chr>       <date>     <chr>   <date>     <date>     
1 A           2017-06-30 1       2017-01-01 2017-12-31 
2 A           2019-07-30 3       2019-01-01 2019-12-31 

Upvotes: 0

Related Questions