Reputation: 105
I have two data tables. A table with customer orders (it shows a customer ID, and the order date when a purchase was made) and a table with customer segmentation (it shows in which segment a customer was classified as in a certain time period).
I want to add the segment from data table 2) as a new variable in data table 1) but of course only the segment the customer was in at the time of the order.
Customer_Orders <- data.table(
customer_ID = c("A", "A"),
order_date = c("2017-06-30", "2019-07-30")
)
head(Customer_Orders)
customer_ID order_date
1: A 2017-06-30
2: A 2018-07-30
Customer_Segmentation <- data.table(
customer_ID = c("A", "A", "A"),
segment = c("1", "2", "3"),
valid_from = c("2017-01-01", "2018-01-01", "2019-01-01"),
valid_until = c("2017-12-31", "2018-12-31", "2019-12-31")
)
head(Customer_Segmentation)
customer_ID segment valid_from valid_until
1: A 1 2017-01-01 2017-12-31
2: A 2 2018-01-01 2018-12-31
3: A 3 2019-01-01 2019-12-31
This here is the manually constructed result I´m looking for
Result <- data.table(
customer_ID = c("A", "A"),
order_date = c("2017-06-30", "2019-07-30"),
segment = c(1, 3)
)
head(Result)
customer_ID order_date segment
1: A 2017-06-30 1
2: A 2019-07-30 3
Currently, my solution consists of doing a right-join to basically add all possible segments to each line in the customer orders table, and then exclude all rows where the order date is not in between the period of the segment. However, as my dataset is huge, this is a really slow and cumbersome solution
Upvotes: 3
Views: 120
Reputation: 4242
Here's how I would approach the problem:
Date
vectors)Customer_Orders <- data.table(
customer_ID = c("A", "A"),
order_date = as.Date(c("2017-06-30", "2019-07-30"))
)
Customer_Segmentation <- data.table(
customer_ID = c("A", "A", "A"),
segment = c("1", "2", "3"),
valid_from = as.Date(c("2017-01-01", "2018-01-01", "2019-01-01")),
valid_until = as.Date(c("2017-12-31", "2018-12-31", "2019-12-31"))
)
When using the A[B]
syntax supported by data.table
, it's relatively simple to add a single column from the B
table to the original A
table by using the i.
prefix to reference columns in B
. The remaining portion is just the on
statement, which can be defined as a list using the .()
notation in data.table
with any number of conditions.
Customer_Orders[Customer_Segmentation, segment := i.segment, on = .(customer_ID==customer_ID,
order_date>=valid_from,
order_date<valid_until)]
print(Customer_Orders)
# customer_ID order_date segment
#1: A 2017-06-30 1
#2: A 2019-07-30 3
Upvotes: 0
Reputation: 29237
Probably the easiest method would be using sqldf package:
library(sqldf)
sqldf("select * from Customer_Orders
left join Customer_Segmentation
on order_date between valid_from and valid_until
and Customer_Orders.ID = Customer_Segmentation.ID")
# customer_ID order_date customer_ID..3 segment valid_from valid_until
# 1 A 2017-06-31 A 1 2017-01-01 2017-12-31
# 2 A 2019-07-30 A 3 2019-01-01 2019-12-31
It simply joins the tables if the date falls between the period of time provided
But if you insist on using data.table look below;
setkey(Customer_Segmentation,customer_ID,valid_from)
setkey(Customer_Orders,customer_ID,order_date)
ans <- Customer_Segmentation[Customer_Orders,list(.valid_from=valid_from,
valid_until,order_date,segment),
by=.EACHI,roll=T][,`:=`(.valid_from=NULL)]
ans
# customer_ID valid_from valid_until order_date segment
# 1: A 2017-06-31 2017-12-31 2017-06-31 1
# 2: A 2019-07-30 2019-12-31 2019-07-30 3
It is easy to get rid of extra columns if unwanted.
Upvotes: 3
Reputation: 1708
How's this?
Your data (fixed):
library(tidyverse)
library(lubridate)
Customer_Orders <- tibble(
customer_ID = c("A", "A"),
order_date = c("2017-06-30", "2019-07-30"))
Customer_Segmentation <- tibble(
customer_ID = c("A", "A", "A"),
segment = c("1", "2", "3"),
valid_from = c("2017-01-01", "2018-01-01", "2019-01-01"),
valid_until = c("2017-12-31", "2018-12-31", "2019-12-31"))
Code - the first two tables are just to create dates from the initial tables using lubridate
. The next one joins everything.
Customer_Orders2 <- Customer_Orders %>%
mutate(order_date = ymd(order_date))
Customer_Segmentation2 <- Customer_Segmentation %>%
mutate(valid_from = ymd(valid_from)) %>%
mutate(valid_until = ymd(valid_until))
Customer_Orders_join <- full_join(Customer_Orders2, Customer_Segmentation2)
This picks out the segments based on the interval.
Customer_Orders3 <- Customer_Orders_join %>%
filter(order_date %within% interval(valid_from, valid_until))
This produces:
# A tibble: 2 x 5
customer_ID order_date segment valid_from valid_until
<chr> <date> <chr> <date> <date>
1 A 2017-06-30 1 2017-01-01 2017-12-31
2 A 2019-07-30 3 2019-01-01 2019-12-31
Upvotes: 0