Reputation: 920
While doing my data work I have this problem.
I have customer id, receipt_id and product_id. The product_id indicates the products that the given customer purchased at the given receipt.
The data is sorted by customer id and receipt_id. The lower value of receipt_id means the earlier shopping trip.
For each product, I want to create dummy variable that indicate whether each product is purchased in past shopping trip (in previous receipt id).
I have first three columns and want to create 4th column, "purchased_before".
I can do it by using for loop but is there any efficient way?
Data is as below,
customer id receipt_id product_id purchased_before
1 1 113 0
1 1 114 0
1 2 113 1
1 2 116 0
1 2 346 0
1 3 421 0
1 3 114 1
1 3 421 0
....
2 1 213 0
2 1 114 0
2 2 113 0
2 2 116 0
2 2 346 0
2 3 113 1
2 3 114 1
2 3 421 0
....
Upvotes: 0
Views: 53
Reputation: 16277
Here's how to do that with dplyr
. Note that this solution picks up that product_id 421 has been purchased before by customer id 1. Your question does not show that.
df <-read.table(text="customer_id receipt_id product_id purchased_before
1 1 113 0
1 1 114 0
1 2 113 1
1 2 116 0
1 2 346 0
1 3 421 0
1 3 114 1
1 3 421 0
2 1 213 0
2 1 114 0
2 2 113 0
2 2 116 0
2 2 346 0
2 3 113 1
2 3 114 1
2 3 421 0",header=TRUE, stringsAsFactors=FALSE)
library(dplyr)
df %>%
group_by(customer_id,product_id) %>%
mutate(purchased_before2=duplicated(product_id)*1)
customer_id receipt_id product_id purchased_before purchased_before2
<int> <int> <int> <int> <dbl>
1 1 1 113 0 0
2 1 1 114 0 0
3 1 2 113 1 1
4 1 2 116 0 0
5 1 2 346 0 0
6 1 3 421 0 0
7 1 3 114 1 1
8 1 3 421 0 1
9 2 1 213 0 0
10 2 1 114 0 0
11 2 2 113 0 0
12 2 2 116 0 0
13 2 2 346 0 0
14 2 3 113 1 1
15 2 3 114 1 1
16 2 3 421 0 0
Upvotes: 1