Reputation: 359
I have a problem for which I could not find a good solution in time. I would really appreciate any assistance, as I think that for the professionals here it is just a few lines of code.
My data contains over 2 million rows of transactions. I want to run some sort of association rule analysis on the data.
I'm only interested in transactions (T_ID) that involve the product (P_ID) "PANDORA" and where I know the customer (c_ID). I made an example:
> T_ID <- c(10,10,10,11,12,13,13)
> P_ID <- c("PANDORA", "Others", "Pan","PANDORA","Ham", "PANDORA","Ham")
> c_ID <- c(1,1,1,2,-1,4,4)
> basket <- data.frame(T_ID,P_ID,c_ID)
> basket
  T_ID    P_ID c_ID
1   10 PANDORA    1
2   10  Others    1
3   10     Pan    1
4   11 PANDORA    2
5   12     Ham   -1
6   13 PANDORA    4
7   13     Ham    4
Transaction 10 contains the product "PANDORA", so all 3 of its rows should remain in the dataset. Transaction 12 has no customer attached (c_ID is -1), so it needs to be removed.
What I'm struggling with most is how to keep the rows that belong to a transaction ID containing "PANDORA" but store a different product.
Any help greatly appreciated,
Best regards, Christian
Upvotes: 0
Views: 96
Reputation: 23818
Maybe this helps:
keep_IDs <- basket$T_ID[with(basket, P_ID=="PANDORA" & c_ID!=-1)]
basket[basket$T_ID %in% keep_IDs,]
#  T_ID    P_ID c_ID
#1   10 PANDORA    1
#2   10  Others    1
#3   10     Pan    1
#4   11 PANDORA    2
#6   13 PANDORA    4
#7   13     Ham    4
data
basket <- structure(list(T_ID = c(10L, 10L, 10L, 11L, 12L, 13L, 13L, 14L, 14L),
P_ID = structure(c(6L, 4L, 5L, 6L, 1L, 6L, 1L, 3L, 2L),
.Label = c("Ham","Honey", "Nugget", "Others", "Pan", "PANDORA"), class = "factor"),
c_ID = c(1L, 1L, 1L, 2L, -1L, 4L, 4L, 5L, 5L)),
.Names = c("T_ID", "P_ID", "c_ID"), class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9"))
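The same filter can also be written as a single logical index. This is only a sketch in base R using the example data from the question, and it assumes that c_ID == -1 marks an unknown customer and that every row of a transaction carries the same c_ID. `ave()` with `FUN = any` spreads the "contains PANDORA" flag to every row of the same T_ID.

```r
# Example data from the question
# (assumption: c_ID == -1 means "customer unknown")
basket <- data.frame(
  T_ID = c(10, 10, 10, 11, 12, 13, 13),
  P_ID = c("PANDORA", "Others", "Pan", "PANDORA", "Ham", "PANDORA", "Ham"),
  c_ID = c(1, 1, 1, 2, -1, 4, 4),
  stringsAsFactors = FALSE
)

# TRUE for every row whose transaction has at least one PANDORA row
has_pandora <- ave(basket$P_ID == "PANDORA", basket$T_ID, FUN = any)

# Keep rows of PANDORA transactions that have a known customer
result <- basket[has_pandora & basket$c_ID != -1, ]
print(result)
```

Unlike the `keep_IDs` approach, this checks c_ID on each row, so the two only agree when a transaction never mixes known and unknown customers.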
Upvotes: 2
Reputation: 927
Does each transaction have only one customer id? I'm assuming so.
The first step is to remove the rows that do not have a customer ID.
cleanbasket = basket[basket$c_ID != -1,]
Next, we want to identify which transactions include PANDORA.
transactions = unique(basket$T_ID[basket$P_ID == "PANDORA"])
Then get all the rows for these transactions:
cleanbasket = cleanbasket[cleanbasket$T_ID %in% transactions,]
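The three steps above could be wrapped into one small function so the product and the "unknown customer" code are not hard-coded. This is just a sketch: `filter_baskets` and its parameter names are made up for illustration, and it assumes c_ID == -1 means the customer is unknown, as in the question's example.

```r
# Hypothetical helper (name and arguments are illustrative, not from the thread)
filter_baskets <- function(data, product = "PANDORA", unknown_customer = -1) {
  # Step 1: drop rows without a known customer
  clean <- data[data$c_ID != unknown_customer, ]
  # Step 2: transaction IDs that contain the product at least once
  transactions <- unique(data$T_ID[data$P_ID == product])
  # Step 3: keep every remaining row that belongs to one of those transactions
  clean[clean$T_ID %in% transactions, ]
}

# Example data from the question
basket <- data.frame(
  T_ID = c(10, 10, 10, 11, 12, 13, 13),
  P_ID = c("PANDORA", "Others", "Pan", "PANDORA", "Ham", "PANDORA", "Ham"),
  c_ID = c(1, 1, 1, 2, -1, 4, 4),
  stringsAsFactors = FALSE
)

cleanbasket <- filter_baskets(basket)
print(cleanbasket)
```

On the example data this keeps transactions 10, 11 and 13 and drops transaction 12.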
Upvotes: 0