Reputation: 621
I have implemented the Apriori algorithm on my dataset. The rules I get, though, are inverted repetitions, that is:
inspect(head(rules))
lhs rhs support confidence lift count
[1] {252-ON-OFF} => {L30-ATLANTIC} 0.04545455 1 22 1
[2] {L30-ATLANTIC} => {252-ON-OFF} 0.04545455 1 22 1
[3] {252-ON-OFF} => {M01-A molle biconiche} 0.04545455 1 22 1
[4] {M01-A molle biconiche} => {252-ON-OFF} 0.04545455 1 22 1
[5] {L30-ATLANTIC} => {M01-A molle biconiche} 0.04545455 1 22 1
[6] {M01-A molle biconiche} => {L30-ATLANTIC} 0.04545455 1 22 1
As can be seen, rule 1 and rule 2 are the same; only the LHS and RHS are interchanged. Is there any way to remove such rules from the final result?
I saw this post (link) but the proposed solution is not correct. I also saw this post (link) and tried these two solutions:
Solution A:
rules <- rules[!is.redundant(rules)]
but the result is always the same:
inspect(head(rules))
lhs rhs support confidence lift count
[1] {252-ON-OFF} => {L30-ATLANTIC} 0.04545455 1 22 1
[2] {L30-ATLANTIC} => {252-ON-OFF} 0.04545455 1 22 1
[3] {252-ON-OFF} => {M01-A molle biconiche} 0.04545455 1 22 1
[4] {M01-A molle biconiche} => {252-ON-OFF} 0.04545455 1 22 1
[5] {L30-ATLANTIC} => {M01-A molle biconiche} 0.04545455 1 22 1
[6] {M01-A molle biconiche} => {L30-ATLANTIC} 0.04545455 1 22 1
Solution B:
# find redundant rules
subset.matrix <- is.subset(rules, rules)
subset.matrix[lower.tri(subset.matrix, diag=T)] <- NA
redundant <- colSums(subset.matrix, na.rm=T) > 1
which(redundant)
rules.pruned <- rules[!redundant]
inspect(rules.pruned)
lhs rhs support confidence lift count
[1] {} => {BRC-BRC} 0.04545455 0.04545455 1 1
[2] {} => {111-WINK} 0.04545455 0.04545455 1 1
[3] {} => {305-INGRAM HIGH} 0.04545455 0.04545455 1 1
[4] {} => {952-REVERS} 0.04545455 0.04545455 1 1
[5] {} => {002-LC2} 0.09090909 0.09090909 1 2
[6] {} => {252-ON-OFF} 0.04545455 0.04545455 1 1
[7] {} => {L30-ATLANTIC} 0.04545455 0.04545455 1 1
[8] {} => {M01-A molle biconiche} 0.04545455 0.04545455 1 1
[9] {} => {678-Portovenere} 0.04545455 0.04545455 1 1
[10] {} => {251-MET T.} 0.04545455 0.04545455 1 1
[11] {} => {324-D.S.3} 0.04545455 0.04545455 1 1
[12] {} => {L04-YUME} 0.04545455 0.04545455 1 1
[13] {} => {969-Lubekka} 0.04545455 0.04545455 1 1
[14] {} => {000-FUORI LISTINO} 0.04545455 0.04545455 1 1
[15] {} => {007-LC7} 0.04545455 0.04545455 1 1
[16] {} => {341-COS} 0.04545455 0.04545455 1 1
[17] {} => {601-ROBIE 1} 0.04545455 0.04545455 1 1
[18] {} => {608-TALIESIN 2} 0.04545455 0.04545455 1 1
[19] {} => {610-ROBIE 2} 0.04545455 0.04545455 1 1
[20] {} => {615-HUSSER} 0.04545455 0.04545455 1 1
[21] {} => {831-DAKOTA} 0.04545455 0.04545455 1 1
[22] {} => {997-997} 0.27272727 0.27272727 1 6
[23] {} => {412-CAB} 0.09090909 0.09090909 1 2
[24] {} => {S01-A doghe senza movimenti} 0.09090909 0.09090909 1 2
[25] {} => {708-Genoa} 0.09090909 0.09090909 1 2
[26] {} => {998-998} 0.54545455 0.54545455 1 12
Has anyone had the same problem and found a way to solve it? Thanks for your help.
Upvotes: 4
Views: 3739
Reputation: 3075
The issue is your dataset, not the algorithm. In the result you can see that the count of many rules is 1 (the item combination occurs only once in the transactions) and the confidence is 1 for both the rule and its "inverse." This means that you need more data, and that you should increase the minimum support.
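As a minimal sketch of what that could look like (here trans is a placeholder for your transactions object, and the thresholds are only illustrative, not taken from your data):
library(arules)
# raise the minimum support and require at least 2 items per rule, so that
# single-occurrence combinations and empty-LHS rules no longer appear
rules <- apriori(trans, parameter = list(support = 0.05, confidence = 0.8, minlen = 2))
inspect(head(sort(rules, by = "lift")))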
If you still want to get rid of such "duplicate" rules efficiently, then you can do the following:
> library(arules)
> data(Groceries)
> rules <- apriori(Groceries, parameter = list(support = 0.001))
> rules
set of 410 rules
> gi <- generatingItemsets(rules)
> d <- which(duplicated(gi))
> rules[-d]
set of 385 rules
The code only keeps the first rule of each set of rules with exactly the same items.
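For instance, the same idea applied directly to your own rules object could be written as follows (a sketch; using logical instead of negative indexing also works when no duplicates are found):
# keep only the first rule among those generated by the same itemset
gi <- generatingItemsets(rules)
rules <- rules[!duplicated(gi)]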
Upvotes: 3
Reputation: 13354
You can do it with brute force, by converting your rules object into a data.frame and iteratively comparing the LHS/RHS item vectors. Here is an example using the grocery.csv dataset:
inspect(head(groceryrules))
# convert the rules object to a data.frame
trans_frame <- data.frame(lhs = labels(lhs(groceryrules)),
                          rhs = labels(rhs(groceryrules)),
                          groceryrules@quality)

# loop through each row of trans_frame
rem_indx <- NULL
for(i in 1:nrow(trans_frame)) {
  trans_vec_a <- c(as.character(trans_frame[i, 1]), as.character(trans_frame[i, 2]))
  # for each row evaluated, compare to every other row in trans_frame
  for(k in 1:nrow(trans_frame[-i, ])) {
    trans_vec_b <- c(as.character(trans_frame[-i, ][k, 1]), as.character(trans_frame[-i, ][k, 2]))
    if(setequal(trans_vec_a, trans_vec_b)) {
      # store the index to remove
      rem_indx[i] <- i
    }
  }
}
This gives you a vector of indices that should be removed (because they are duplicated/inverted rules):
duped_trans <- trans_frame[rem_indx[!is.na(rem_indx)], ]
duped_trans
We can see that it identified the two rules that were duplicates/inverses of each other.
Now we can keep the non-duplicated rules:
deduped_trans <- trans_frame[-rem_indx[!is.na(rem_indx)], ]
The issue, of course, is that the above algorithm is extremely inefficient: it compares every row of trans_frame against every other row. The groceryrules object used here contains only 463 rules; for any reasonable number of rules you will need to vectorize the function.
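As a rough illustration of how that vectorization could look (this builds on the trans_frame created above; the key construction is only a sketch, not part of the code shown so far), you can build an order-independent key from each rule's items and drop rows whose key has already been seen:
# pool the LHS and RHS items of each rule into a sorted, comma-separated key;
# rules that only swap LHS and RHS end up with the same key
rule_key <- mapply(function(l, r) {
  items <- c(strsplit(gsub("[{}]", "", l), ",")[[1]],
             strsplit(gsub("[{}]", "", r), ",")[[1]])
  paste(sort(unique(items)), collapse = ",")
}, as.character(trans_frame$lhs), as.character(trans_frame$rhs))
# keep only the first occurrence of each key
deduped_trans <- trans_frame[!duplicated(rule_key), ]
Like the generatingItemsets() approach in the other answer, this treats any two rules built from the same set of items as duplicates.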
Upvotes: 0