Shawn Janzen

Reputation: 439

Association Rules in R - Trying to understand two similar datasets but one has rules but not the other

Data: I have two small example datasets in R called alpha and bravo. Each has 199 rows and 4 columns, and all values are binary. The distributions of 0 and 1 values differ between alpha and bravo, but they are roughly similar. Reproducible data code is at the end of this post.

Goal: Produce the association rules with LHS, RHS, Support, Confidence, and Lift Ratio values for each dataset.

Problem: I have two questions / asks.

Work thus far:

  1. Double-checked that the data values are indeed the same data type and dimensions.
  2. Ran and compared apriori with both datasets input into the function in binary, logical, and transaction formats.
  3. Adjusted apriori parameters to lower the Support and Confidence values in case alpha's 'no rules' results were a threshold issue.

Understanding apriori in R: From what I've read in the apriori docs, I should be fine inputting my data in binary, logical, or transaction formats; however, the first two will be coerced to transactions when processed within the apriori function. I also noted the warning that such coercion may cause issues if the data is not "well behaved", in relation to the itemCoding and discretizeDF functions, but I haven't yet pinpointed how that ties in with everything I'm seeing.

Data sneak peek (Repro code below)

alpha[1:3,]    bravo[1:3,]
a1 a2 a3 a4    b1 b2 b3 b4
 1  0  0  1     0  0  1  1
 1  0  0  1     1  1  1  1
 0  0  0  1     0  0  1  1

Example association rule code & outputs

# create rules with three data input options
# alpha
a.bin.rules <- apriori(alpha)  # binary
a.log.rules <- apriori(alpha>0.5) # logical
a.tra.rules <- apriori(as(alpha, "transactions")) # transactions input
# bravo
b.bin.rules <- apriori(bravo)  # binary
b.log.rules <- apriori(bravo>0.5) # logical 
b.tra.rules <- apriori(as(bravo, "transactions")) # transactions

Apriori quality outputs were the same for all 6:


Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target  ext
        0.8    0.1    1 none FALSE            TRUE       5     0.1      1     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Apriori remaining output for alpha binary data input:

Absolute minimum support count: 19 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[4 item(s), 199 transaction(s)] done [0.00s].
sorting and recoding items ... [4 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [32 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].

Apriori remaining output was:

Only the bravo dataset with the logical data format input worked as expected. Results truncated here to save space. Showing mid-output.

     lhs             rhs  support   confidence coverage  lift     count
[6]  {b1, b2}     => {b3} 0.1306533 0.8666667  0.1507538 1.077917  26  
[7]  {b1, b2}     => {b4} 0.1457286 0.9666667  0.1507538 1.241075  29  
[8]  {b2, b3}     => {b4} 0.2713568 0.9310345  0.2914573 1.195328  54  

All other inspected rules had =[0,1] following each variable name, and rule metrics all equaled 1. Why is this? Results truncated here to save space. Showing mid-output. Example below is from alpha binary and was the same for alpha transactions. Bravo only differed by changing the variable letters from a to b.

     lhs                               rhs        support confidence coverage lift count
[17] {a1=[0,1], a2=[0,1]}           => {a3=[0,1]} 1       1          1        1    199  
[18] {a1=[0,1], a3=[0,1]}           => {a2=[0,1]} 1       1          1        1    199  
[19] {a2=[0,1], a3=[0,1]}           => {a1=[0,1]} 1       1          1        1    199  
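
One plausible reading of the =[0,1] labels, offered as an assumption rather than something confirmed by the arules docs quoted here: when the 0/1 columns are treated as numeric, each column is binned into a single interval that covers both values, so the resulting item is present in every row. The base R sketch below mimics that effect with cut() on a made-up toy vector x; it does not call arules.

```r
# Toy 0/1 column, standing in for a column such as a1 (values are made up).
x <- c(1L, 1L, 0L, 1L, 0L)

# Binning into one interval that spans both values yields a single level ...
binned <- cut(x, breaks = c(0, 1), include.lowest = TRUE)
levels(binned)

# ... which every row carries, so an item built from it would have support 1.
mean(!is.na(binned))

# Thresholding keeps the information instead: TRUE marks presence of the item.
present <- x > 0.5
mean(present)
```

If that reading is right, it would explain why every metric equals 1 for those rules: an item that occurs in all 199 rows cannot have any other support, confidence, or lift.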

I decreased Support and Confidence to as low as 0.01 each, and the results were the same for all versions. The only exception was bravo logical, which still worked and produced up to 32 working rules. Code update examples:

b.bin.rules <- apriori(bravo, parameter = list(supp = 0.01, conf = 0.01))  # binary
b.log.rules <- apriori(bravo>0.5, parameter = list(supp = 0.01, conf = 0.01)) # logical 
b.tra.rules <- apriori(as(bravo, "transactions"), parameter = list(supp = 0.01, conf = 0.01)) # transactions

Dataset comparison

# Sparsity
sum(as.matrix(alpha) == 0) / length(as.matrix(alpha))
[1] 0.5226131
sum(as.matrix(bravo) == 0) / length(as.matrix(bravo))
[1] 0.4120603

# Column totals
 a1  a2  a3  a4 
 91 115  52 122 
 b1  b2  b3  b4 
 88  65 160 155 

# Row total sums
table( rowSums(alpha) )
 0  1  2  3  4 
10 56 82 44  7 
table( rowSums(bravo) )
 0  1  2  3  4 
 9 29 69 67 25

# Cross-Data Column Correlations
round(sapply(1:ncol(alpha), function(i) cor(alpha[, i], bravo[, i])), 5)
[1] -0.06583 -0.16407  0.00550 -0.00062

# Similarity comparison by element
comparison <- alpha == bravo
sim_count <- sum(comparison)
(sim_count / (nrow(alpha) * ncol(alpha))) * 100
[1] 44.72362

Reproducible datasets

alpha <- structure(list(a1 = c(1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 
0L, 1L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 
0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 
0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 
0L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 
0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 
1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 
1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 
0L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 
1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 
1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 
1L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L), a2 = c(0L, 
0L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 
0L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 
1L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 
0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 
0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 
1L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 
0L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 
1L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 
0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 
1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 
0L, 1L, 0L, 1L, 0L, 0L), a3 = c(0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 
1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 
1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 
0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 
0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 
1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 
0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 
1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 
1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 
0L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 0L), 
    a4 = c(1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 
    1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 
    0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 
    1L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 
    1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 
    1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 
    0L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 
    0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 
    0L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 
    1L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 
    1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 
    0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 
    1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 
    0L, 1L, 0L, 1L, 1L, 1L)), row.names = c(NA, -199L), class = "data.frame")

bravo <- structure(list(b1 = c(0L, 1L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 
0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 
0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 
0L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 
1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 
0L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 
0L, 1L, 1L, 0L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 
0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 
1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 
0L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 
0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 
1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), b2 = c(0L, 
1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 
1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 
1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 
1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 
1L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 
0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 
0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 
1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 
0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 
0L, 0L, 1L, 1L, 1L, 0L), b3 = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 
1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 
0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 
1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 
1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L), 
    b4 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 
    1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 
    1L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 
    1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 
    1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 
    1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 
    0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 
    0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 
    0L, 0L, 1L, 0L, 1L, 1L)), class = "data.frame", row.names = c(NA, 

Upvotes: 0

Views: 53

Answers (1)

Michael Hahsler

Reputation: 3075

Your data is encoded as 0 and 1. arules needs this encoded as TRUE and FALSE, so using > .5 is the right step. Also, the rules in the data have very low confidence, so you need to change the default of .8. Here is code to create rules for alpha. You can use similar code for bravo.

> # create transactions and make sure they look OK
> tr_alpha <- as(alpha > .5, "transactions")
> summary(tr_alpha)

transactions as itemMatrix in sparse format with
 199 rows (elements/itemsets/transactions) and
 4 columns (items) and a density of 0.4773869 

most frequent items:
     a4      a2      a1      a3 (Other) 
    122     115      91      52       0 

element (itemset/transaction) length distribution:
 0  1  2  3  4 
10 56 82 44  7 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    1.00    2.00    1.91    3.00    4.00 

includes extended item information - examples:
1     a1
2     a2
3     a3

> # mine rules with a reduced confidence threshold
> rules <- apriori(tr_alpha, support = 0.1, confidence = .5)


Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target  ext
        0.5    0.1    1 none FALSE            TRUE       5     0.1      1     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 19 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[4 item(s), 199 transaction(s)] done [0.00s].
sorting and recoding items ... [4 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.00s].
writing ... [11 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].

> inspect(rules)
     lhs         rhs  support   confidence coverage  lift      count
[1]  {}       => {a2} 0.5778894 0.5778894  1.0000000 1.0000000 115  
[2]  {}       => {a4} 0.6130653 0.6130653  1.0000000 1.0000000 122  
[3]  {a3}     => {a1} 0.1306533 0.5000000  0.2613065 1.0934066  26  
[4]  {a3}     => {a2} 0.1407035 0.5384615  0.2613065 0.9317726  28  
[5]  {a3}     => {a4} 0.1457286 0.5576923  0.2613065 0.9096784  29  
[6]  {a1}     => {a2} 0.2412060 0.5274725  0.4572864 0.9127568  48  
[7]  {a1}     => {a4} 0.3015075 0.6593407  0.4572864 1.0754819  60  
[8]  {a2}     => {a4} 0.3266332 0.5652174  0.5778894 0.9219530  65  
[9]  {a4}     => {a2} 0.3266332 0.5327869  0.6130653 0.9219530  65  
[10] {a1, a2} => {a4} 0.1507538 0.6250000  0.2412060 1.0194672  30  
[11] {a1, a4} => {a2} 0.1507538 0.5000000  0.3015075 0.8652174  30
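
As a sanity check, the metrics for rule [7] ({a1} => {a4}) can be reproduced by hand from the counts shown in this thread: 199 transactions, a1 in 91 of them, a4 in 122, and both together in 60. This is only a minimal arithmetic sketch in base R; the variable names are illustrative.

```r
# Counts taken from the output above.
n      <- 199  # number of transactions
n_a1   <- 91   # transactions containing a1 (the LHS)
n_a4   <- 122  # transactions containing a4 (the RHS)
n_both <- 60   # transactions containing both a1 and a4

support    <- n_both / n               # share of rows with LHS and RHS together
confidence <- n_both / n_a1            # share of LHS rows that also contain the RHS
lift       <- confidence / (n_a4 / n)  # confidence relative to the RHS base rate

round(c(support = support, confidence = confidence, lift = lift), 7)
```

The highest confidence in the list is about 0.66, which is why the default threshold of 0.8 returns no rules for alpha.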

Upvotes: 1
