Arne

Reputation: 57

R: Apriori Algorithm does not find any association rules

I generated a dataset with two columns: an ID column identifying each customer and a column listing that customer's active products:

head(df_itemList)

      ID      PRD_LISTE
1     1       A,B,C
3     2       C,D
4     3       A,B
5     4       A,B,C,D,E
7     5       B,A,D
8     6       A,C,D

I selected only customers who own more than one product. In total there are 589,454 rows and 16 different products.

Next, I wrote the data.frame to a CSV file like this:

df_itemList$ID <- NULL
colnames(df_itemList) <- c("itemList")
write.csv(df_itemList, "Basket_List_13-08-2020.csv", row.names = TRUE)

Then I read the CSV file in basket format so that I could apply the Apriori algorithm as implemented in the arules package:

library(arules)  
txn <- read.transactions(file="Basket_List_13-08-2020.csv", 
                         rm.duplicates= TRUE, format="basket",sep=",",cols=1)
txn@itemInfo$labels <- gsub("\"","",txn@itemInfo$labels)

The summary-function yields the following output:

summary(txn)
transactions as itemMatrix in sparse format with
 589455 rows (elements/itemsets/transactions) and
 1737 columns (items) and a density of 0.0005757052 

most frequent items:
                   A,C                    A,B                     C,F                     C,D
                  57894                   32150                   31367                   29434 
                  A,B,C                 (Other) 
                  29035                  409575 

element (itemset/transaction) length distribution:
sizes
     1 
589455 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       1       1       1       1       1 

includes extended item information - examples:
                                                                             labels
1 G,H,I,A,B,C,D,F,J
2 G,H,I,A,B,C,F
3 G,H,I,A,B,K,D

includes extended transaction information - examples:
  transactionID
1              
2             1
3             3

Next, I tried to run the Apriori algorithm:

basket_rules <- apriori(txn, parameter = list(sup = 1e-15, 
                                              conf = 1e-15, minlen = 2, target="rules"))

This is the output:

   Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target  ext
       0.01    0.1    1 none FALSE            TRUE       5   1e-15      2     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 0 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[1737 item(s), 589455 transaction(s)] done [0.20s].
sorting and recoding items ... [1737 item(s)] done [0.00s].
creating transaction tree ... done [0.16s].
checking subsets of size 1 done [0.00s].
writing ... [0 rule(s)] done [0.00s].
creating S4 object  ... done [0.04s].

Even with a ridiculously low support and confidence, no rules are generated...

summary(basket_rules)
set of 0 rules

Is this really because of my dataset? Or was there a mistake in my code?

Upvotes: 1

Views: 1119

Answers (2)

Arne

Reputation: 57

@Michael I am now fairly sure that something is wrong with the .csv file I am reading in. Since others have experienced similar problems, my guess is that this is a common cause of the error. Can you please describe what the .csv file should look like when it is read in?

When I run data <- read.csv("file.csv", header = TRUE, sep = ","), I get the following data.frame:

X     Prd
1     A
2     A,B
3     B,A
4     B
5     C

Is it correct that, if customer X has multiple products, they are all written in a single column? Or should they be written in separate columns?

Furthermore, when I run txn <- read.transactions(file="Versicherungen2_ItemList_Short.csv", rm.duplicates= TRUE, format="basket",sep=",",cols=1, skip=1) and then summary(txn), I see the following problem:

most frequent items:
A             B            C           A,B            B,A
1256          1235         456         235            125

(numbers are chosen randomly)

So the read.transactions function distinguishes between A,B and B,A, which means whole baskets are being treated as single items. So I am guessing there is something wrong with the .csv file.

Upvotes: 0

Michael Hahsler

Reputation: 3050

Your summary shows that the data is not read in correctly:

most frequent items:
                   A,C                    A,B                     C,F                     C,D
                  57894                   32150                   31367                   29434 
                  A,B,C                 (Other) 
                  29035                  409575 

It looks like "A,C" is read as a single item, when it should be two items, "A" and "C". The separator is not being applied. I assume this is because of quotation marks in the file. Make sure that Basket_List_13-08-2020.csv looks correct. Also, you need to skip the first line (the header) using skip = 1 when you read the transactions.
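
A minimal sketch of one way to get a clean basket file, assuming the quotation marks added by write.csv are indeed the cause. The toy data frame below copies the head() output from the question, and the temporary file and the support/confidence values are placeholders chosen for this small example, not recommendations for the real data:

```r
library(arules)

# Toy stand-in for df_itemList after the ID column has been dropped
# (values copied from the head() output in the question).
df_itemList <- data.frame(
  itemList = c("A,B,C", "C,D", "A,B", "A,B,C,D,E", "B,A,D", "A,C,D"),
  stringsAsFactors = FALSE
)

# Write one basket per line: no header, no row names, no quotation marks.
# write.csv() quotes character fields, which keeps "A,B,C" together as one field.
basket_file <- tempfile(fileext = ".csv")   # placeholder path for this sketch
writeLines(df_itemList$itemList, basket_file)

# The file has no ID column and no header line, so cols and skip are left unset.
txn <- read.transactions(basket_file, format = "basket", sep = ",",
                         rm.duplicates = TRUE)

itemLabels(txn)   # should list single products, not whole baskets
size(txn)         # number of items in each transaction

# Thresholds picked only so that this six-row toy example yields some rules.
basket_rules <- apriori(txn, parameter = list(support = 0.3, confidence = 0.5,
                                              minlen = 2, target = "rules"))
inspect(head(sort(basket_rules, by = "lift"), 5))
```

If summary(txn) now reports transaction sizes greater than 1 and the most frequent items are single products, the separator is working. If the product column is a factor rather than character, wrap it in as.character() before writing.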

Upvotes: 1
