Reputation: 57
I generated a dataset holding two distinct columns: an ID column associated to a customer and another column associated to his/her active products:
head(df_itemList)
ID PRD_LISTE
1 1 A,B,C
3 2 C,D
4 3 A,B
5 4 A,B,C,D,E
7 5 B,A,D
8 6 A,C,D
I only selected customers that own more than one product. In total I have 589.454 rows and there are 16 different products.
Next, I wrote the data.frame into an csv-file like this:
df_itemList$ID <- NULL
colnames(df_itemList) <- c("itemList")
write.csv(df_itemList, "Basket_List_13-08-2020.csv", row.names = TRUE)
Then, I converted the csv-file into a basket format in order to apply the apriori algorithm as implemented in the arules-package.
library(arules)
txn <- read.transactions(file="Basket_List_13-08-2020.csv",
rm.duplicates= TRUE, format="basket",sep=",",cols=1)
txn@itemInfo$labels <- gsub("\"","",txn@itemInfo$labels)
The summary-function yields the following output:
summary(txn)
transactions as itemMatrix in sparse format with
589455 rows (elements/itemsets/transactions) and
1737 columns (items) and a density of 0.0005757052
most frequent items:
A,C A,B C,F C,D
57894 32150 31367 29434
A,B,C (Other)
29035 409575
element (itemset/transaction) length distribution:
sizes
1
589455
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 1 1 1 1 1
includes extended item information - examples:
labels
1 G,H,I,A,B,C,D,F,J
2 G,H,I,A,B,C,F
3 G,H,I,A,B,K,D
includes extended transaction information - examples:
transactionID
1
2 1
3 3
Now, I tried to run the apriori-algorithm:
basket_rules <- apriori(txn, parameter = list(sup = 1e-15,
conf = 1e-15, minlen = 2, target="rules"))
This is the output:
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support minlen maxlen target ext
0.01 0.1 1 none FALSE TRUE 5 1e-15 2 10 rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 0
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[1737 item(s), 589455 transaction(s)] done [0.20s].
sorting and recoding items ... [1737 item(s)] done [0.00s].
creating transaction tree ... done [0.16s].
checking subsets of size 1 done [0.00s].
writing ... [0 rule(s)] done [0.00s].
creating S4 object ... done [0.04s].
Even with a ridiculously low support and confidence, no rules are generated...
summary(basket_rules)
set of 0 rules
Is this really because of my dataset? Or was there a mistake in my code?
Upvotes: 1
Views: 1119
Reputation: 57
@Michael I am quite positive now that there is something wrong with the .csv-file I am reading in. Since there are others who experienced similar problems my guess is that this is the common reason for error. Can you please describe how the .csv-file should look like when read in?
When typing in data <- read.csv("file.csv", header = TRUE, sep = ",")
I get the following data.frame:
X Prd
1 A
2 A,B
3 B,A
4 B
5 C
Is it correct that - if there are multiple products for a customer X - these products are all written in a single column? Or should be written in different columns?
Furthermore, when writing txn <- read.transactions(file="Versicherungen2_ItemList_Short.csv", rm.duplicates= TRUE, format="basket",sep=",",cols=1, skip=1)
and summary(txn)
I see the following problem:
most frequent items:
A B C A,B B,A
1256 1235 456 235 125
(numbers are chosen randomly)
So the read.transaction function differentiates between A,B and B,A... So I am guessing there is something wrong with the .csv-file.
Upvotes: 0
Reputation: 3050
Your summary shows that the data is not read in correctly:
most frequent items:
A,C A,B C,F C,D
57894 32150 31367 29434
A,B,C (Other)
29035 409575
Looks like "A,C" is read as an item, but it should be two items "A" and "C". The separating character does not work. I assume that could be because of quotation marks in the file. Make sure that Basket_List_13-08-2020.csv
looks correct. Also, you need to skip the first line (headers) using skip = 1
when you read the transactions.
Upvotes: 1