Reputation: 359
I'm trying to use the arulesSequences package in R. I'm running into a problem I've seen a lot of people encounter but found no good answers for: going from a data frame or matrix to the transactions data type.
I've done this, as the documentation clearly states, for arules:
a_df3 <- data.frame(TID = c(1,1,2,2,2,3), item=c("a","b","a","b","c", "b"))
a_df3
trans4 <- as(split(a_df3[,"item"], a_df3[,"TID"]), "transactions")
This works okay. But if I try to do the same for a three-column data frame, everything goes haywire:
a_df4<-data.frame(SEQUENCEID=c("1","1","1","2","2","3","3"),
EVENTID=c("1","2","3","1","2","1","2"),
ITEM=c("a","b","a","c","a","a","b"))
a_df4
  SEQUENCEID EVENTID ITEM
1          1       1    a
2          1       2    b
3          1       3    a
4          2       1    c
5          2       2    a
6          3       1    a
7          3       2    b
Yes, there are duplicates, but this is exactly the point, isn't it? (To find frequent sequences.)
So now I coerce like this:
seqt<-as(split(myseq[,"ITEM"],myseq[,"SEQUENCEID"],myseq[,"EVENTID"]),"transactions")
And I get:
Error in asMethod(object) :
can not coerce list with transactions with duplicated items
I've been all over the place trying to get through this simple hurdle.
Every attempt either gives the error described above or, when there is no error, produces a transactions object with two columns, which of course cannot be read by arulesSequences because it needs three: sequence ID, event ID, and items.
I don't think my data structure could be any clearer. The sequence is the customer number, the event ID is the purchase number, and the items are, well, items.
Any help is appreciated, including the structure as() expects to see so that the coercion works correctly.
Upvotes: 2
Views: 2635
Reputation: 65
Try this:
# a_df4 is the three-column data frame from the question
trans4 <- as(a_df4[,"ITEM"], "transactions")
trans4@itemsetInfo$sequenceID = a_df4$SEQUENCEID
trans4@itemsetInfo$eventID = a_df4$EVENTID
transSeq = as(trans4, "timedsequences")
Upvotes: 2
Reputation: 37
It's been a while since this question was asked, but I'll try to answer it anyway. The error seems to occur because there are identical records of the following type:
  SEQUENCEID EVENTID ITEM
1          1       1    a
3          1       1    a
4          2       1    c
Checking for distinct records before splitting and converting to transactions might solve the problem.
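A minimal base-R sketch of that check, assuming a small made-up data frame (the name dup_df and its rows are hypothetical, chosen so that one record is repeated exactly):

```r
# Hypothetical data: the second row repeats the first row exactly.
dup_df <- data.frame(
  SEQUENCEID = c(1, 1, 2),
  EVENTID    = c(1, 1, 1),
  ITEM       = c("a", "a", "c")
)

# unique() on a data frame keeps the first occurrence of each distinct row.
distinct_df <- unique(dup_df)

# duplicated() flags the rows that unique() would drop.
n_dropped <- sum(duplicated(dup_df))
```

Here distinct_df is what you would then split and coerce. Note that this only removes rows that are identical in all three columns; the same item appearing in different events of one sequence is not affected.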
Upvotes: 0
Reputation: 77454
arules treats transactions as sets, not as sequences.
It can detect frequent itemsets but probably not sequences.
Checking for duplicates is a safeguard against using it incorrectly: it ignores multiplicity and order, so a repeated item of the same kind would be lost information.
The transactions class represents transaction data used for mining itemsets or rules. It is a direct extension of class itemMatrix to store a binary incidence matrix, item labels, and optionally transaction IDs and user IDs.
(from the documentation, emphasis added)
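To illustrate with the question's data, here is a base-R sketch (no arules needed) of where the duplicate comes from. In split(x, f, drop), the third positional argument is drop, not a second grouping factor, so the call in the question appears to group by SEQUENCEID only. The variable names by_seq and by_event are mine, for illustration:

```r
a_df4 <- data.frame(
  SEQUENCEID = c("1", "1", "1", "2", "2", "3", "3"),
  EVENTID    = c("1", "2", "3", "1", "2", "1", "2"),
  ITEM       = c("a", "b", "a", "c", "a", "a", "b")
)

# Grouping by sequence only: sequence "1" becomes c("a", "b", "a"),
# so item "a" occurs twice inside one group.
by_seq <- split(a_df4$ITEM, a_df4$SEQUENCEID)
has_dup_by_seq <- vapply(by_seq, anyDuplicated, integer(1)) > 0

# Grouping by (sequence, event): pass both keys as a list.
# drop = TRUE discards key combinations that never occur.
by_event <- split(a_df4$ITEM,
                  list(a_df4$SEQUENCEID, a_df4$EVENTID),
                  drop = TRUE)
has_dup_by_event <- vapply(by_event, anyDuplicated, integer(1)) > 0
```

With one group per (sequence, event) pair, no group contains a repeated item, which is the set-like shape that transactions expects. The sequence and event IDs still have to be attached separately for arulesSequences to use them.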
Upvotes: 0