Magic8ball

Reputation: 145

Applying rules generated from arules in R to new transactions

My goal is to use rules generated by the R package arules to predict the topic of each transaction (each transaction has exactly one topic), where a transaction is the set of words in a document. I have a training set, trans.train (used to create the rules), and a test set, trans.test (whose topics I want to predict). I would also like to evaluate these predictions, i.e. measure the percentage of test transactions for which the right-hand side of the applied rule is the correct topic.

I am able to ensure that the right-hand side of each rule is a topic (like topic=earn) and that the left-hand side contains only other words from the document. So all of my rules have the form:

{word1,...,wordN} -> {topic=topic1}

I have sorted the rules by confidence and want to apply them to trans.test so that, for each transaction, the matching rule with the highest confidence predicts the topic, but I can't figure out how to do this from the documentation.

Are there any ideas on how I might implement this? I have seen the arulesCBA package, but it implements a more complex algorithm whereas I only want to use the highest confidence rule as my predictor of the topic.

Code that generates the transactions:

library(arules)
#load data into R
filename = "C:/Users/sterl_000/Desktop/lab2file.csv"
data = read.csv(filename,header=TRUE,sep="\t")
#Get the number of columns in the matrix
col = dim(data)[2]
#Turn into logical matrix
data[,2:col]=(data[,2:col]>0)

#define % of training and test set
train_pct = 0.8
bound <- floor((nrow(data)*train_pct))    
#randomly permute rows
data <- data[sample(nrow(data)), ]   
#get training data    
data.train <- data[1:bound, ]
#get test data             
data.test <- data[(bound+1):nrow(data),]

#Turn into transaction format
trans.train = as(data.train,"transactions")
trans.test = as(data.test,"transactions")
#Create list of unique topics in 'topic=earn' format
#Allows us to specify only the topic label as the right hand side
uni_topics = paste0('topic=',unique(data[,1]))

#Get association rules
rules = apriori(trans.train, 
    parameter=list(support = 0.02,target= "rules", confidence = 0.5), 
    appearance = list(rhs = uni_topics,default='lhs'))

#Sort association rules by confidence
rules = sort(rules,by="confidence")

#Predict the right hand side (topic=...) for each transaction in trans.test based on the sorted rules

An example transaction:

> inspect(trans.train[3])

    items          transactionID
[1] {topic=coffee,              
     current,                   
     meet,                      
     group,                     
     statement,                 
     quota,                     
     organ,                     
     brazil,                    
     import,                    
     around,                    
     five,                      
     intern,                    
     produc,                    
     coffe,                     
     institut,                  
     reduc,                     
     intent,                    
     consid}                8760 

An example rule:

> inspect(rules[1])
    lhs       rhs          support    confidence lift    
[1] {qtli} => {topic=earn} 0.03761135 1          2.871171

Upvotes: 4

Views: 3279

Answers (2)

Ian Stenbit

Reputation: 46

In its upcoming release, the R package arulesCBA supports this type of functionality, should you ever need it again in the future.

In the current development version, arulesCBA has a function called CBA_ruleset which accepts a sorted set of rules and returns a CBA classifier object.

Upvotes: 1

AutoMiner

Reputation: 80

I doubt that association rules for words and a simple confidence measure are ideal for predicting document topics.

That being said, try using the is.subset function. I can't reproduce your example without the .csv file, but the following code should give you your predicted topic for trans.train[3] based on the highest confidence.

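# sort rules by confidence (already done in the question; repeated for completeness)
rules <- sort(rules, decreasing = TRUE, by = "confidence")

# find all rules whose lhs is contained in the training example;
# sparse = FALSE asks for a plain logical matrix (rules in rows, transactions in columns),
# since newer arules versions return a sparse matrix by default
rulesMatch <- is.subset(lhs(rules), trans.train[3], sparse = FALSE)

# keep only the applicable rules (column 1 is the single transaction passed in)
applicable <- rules[rulesMatch[, 1]]

# the first rule has the highest confidence since they are sorted
prediction <- applicable[1]
inspect(rhs(prediction))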
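
To score a whole test set rather than a single transaction, the same idea can be applied to all transactions at once. The following is only a minimal sketch on made-up toy data: the `toy` data frame, its three words, the train/test split and the support threshold are assumptions for illustration, not the asker's data. It also sets `minlen = 2` (my choice, not in the question) so that rules with an empty left-hand side do not match every transaction; transactions that no rule covers are left as `NA` and counted as wrong.

```r
library(arules)

# Hypothetical toy data: one topic column plus logical word-indicator columns
toy <- data.frame(
  topic  = factor(c("earn", "earn", "earn", "coffee", "coffee", "coffee",
                    "earn", "coffee", "coffee")),
  qtli   = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE),
  brazil = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE),
  import = c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)
)

# Convert once, then split, so both sets share the same item encoding
trans     <- as(toy, "transactions")
train_idx <- 1:6
test_idx  <- 7:9
train     <- trans[train_idx]
test      <- trans[test_idx]

# Mine rules with a topic on the right-hand side only, as in the question
topics <- paste0("topic=", levels(toy$topic))
rules <- apriori(train,
  parameter  = list(support = 0.1, confidence = 0.5,
                    minlen = 2, target = "rules"),
  appearance = list(rhs = topics, default = "lhs"),
  control    = list(verbose = FALSE))
rules <- sort(rules, by = "confidence")

# hits[i, j] is TRUE when the lhs of rule i is contained in test transaction j
hits <- is.subset(lhs(rules), test, sparse = FALSE)

# Rules are sorted, so the first matching row is the highest-confidence rule;
# which(...)[1] is NA when no rule applies to a transaction
first_hit <- apply(hits, 2, function(col) which(col)[1])

# Look up the predicted topic, stripping the braces from labels like "{topic=earn}"
rhs_labels <- gsub("[{}]", "", labels(rhs(rules)))
predicted  <- rhs_labels[first_hit]

# Compare against the known topics; uncovered transactions count as errors
actual   <- paste0("topic=", toy$topic[test_idx])
accuracy <- mean(!is.na(predicted) & predicted == actual)
print(data.frame(actual, predicted))
print(accuracy)
```

With the asker's objects, `train` and `test` would correspond to trans.train and trans.test, and the true topics would come from the first column of data.test. Ties between rules of equal confidence are broken by whatever order `sort` leaves them in, so a secondary sort key (for example support) may be worth adding.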

Upvotes: 3
