Amir

Reputation: 1087

Orange3 Frequent Itemsets performance

I just realized that the performance of Frequent Itemsets is strongly correlated with the number of items per basket. I ran the following code:

%%time
from orangecontrib.associate.fpgrowth import frequent_itemsets

# a single basket with 21 items
T = [[1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]]
itemsets = frequent_itemsets(T, 1)
a = list(itemsets)

As I increased the number of items in T, the running time increased as follows:

Items   Running time
21        3.39s
22        9.14s
23        15.8s 
24        37.4s
25        1.2 min
26        10 min
27        35 min
28        95 min

For 31 items it took 10 hours without returning any results. I am wondering if there is any way to run it for more than 31 items in reasonable time? In this case I only need pairwise itemsets (A --> B), while my understanding is that frequent_itemsets counts all possible combinations, which is probably why its running time is so strongly correlated with the number of items. Is there any way to tell the method to limit the itemset size, e.g. to count only pairs instead of all combinations?

Upvotes: 0

Views: 575

Answers (2)

Phil

Reputation: 3520

You could use other software that allows you to specify constraints on the itemsets, such as a length constraint. For example, you could consider the SPMF data mining library (disclosure: I am the founder), which offers about 120 algorithms for itemset and pattern mining. It lets you use FPGrowth with a length constraint, so you could, for example, mine only the patterns with 2 or 3 items. You could also try other features such as mining association rules. The software works on text files, can be called from the command line, and is quite fast.

Upvotes: 2

K3---rnc

Reputation: 7049

A database of a single transaction of 21 items results in 2097151 (2^21 − 1) itemsets.

>>> T = [list(range(21))]
>>> len(list(frequent_itemsets(T, 1)))
2097151

Perhaps, instead of an absolute support as low as a single transaction (1), choose the support to be e.g. 5% of all transactions (.05). You should also limit the returned itemsets to those containing exactly two items (antecedent and consequent for later association rule discovery), but the runtime will still be high due to, as you note, sheer combinatorics.

len([itemset
     for itemset, support in frequent_itemsets(T, 1)
     if len(itemset) == 2])

At the moment, there is no such filtering available inside the algorithm, but the source is open to tinkering.
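If you truly only need pairs, one workaround outside the library (a plain-stdlib sketch, not part of Orange3-Associate; the function name frequent_pairs is made up here) is to count co-occurring pairs directly. Each basket of n items then contributes only n*(n-1)/2 pair updates instead of forcing an enumeration of up to 2^n itemsets:

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Count 2-item co-occurrences directly, skipping larger itemsets."""
    counts = Counter()
    for basket in transactions:
        # sorted(set(...)) deduplicates items and makes each pair's
        # key canonical, e.g. always (3, 7) rather than (7, 3)
        counts.update(combinations(sorted(set(basket)), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}

# A basket of 31 items is no longer a problem: only 31*30/2 = 465 pairs
T = [list(range(31)), [1, 2, 3]]
pairs = frequent_pairs(T, 1)
print(len(pairs))        # 465 distinct pairs
print(pairs[(1, 2)])     # 2 — the pair occurs in both baskets
```

This gives you the pair supports needed for A --> B rules, at the cost of losing the general FP-growth machinery.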

Upvotes: 1
