Reputation: 1087
I just realized that the performance of frequent_itemsets is strongly correlated with the number of items per basket. I ran the following code:
%%time
from orangecontrib.associate.fpgrowth import frequent_itemsets

# a single basket of items, mined with an absolute minimum support of 1 transaction
T = [[1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]]
itemsets = frequent_itemsets(T, 1)  # returns a generator
a = list(itemsets)                  # consuming it does the actual mining
As I increased the number of items in T, the running time increased as follows:
Items   Running time
21      3.39 s
22      9.14 s
23      15.8 s
24      37.4 s
25      1.2 min
26      10 min
27      35 min
28      95 min
For 31 items it ran for 10 hours without returning any results. Is there any way to run it on more than 31 items in reasonable time? In my case I only need pairwise itemsets (A --> B), while my understanding is that frequent_itemsets counts all possible combinations, which is probably why its running time is so strongly correlated with the number of items. Is there any way to tell the method to limit the size of the itemsets it counts, e.g. just pairs instead of all combinations?
Upvotes: 0
Views: 575
Reputation: 3520
You could use other software that allows you to specify constraints on the itemsets, such as a length constraint. For example, you could consider the SPMF data mining library (disclosure: I am the founder), which offers about 120 algorithms for itemset and pattern mining. It will let you run FPGrowth with a length constraint, so you could, for example, mine only the patterns with 2 or 3 items. You could also try other features such as mining association rules. The software works on text files, can be called from the command line, and is quite fast.
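As a rough sketch of such a command-line call, invoked here from Python via subprocess (the algorithm name, the input file format, and the trailing maximum-pattern-length argument are assumptions; check the SPMF documentation for the exact parameters):

import subprocess

# Run SPMF's FP-Growth itemset miner on a transaction file with 5% minimum
# support; the final "2" (assumed maximum itemset length parameter) would
# restrict the output to pairs -- verify this against the SPMF docs.
subprocess.run([
    "java", "-jar", "spmf.jar", "run", "FPGrowth_itemsets",
    "input.txt", "output.txt", "5%", "2",
])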
Upvotes: 2
Reputation: 7049
A database of a single transaction of 21 items results in 2^21 - 1 = 2097151 itemsets.
>>> from orangecontrib.associate.fpgrowth import frequent_itemsets
>>> T = [list(range(21))]
>>> len(list(frequent_itemsets(T, 1)))
2097151
Perhaps instead of absolute support set as low as a single transaction (1), choose the support to be e.g. 5% of all transactions (.05).
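For example, assuming T here stands for your full transaction database rather than the single toy basket above:

itemsets = frequent_itemsets(T, .05)  # keep only itemsets present in at least 5% of transactions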
You should also limit the returned itemsets to contain exactly two items (antecedent and consequent for later association rule discovery), but the runtime will still be high due to, as you understand, sheer combinatorics.
len([itemset
     for itemset, support in frequent_itemsets(T, 1)
     if len(itemset) == 2])
At the moment, there is no such filtering available inside the algorithm, but the source is open to tinkering.
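If pairs are really all you need, a minimal sketch that sidesteps the FP-growth machinery and counts 2-itemset supports directly (assuming transaction items are orderable, e.g. ints or strings) could look like this:

from collections import Counter
from itertools import combinations

def pairwise_supports(transactions, min_support=1):
    # Count support of 2-itemsets only, never materializing larger combinations
    counts = Counter()
    for t in transactions:
        for pair in combinations(sorted(set(t)), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}

T = [list(range(21))]
print(len(pairwise_supports(T)))  # C(21, 2) = 210 pairs instead of 2097151 itemsets

This grows quadratically with the number of items per basket rather than exponentially.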
Upvotes: 1