Reputation: 147
This is a long question, so here is a TL;DR version:
I am implementing the apriori algorithm and it is slow. The slow part is generating Lk from Ck, where the code has to scan (more or less) the whole dataset to find out whether each candidate is frequent or not. How can I make this part faster? Is there any data structure that accelerates this repeated search through the dataset?
I have an assignment to write the apriori algorithm in Python. The constraints are not to use pandas and not to use any module that implements apriori, like apyori. Generating C1 and L1 is not a problem at all, and generating Ck from Lk-1 is OK too. The bottleneck of the whole thing is generating Lk from Ck: this step compares every candidate against the whole dataset, which takes ages for small minimum supports.
I spent some time searching for improved versions of apriori, and among those that I could understand, this one was the best (IMO): https://arxiv.org/pdf/1403.3948.pdf
The paper proposes keeping, for each item/itemset, a list of the transactions it appears in (let's call it found_in). With that in hand, we can reduce the number of searches when generating Lk from Ck, since we only need to scan the transactions mentioned in that list (found_in).
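To make the idea concrete, here is a minimal sketch of that scheme as I understand the paper (build_found_in and support_count are illustrative names I made up, not the paper's):

    # Vertical layout: map each item to the set of transaction IDs (TIDs)
    # it appears in. Built in a single pass over the dataset.
    def build_found_in(dataset: list) -> dict:
        found_in = {}
        for tid, transaction in enumerate(dataset):
            for item in transaction:
                found_in.setdefault(item, set()).add(tid)
        return found_in

    # The support of any candidate is then the size of the intersection of
    # its items' TID sets -- no scan over the transactions themselves.
    def support_count(candidate: tuple, found_in: dict) -> int:
        return len(set.intersection(*(found_in[item] for item in candidate)))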
I implemented it and it reduced the runtime by a factor of about 4, which is amazing. Yet it is not fast enough for the assignment, since I am supposed to extract 40,000 frequent patterns. So I am thinking maybe the algorithm is fine but the Python data structures I am using are too slow to keep up. So here are my questions:
- Are Python's data structures too slow for this? Should I use something lower-level (ctype, maybe)?
- Is there a better algorithm or data structure for this repeated search (not a ready-made module like apyori)?

I know this question takes a huge amount of time to investigate properly and I'm not expecting that. So any small tip is appreciated.
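(As a quick sanity check on the data-structure question, here is a micro-benchmark sketch; the sizes are arbitrary, but it shows why membership tests on plain lists hurt in a loop like mine:)

    from timeit import timeit

    transaction_list = list(range(1000))
    transaction_set = set(transaction_list)

    # 'in' on a list scans linearly; on a set it is a hash lookup
    print(timeit(lambda: 999 in transaction_list, number=10_000))
    print(timeit(lambda: 999 in transaction_set, number=10_000))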
The part of the code that is slow:
def gen_Lk(Ck: dict, dataset: list, min_support_count: int) -> dict:
    Lk = {}
    for candidate, TIDs in Ck.items():
        if len(TIDs) < min_support_count:
            continue
        new_TIDs = set()
        for TID in TIDs:
            comp_transaction = dataset[TID]
            # equivalent to: if candidate.issubset(dataset[TID])
            # this is the slowest part of the code -- how can I make it faster?
            if not any(item not in comp_transaction for item in candidate):
                new_TIDs.add(TID)
        if len(new_TIDs) < min_support_count:
            continue
        Lk[candidate] = new_TIDs
    return Lk
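For reference, here is a sketch of one variant I am considering (untested; gen_Lk_sets and dataset_sets are names I made up): convert each transaction to a frozenset once, up front, so each membership test is a hash lookup and the whole check collapses to issubset:

    # usage: dataset_sets = [frozenset(t) for t in dataset], computed once
    def gen_Lk_sets(Ck: dict, dataset_sets: list, min_support_count: int) -> dict:
        Lk = {}
        for candidate, TIDs in Ck.items():
            if len(TIDs) < min_support_count:
                continue
            candidate_set = frozenset(candidate)
            # frozenset.issubset is the same test as the any(...) above,
            # but each membership check is O(1) on average
            new_TIDs = {TID for TID in TIDs
                        if candidate_set.issubset(dataset_sets[TID])}
            if len(new_TIDs) >= min_support_count:
                Lk[candidate] = new_TIDs
        return Lk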
The whole code (sorry for not commenting it well):
from itertools import combinations
import pickle
def has_infrequent_subset(candidate: set, previous_L: list) -> bool:
    """
    A function to prune some of the candidates.
    Parameters
    ----------
    candidate -- a set to check whether all of its subsets are frequent or not.
        If any subset is not frequent, the function returns True,
        otherwise it returns False.
    previous_L -- a list of tuples to check the candidate's subsets against.
        An instance of previous_L can be found in the 'Examples' part.
    Returns
    -------
    A boolean value. True means some subsets of the candidate are not
    frequent with respect to previous_L, so this candidate should not be
    included in the final Ck result. False means all subsets are frequent and
    we shall include this candidate in our Ck result.
    Examples
    --------
    >>> previous_L = [(1,2,4),(2,3,6),(2,3,8),(2,6,7),(2,6,8),(3,4,5),(3,6,8)]
    >>> has_infrequent_subset((2,3,6,8), previous_L)
    False
    >>> has_infrequent_subset((2,3,6,7), previous_L)
    True
    """
    subsets = combinations(candidate, len(candidate) - 1)
    for subset in subsets:  # subset is a tuple
        if subset not in previous_L:
            return True
    return False
def apriori_gen(previous_L: dict) -> dict:
    """
    A function to generate candidates with respect to Lk-1 (previous_L). It
    tries to prune the results with the help of has_infrequent_subset(): for
    every new candidate found, if any of its subsets of length k-1 is not
    frequent in Lk-1 (previous_L), it will not be added to the result.
    """
    Ck = {}
    for item_1, TIDs1 in previous_L.items():
        for item_2, TIDs2 in previous_L.items():
            # join step: merge two itemsets that share the same k-2 prefix
            if item_1[:-1] == item_2[:-1] and item_1[-1] < item_2[-1]:
                new_item = tuple([*item_1, item_2[-1]])
                if has_infrequent_subset(new_item, previous_L):
                    continue
                # keep the smaller parent's TID set as an upper bound on
                # where the new candidate can appear
                new_TIDs = TIDs1 if len(TIDs1) < len(TIDs2) else TIDs2
                Ck[new_item] = new_TIDs
    return Ck
def generate_L1(dataset: list, min_support_count: int) -> dict:
    """
    Generates the L1 itemset from the given dataset with respect to
    min_support_count.
    Parameters
    ----------
    dataset -- a list of lists. Each inner list represents a transaction
        whose contents are the items bought in that transaction. The outer
        list is the dataset, which contains all transactions.
    min_support_count -- an integer used to check whether an item is
        frequent or not.
    Returns
    -------
    A dictionary with keys representing the L1 frequent items found and
    values representing the transactions each item appeared in. The values
    are sets. The values will be useful later, as this paper demonstrates:
    https://arxiv.org/pdf/1403.3948.pdf
    Examples
    --------
    >>> generate_L1([[1,2,3], [3,4,1], [3,4,5]], 3)
    {(3,): {0, 1, 2}}
    >>> generate_L1([[1,2,3], [3,4,1], [3,4,5]], 2)
    {(1,): {0, 1}, (3,): {0, 1, 2}, (4,): {1, 2}}
    """
    L1 = {}
    for TID, transaction in enumerate(dataset):
        for item in transaction:
            if (item,) not in L1:
                L1[(item,)] = set()
            L1[(item,)].add(TID)
    return {item: TIDs for item, TIDs in L1.items()
            if len(TIDs) >= min_support_count}
def gen_Lk(Ck: dict, dataset: list, min_support_count: int) -> dict:
    Lk = {}
    for candidate, TIDs in Ck.items():
        if len(TIDs) < min_support_count:
            continue
        new_TIDs = set()
        for TID in TIDs:
            comp_transaction = dataset[TID]
            # equivalent to: if candidate.issubset(dataset[TID])
            # this is the slowest part of the code -- how can I make it faster?
            if not any(item not in comp_transaction for item in candidate):
                new_TIDs.add(TID)
        if len(new_TIDs) < min_support_count:
            continue
        Lk[candidate] = new_TIDs
    return Lk
def apriori(min_support_count: int, dataset: list):
    L1 = generate_L1(dataset, min_support_count)
    L = {1: L1}
    for k in range(2, 1000):
        # stop when fewer than two frequent itemsets survive: no join possible
        if len(L[k-1]) < 2:
            break
        Ck = apriori_gen(L[k-1])
        L[k] = gen_Lk(Ck, dataset, min_support_count)
    return L

if __name__ == '__main__':
    with open(paths.q1.listed_ds, 'rb') as f:
        dataset = pickle.load(f)
    L = apriori(len(dataset)*0.03, dataset)
    result = []
    for _, Lk in L.items():
        result.extend(Lk.keys())
    print(result, len(result))
Upvotes: 0
Views: 2095
Reputation: 330
Probably a little late for your assignment but:
Compute the new TIDs already in apriori_gen:

new_TIDs = TIDs1.intersection(TIDs2)

and then just reuse them, like so:
def gen_Lk(Ck: dict, dataset: list, min_support_count: int) -> dict:
    Lk = {}
    for candidate, new_TIDs in Ck.items():
        if len(new_TIDs) < min_support_count:
            continue
        Lk[candidate] = new_TIDs
    return Lk
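For completeness, my reading of that change as a sketch (untested): since item_1 and item_2 share their k-2 prefix, a transaction contains the joined candidate exactly when it contains both parents, so TIDs1 & TIDs2 is the candidate's exact TID set and gen_Lk never needs to touch the dataset again:

def apriori_gen(previous_L: dict) -> dict:
    Ck = {}
    for item_1, TIDs1 in previous_L.items():
        for item_2, TIDs2 in previous_L.items():
            if item_1[:-1] == item_2[:-1] and item_1[-1] < item_2[-1]:
                new_item = tuple([*item_1, item_2[-1]])
                if has_infrequent_subset(new_item, previous_L):
                    continue
                # exact support set: intersection of the parents' TID sets
                Ck[new_item] = TIDs1 & TIDs2
    return Ck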
Upvotes: 1