Reputation: 147
This is a long question, so here is a TL;DR version:
I am implementing the apriori algorithm and it is slow. The slow part is generating Lk from Ck, where the code has to scan (more or less) the whole dataset to find out whether each candidate is frequent or not. How can I make this part faster? Is there any data structure that accelerates this repeated search through the dataset?
I have an assignment to write the apriori algorithm in Python. The constraints are not to use pandas and not to use any module that implements apriori, like apyori. Generating C1 and L1 is not a problem at all, and generating Ck from Lk-1 is OK too. The bottleneck of the whole thing is generating Lk from Ck: this step compares every candidate against the whole dataset, which takes ages for small minimum supports.
I spent some time searching for improved versions of apriori, and among those that I could understand, this one was the best (IMO): https://arxiv.org/pdf/1403.3948.pdf
The paper proposes keeping, for each item/itemset, a list of the transactions it appears in (let's call it found_in). With that in hand, we can reduce the number of searches when generating Lk from Ck, since we only need to scan the transactions mentioned in that list (found_in).
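To make the idea concrete, here is a minimal sketch of that scheme as I understand the paper (build_found_in and support_count are illustrative names I made up, not the paper's):

    # Vertical layout: map each item to the set of transaction IDs (TIDs)
    # it appears in. Built in a single pass over the dataset.
    def build_found_in(dataset: list) -> dict:
        found_in = {}
        for tid, transaction in enumerate(dataset):
            for item in transaction:
                found_in.setdefault(item, set()).add(tid)
        return found_in

    # The support of any candidate is then the size of the intersection of
    # its items' TID sets -- no scan over the transactions themselves.
    def support_count(candidate: tuple, found_in: dict) -> int:
        return len(set.intersection(*(found_in[item] for item in candidate)))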
I implemented it and it reduced the runtime by a factor of about 4, which is amazing. Yet it is not fast enough for the assignment, since I am supposed to extract 40,000 frequent patterns. So I am thinking maybe the algorithm is fine but the Python data structures I am using are too slow to keep up. So here are my questions:
- Are Python's data structures too slow for this? Should I use something lower-level (ctype, maybe)?
- Is there a better algorithm or data structure for this repeated search (not a ready-made module like apyori)?

I know this question takes a huge amount of time to investigate properly and I'm not expecting that. So any small tip is appreciated.
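(As a quick sanity check on the data-structure question, here is a micro-benchmark sketch; the sizes are arbitrary, but it shows why membership tests on plain lists hurt in a loop like mine:)

    from timeit import timeit

    transaction_list = list(range(1000))
    transaction_set = set(transaction_list)

    # 'in' on a list scans linearly; on a set it is a hash lookup
    print(timeit(lambda: 999 in transaction_list, number=10_000))
    print(timeit(lambda: 999 in transaction_set, number=10_000))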
The part of the code that is slow:
def gen_Lk(Ck: dict, dataset: list, min_support_count: int) -> dict:
    Lk = {}
    for candidate, TIDs in Ck.items():
        if len(TIDs) < min_support_count:
            continue
        new_TIDs = set()
        for TID in TIDs:
            comp_transaction = dataset[TID]
            # equivalent to: if candidate.issubset(dataset[TID])
            # this is the slowest part of the code -- how can I make it faster?
            if not any(item not in comp_transaction for item in candidate):
                new_TIDs.add(TID)
        if len(new_TIDs) < min_support_count:
            continue
        Lk[candidate] = new_TIDs
    return Lk
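For reference, here is a sketch of one variant I am considering (untested; gen_Lk_sets and dataset_sets are names I made up): convert each transaction to a frozenset once, up front, so each membership test is a hash lookup and the whole check collapses to issubset:

    # usage: dataset_sets = [frozenset(t) for t in dataset], computed once
    def gen_Lk_sets(Ck: dict, dataset_sets: list, min_support_count: int) -> dict:
        Lk = {}
        for candidate, TIDs in Ck.items():
            if len(TIDs) < min_support_count:
                continue
            candidate_set = frozenset(candidate)
            # frozenset.issubset is the same test as the any(...) above,
            # but each membership check is O(1) on average
            new_TIDs = {TID for TID in TIDs
                        if candidate_set.issubset(dataset_sets[TID])}
            if len(new_TIDs) >= min_support_count:
                Lk[candidate] = new_TIDs
        return Lk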
The whole code (sorry for not commenting it well):
from itertools import combinations
import pickle
def has_infrequent_subset(candidate: set, previous_L: list) -> bool:
    """
    A function to prune some of the candidates.
    Parameters
    ----------
    candidate -- a set to check whether all of its subsets are frequent or not.
        If any subset is not frequent, the function returns True,
        otherwise it returns False.
    previous_L -- a list of tuples to check the candidate's subsets against.
        An instance of previous_L can be found in the 'Examples' part.
    Returns
    -------
    A boolean value. True means some subsets of the candidate are not
    frequent with respect to previous_L, so this candidate should not be
    included in the final Ck result. False means all subsets are frequent and
    we shall include this candidate in our Ck result.
    Examples
    --------
    >>> previous_L = [(1,2,4),(2,3,6),(2,3,8),(2,6,7),(2,6,8),(3,4,5),(3,6,8)]
    >>> has_infrequent_subset((2,3,6,8), previous_L)
    False
    >>> has_infrequent_subset((2,3,6,7), previous_L)
    True
    """
    subsets = combinations(candidate, len(candidate) - 1)
    for subset in subsets:  # subset is a tuple
        if subset not in previous_L:
            return True
    return False
def apriori_gen(previous_L: dict) -> dict:
    """
    A function to generate candidates with respect to Lk-1 (previous_L). It
    tries to prune the results with the help of has_infrequent_subset(): for
    every new candidate found, if any of its subsets of length k-1 is not
    frequent in Lk-1 (previous_L), it will not be added to the result.
    """
    Ck = {}
    for item_1, TIDs1 in previous_L.items():
        for item_2, TIDs2 in previous_L.items():
            # join step: merge two itemsets that share the same k-2 prefix
            if item_1[:-1] == item_2[:-1] and item_1[-1] < item_2[-1]:
                new_item = tuple([*item_1, item_2[-1]])
                if has_infrequent_subset(new_item, previous_L):
                    continue
                # keep the smaller parent's TID set as an upper bound on
                # where the new candidate can appear
                new_TIDs = TIDs1 if len(TIDs1) < len(TIDs2) else TIDs2
                Ck[new_item] = new_TIDs
    return Ck
def generate_L1(dataset: list, min_support_count: int) -> dict:
    """
    Generates the L1 itemset from the given dataset with respect to
    min_support_count.
    Parameters
    ----------
    dataset -- a list of lists. Each inner list represents a transaction
        whose contents are the items bought in that transaction. The outer
        list is the dataset, which contains all transactions.
    min_support_count -- an integer used to check whether an item is
        frequent or not.
    Returns
    -------
    A dictionary with keys representing the L1 frequent items found and
    values representing the transactions each item appeared in. The values
    are sets. The values will be useful later, as this paper demonstrates:
    https://arxiv.org/pdf/1403.3948.pdf
    Examples
    --------
    >>> generate_L1([[1,2,3], [3,4,1], [3,4,5]], 3)
    {(3,): {0, 1, 2}}
    >>> generate_L1([[1,2,3], [3,4,1], [3,4,5]], 2)
    {(1,): {0, 1}, (3,): {0, 1, 2}, (4,): {1, 2}}
    """
    L1 = {}
    for TID, transaction in enumerate(dataset):
        for item in transaction:
            if (item,) not in L1:
                L1[(item,)] = set()
            L1[(item,)].add(TID)
    return {item: TIDs for item, TIDs in L1.items()
            if len(TIDs) >= min_support_count}
def gen_Lk(Ck: dict, dataset: list, min_support_count: int) -> dict:
    Lk = {}
    for candidate, TIDs in Ck.items():
        if len(TIDs) < min_support_count:
            continue
        new_TIDs = set()
        for TID in TIDs:
            comp_transaction = dataset[TID]
            # equivalent to: if candidate.issubset(dataset[TID])
            # this is the slowest part of the code -- how can I make it faster?
            if not any(item not in comp_transaction for item in candidate):
                new_TIDs.add(TID)
        if len(new_TIDs) < min_support_count:
            continue
        Lk[candidate] = new_TIDs
    return Lk
def apriori(min_support_count: int, dataset: list):
    L1 = generate_L1(dataset, min_support_count)
    L = {1: L1}
    for k in range(2, 1000):
        # stop when fewer than two frequent itemsets survive: no join possible
        if len(L[k-1]) < 2:
            break
        Ck = apriori_gen(L[k-1])
        L[k] = gen_Lk(Ck, dataset, min_support_count)
    return L

if __name__ == '__main__':
    with open(paths.q1.listed_ds, 'rb') as f:
        dataset = pickle.load(f)
    L = apriori(len(dataset)*0.03, dataset)
    result = []
    for _, Lk in L.items():
        result.extend(Lk.keys())
    print(result, len(result))
Upvotes: 0
Views: 2095
Reputation: 330
Probably a little late for your assignment but:
Compute the new TIDs already in apriori_gen:

new_TIDs = TIDs1.intersection(TIDs2)

and then just reuse them, like so:
def gen_Lk(Ck: dict, dataset: list, min_support_count: int) -> dict:
    Lk = {}
    for candidate, new_TIDs in Ck.items():
        if len(new_TIDs) < min_support_count:
            continue
        Lk[candidate] = new_TIDs
    return Lk
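For completeness, my reading of that change as a sketch (untested): since item_1 and item_2 share their k-2 prefix, a transaction contains the joined candidate exactly when it contains both parents, so TIDs1 & TIDs2 is the candidate's exact TID set and gen_Lk never needs to touch the dataset again:

def apriori_gen(previous_L: dict) -> dict:
    Ck = {}
    for item_1, TIDs1 in previous_L.items():
        for item_2, TIDs2 in previous_L.items():
            if item_1[:-1] == item_2[:-1] and item_1[-1] < item_2[-1]:
                new_item = tuple([*item_1, item_2[-1]])
                if has_infrequent_subset(new_item, previous_L):
                    continue
                # exact support set: intersection of the parents' TID sets
                Ck[new_item] = TIDs1 & TIDs2
    return Ck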
Upvotes: 1