Common patterns in a database

Question

I need to find common patterns in a database of sequences of events. So far I have looked at the longest common substring problem and at Python implementations of it as a possible solution.

Note that I am not searching for the longest common substring only: I accept shorter common substrings appearing frequently in the database.

Can you suggest some algorithm, implementation tricks or general advice about this problem?

Phil · Accepted Answer

The previous answer suggested Apriori. But Apriori is inappropriate if you want to find frequent sequences, because it does not take the time (the order of events) into account. Apriori is also an inefficient algorithm.

If you want to find subsequences that are common to several sequences, it would be more appropriate to use a sequential pattern mining algorithm such as PrefixSpan or SPAM.
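To give an idea of how a sequential pattern mining algorithm works, here is a minimal Python sketch in the spirit of PrefixSpan. It is not the SPMF implementation: it assumes each sequence is a plain list of single events (no itemsets, no timestamps), and the names `prefixspan`, `mine` and `min_support` are only illustrative. The support of a pattern is the number of sequences that contain it as a subsequence, gaps allowed.

```python
from collections import defaultdict


def prefixspan(sequences, min_support):
    """Return {pattern: support} for every frequent subsequence.

    Simplified sketch: one event per position, support counted per sequence.
    """
    results = {}

    def mine(prefix, projected):
        # projected holds (sequence index, start position) pairs: the part of
        # each sequence that remains after the first match of the prefix.
        extensions = defaultdict(list)
        for seq_idx, start in projected:
            seq = sequences[seq_idx]
            seen = set()
            for pos in range(start, len(seq)):
                item = seq[pos]
                if item not in seen:
                    # Keep only the first occurrence, so each sequence
                    # contributes at most once to the support of an item.
                    seen.add(item)
                    extensions[item].append((seq_idx, pos + 1))
        for item, new_projected in extensions.items():
            if len(new_projected) >= min_support:
                pattern = prefix + (item,)
                results[pattern] = len(new_projected)
                # Grow the pattern using only the projected database.
                mine(pattern, new_projected)

    mine((), [(i, 0) for i in range(len(sequences))])
    return results


if __name__ == "__main__":
    database = [list("abc"), list("acb"), list("ab"), list("bc")]
    for pattern, support in sorted(prefixspan(database, 3).items()):
        print(pattern, support)
```

Lowering `min_support` returns shorter or rarer patterns as well, which matches the requirement of accepting common patterns that are not the longest ones.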

If you want to make predictions, another option would be to use a sequential rule mining algorithm.
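As a rough illustration of what a sequential rule is, the Python sketch below counts rules of the form "a is followed later by b" and keeps those with enough support and confidence. This is a deliberately simplified assumption (single events on each side of the rule) and not one of the algorithms offered in SPMF; the function name and parameters are hypothetical.

```python
from collections import Counter


def sequential_rules(sequences, min_support, min_confidence):
    """Return {(a, b): (support, confidence)} for rules "a, then later b".

    support    = number of sequences where some a occurs before some b
    confidence = support / number of sequences that contain a
    """
    antecedent_count = Counter()
    rule_count = Counter()
    for seq in sequences:
        first, last = {}, {}
        for pos, item in enumerate(seq):
            first.setdefault(item, pos)  # earliest position of each event
            last[item] = pos             # latest position of each event
        for a in first:
            antecedent_count[a] += 1
            for b in last:
                # a precedes b in this sequence if the first a comes
                # before the last b.
                if a != b and first[a] < last[b]:
                    rule_count[(a, b)] += 1
    rules = {}
    for (a, b), support in rule_count.items():
        confidence = support / antecedent_count[a]
        if support >= min_support and confidence >= min_confidence:
            rules[(a, b)] = (support, confidence)
    return rules
```

A rule with high confidence can then be used for prediction: once `a` has been observed in a new sequence, `b` is expected to follow.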

I have open-source Java implementations of sequential pattern mining and sequential rule mining algorithms that you can download from my website: http://www.philippe-fournier-viger.com/spmf/

I don't think that you could process 8 GB of data in one shot with these algorithms. But it could be a starting point. Actually, some of these algorithms could be adapted for the case of very large databases by implementing a disk-based strategy.
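One simple ingredient of a disk-based strategy is to stream the file instead of loading it, for example to compute the support of single events in a first pass and discard the infrequent ones before mining. The Python sketch below assumes a hypothetical text format with one sequence per line and events separated by whitespace; it only illustrates the idea and is not how any particular implementation does it.

```python
from collections import Counter


def frequent_items_streaming(path, min_support):
    """Count in how many sequences each event occurs, reading line by line.

    Assumed file format: one sequence per line, events separated by spaces.
    Only the counters are kept in memory, never the whole database.
    """
    support = Counter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # set() ensures an event is counted once per sequence.
            support.update(set(line.split()))
    return {item: n for item, n in support.items() if n >= min_support}
```

Events that fail the support threshold cannot appear in any frequent pattern, so a second pass can drop them and work on a much smaller database.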
