sshussain270
sshussain270

Reputation: 1865

How to find the longest common substring of multiple strings?

I am writing a python script where I have multiple strings.

For example:

x = "brownasdfoersjumps"
y = "foxsxzxasis12sa[[#brown"
z = "thissasbrownxc-34a@s;"

In all these three strings, they have one sub string in common which is brown. I want to search it in a way that I want to create a dictionary as:

dict = {[commonly occuring substring] => 
           [total number of occurrences in the strings provided]}

What would be the best way of doing that? Considering that I will have more than 200 strings each time, what would be an easy/efficient way of doing it?

Upvotes: 5

Views: 7818

Answers (2)

Safir Motiwala
Safir Motiwala

Reputation: 121

I have used a straightforward method to get the common sub sequences from multiple strings. Although the code can be further optimised.

import itertools
 
def getMaxOccurrence(stringsList, key):
    count = 0
    for word in stringsList:
        if key in word:
            count += 1
    return count

def getSubSequences(STR):
    combs = []
    result = []
    for l in range(1, len(STR)+1):
        combs.append(list(itertools.combinations(STR, l)))

    for c in combs:
        for t in c:
            result.append(''.join(t))
    return result

def getCommonSequences(S):
    mainList = []
    for word in S:
        temp = getSubSequences(word)
        mainList.extend(temp)
    
    mainList = list(set(mainList))
    mainList = reversed(sorted(mainList, key=len))
    mainList = list(filter(None, mainList))

    finalData = dict()

    for alpha in mainList:
        val = getMaxOccurrence(S, alpha)
        if val > 0:
            finalData[alpha] = val

    finalData = {k: v for k, v in sorted(finalData.items(), key=lambda item: item[1], reverse=True)}

    return finalData


stringsList = ['abc', 'cab', 'dfab', 'xz']
seqs = getCommonSequences(stringsList)
print(seqs)

Upvotes: 0

Eli Korvigo
Eli Korvigo

Reputation: 10483

This is a relatively optimised naïve algorithm. You first transform each sequence into a set of all its ngrams. Then you intersect all sets and find the longest ngram in the intersection.

from functools import partial, reduce
from itertools import chain
from typing import Iterator


def ngram(seq: str, n: int) -> Iterator[str]:
    return (seq[i: i+n] for i in range(0, len(seq)-n+1))


def allngram(seq: str) -> set:
    lengths = range(len(seq))
    ngrams = map(partial(ngram, seq), lengths)
    return set(chain.from_iterable(ngrams))


sequences = ["brownasdfoersjumps",
             "foxsxzxasis12sa[[#brown",
             "thissasbrownxc-34a@s;"]

seqs_ngrams = map(allngram, sequences)
intersection = reduce(set.intersection, seqs_ngrams)
longest = max(intersection, key=len) # -> brown

While this might get you through short sequences, this algorithm is extremely inefficient on long sequences. If your sequences are long, you can add a heuristic to limit the largest possible ngram length (i.e. the longest possible common substring). One obvious value for such a heuristic may be the shortest sequence's length.

def allngram(seq: str, minn=1, maxn=None) -> Iterator[str]:
    lengths = range(minn, maxn) if maxn else range(minn, len(seq))
    ngrams = map(partial(ngram, seq), lengths)
    return set(chain.from_iterable(ngrams))


sequences = ["brownasdfoersjumps",
             "foxsxzxasis12sa[[#brown",
             "thissasbrownxc-34a@s;"]

maxn = min(map(len, sequences))
seqs_ngrams = map(partial(allngram, maxn=maxn), sequences)
intersection = reduce(set.intersection, seqs_ngrams)
longest = max(intersection, key=len)  # -> brown

This may still take too long (or make your machine run out of RAM), so you might want to read about some optimal algorithms (see the link I left in my comment to your question).

Update

To count the number of strings wherein each ngram occurs

from collections import Counter
sequences = ["brownasdfoersjumps",
             "foxsxzxasis12sa[[#brown",
             "thissasbrownxc-34a@s;"]

seqs_ngrams = map(allngram, sequences)
counts = Counter(chain.from_iterable(seqs_ngrams))

Counter is a subclass of dict, so its instances have similar interfaces:

print(counts)
Counter({'#': 1,
         '#b': 1,
         '#br': 1,
         '#bro': 1,
         '#brow': 1,
         '#brown': 1,
         '-': 1,
         '-3': 1,
         '-34': 1,
         '-34a': 1,
         '-34a@': 1,
         '-34a@s': 1,
         '-34a@s;': 1,
         ...

You can filter the counts to leave substrings occurring in at least n strings: {string: count for string, count in counts.items() if count >= n}

Upvotes: 9

Related Questions