Nicolas NOEL

Reputation: 891

Longest common substring from more than two strings

I'm looking for a Python library for finding the longest common substring from a set of strings. There are two common ways to solve this problem: with suffix trees, or with dynamic programming.

The method implemented is not important. What matters is that it can be used on a set of strings, not only two.
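For reference, a minimal dynamic-programming sketch for just two strings might look like the following (the function name is mine); the question is about generalising this to a whole set of strings:

def longest_common_substring_2(a, b):
    # Classic DP: table[i][j] = length of the longest common suffix
    # of a[:i] and b[:j]; the overall maximum marks the answer.
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best_len, best_end = 0, 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
                if table[i][j] > best_len:
                    best_len, best_end = table[i][j], i
    return a[best_end - best_len:best_end]

print(longest_common_substring_2('hello world', 'yellow'))  # -> 'ello'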

Upvotes: 68

Views: 49036

Answers (10)

Zachary Vance

Reputation: 890

The addition of a single 'break' speeds up jtjacques's answer significantly on my machine (1000X or so for 16K files):

def long_substr(data):
    substr = ''
    if len(data) > 1 and len(data[0]) > 0:
        for i in range(len(data[0])):
            for j in range(len(substr)+1, len(data[0])-i+1):
                if all(data[0][i:i+j] in x for x in data[1:]):
                    substr = data[0][i:i+j]
                else:
                    break
    return substr
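Usage is unchanged; for example, with the strings from jtjacques's answer below:

print(long_substr(['Oh, hello, my friend.',
                   'I prefer Jelly Belly beans.',
                   'When hell freezes over!']))  # prints 'ell'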

Upvotes: 1

Michael

Reputation: 191

A caveman solution that gives you a DataFrame with the most frequent substrings in the combined string, based on the substring lengths you pass in as a list:

import pandas as pd

lista = ['How much wood would a woodchuck',' chuck if a woodchuck could chuck wood?']

string = ''
for i in lista:
    string = string + ' ' + str(i)

string = string.lower()

characters_you_would_like_to_remove_from_string = [' ','-','_']

for i in characters_you_would_like_to_remove_from_string:
    string = string.replace(i,'')

substring_length_you_want_to_check = [3,4,5,6,7,8]

results_list = []

for string_length in substring_length_you_want_to_check:
    for i in range(len(string)):
        checking_str = string[i:i+string_length]
        if len(checking_str) == string_length:
            number_of_times_appears = (len(string) - len(string.replace(checking_str,'')))/string_length
            results_list = results_list+[[checking_str,number_of_times_appears]]


df = pd.DataFrame(data=results_list,columns=['string','freq'])

df['freq'] = df['freq'].astype('int64')

df = df.drop_duplicates()


df = df.sort_values(by='freq',ascending=False)

display(df[:10])

result is:

    string  freq
78    huck     4
63    wood     4
77    chuc     4
132  chuck     4
8      ood     4
7      woo     4
21     chu     4
23     uck     4
22     huc     4
20     dch     3

Upvotes: 0

Ted Archer

Reputation: 1

My answer is pretty slow, but very easy to understand. Working on a file with 100 strings of 1 kB each takes about two seconds; it returns any one longest substring if there is more than one.

ls = list(data)  # "data" is assumed to hold the input strings
ls.sort(key=len)
s1 = ls.pop(0)  # the shortest string: only its substrings can be common to all
maxl = len(s1)

#1 create a list of all substrings of the shortest string, sorted longest first. Thus we don't have to check the whole list.

subs = [s1[i:j] for i in range(maxl) for j in range(maxl,i,-1)]
subs.sort(key=len, reverse=True)

#2 Check each substring against the remaining strings, longest first. If it is not in one of them, break out of the inner loop: it is not common. If it passes every check, it is the longest common substring, so return it.

def isasub(subs, ls):
    for sub in subs:
        for st in ls:
            if sub not in st:
                break
        else:
            return sub

print('the longest common substring is: ', isasub(subs, ls))

Upvotes: 0

Herbert

Reputation: 5645

This can be done shorter:

def long_substr(data):
  substrs = lambda x: {x[i:i+j] for i in range(len(x)) for j in range(len(x) - i + 1)}
  s = substrs(data[0])
  for val in data[1:]:
    s.intersection_update(substrs(val))
  return max(s, key=len)

Sets are (probably) implemented as hash maps, which makes this a bit inefficient. If you (1) implemented the set datatype as a trie and (2) stored just the suffixes in the trie, then forced each node to be an endpoint (the equivalent of adding all substrings), then in theory I would guess this baby would be pretty memory-efficient, especially since intersections of tries are super easy.

Nevertheless, this is short and premature optimization is the root of a significant amount of wasted time.
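For illustration, here is a rough, unoptimised sketch of that trie idea, using plain dict-of-dicts tries rather than a real trie-backed set type (all names are mine, not from the answer above):

def build_suffix_trie(s):
    # Insert every suffix of s into a dict-of-dicts trie; since no end-of-word
    # markers are used, every root-to-node path spells a substring of s.
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
    return root

def intersect_tries(a, b):
    # Keep only edges present in both tries; the result spells exactly the
    # substrings common to both strings.
    return {ch: intersect_tries(a[ch], b[ch]) for ch in a.keys() & b.keys()}

def deepest_path(trie):
    # The deepest root-to-node path is a longest common substring.
    best = ''
    for ch, child in trie.items():
        candidate = ch + deepest_path(child)
        if len(candidate) > len(best):
            best = candidate
    return best

def long_substr_trie(strings):
    if not strings:
        return ''
    common = build_suffix_trie(strings[0])
    for s in strings[1:]:
        common = intersect_tries(common, build_suffix_trie(s))
    return deepest_path(common)

print(long_substr_trie(['flower', 'flowchart', 'reflow']))  # -> 'flow'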

Upvotes: 9

Daniel Tripp

Reputation: 111

If someone is looking for a generalized version that can also take a list of sequences of arbitrary objects:

def get_longest_common_subseq(data):
    substr = []
    if len(data) > 1 and len(data[0]) > 0:
        for i in range(len(data[0])):
            for j in range(len(data[0])-i+1):
                if j > len(substr) and is_subseq_of_any(data[0][i:i+j], data):
                    substr = data[0][i:i+j]
    return substr

def is_subseq_of_any(find, data):
    if len(data) < 1 and len(find) < 1:
        return False
    for i in range(len(data)):
        if not is_subseq(find, data[i]):
            return False
    return True

# Will also return True if possible_subseq == seq.
def is_subseq(possible_subseq, seq):
    if len(possible_subseq) > len(seq):
        return False
    def get_length_n_slices(n):
        for i in xrange(len(seq) + 1 - n):
            yield seq[i:i+n]
    for slyce in get_length_n_slices(len(possible_subseq)):
        if slyce == possible_subseq:
            return True
    return False

print get_longest_common_subseq([[1, 2, 3, 4, 5], [2, 3, 4, 5, 6]])

print get_longest_common_subseq(['Oh, hello, my friend.',
                                     'I prefer Jelly Belly beans.',
                                     'When hell freezes over!'])

Upvotes: 1

kiminoa

Reputation: 2709

I prefer this for is_substr, as I find it a bit more readable and intuitive:

def is_substr(find, data):
  """
  inputs a substring to find, returns True only
  if it is found in every string in the data list
  """

  if len(find) < 1 or len(data) < 1:
    return False # expected input DNE

  is_found = True # and-ing to False anywhere in data will return False
  for i in data:
    print "Looking for substring %s in %s..." % (find, i)
    is_found = is_found and find in i
  return is_found

Upvotes: 4

badp

Reputation: 11833

# this does not increase asymptotic complexity
# but can still waste more time than it saves. TODO: profile
def shortest_of(strings):
    return min(strings, key=len)

def long_substr(strings):
    substr = ""
    if not strings:
        return substr
    reference = shortest_of(strings) #strings[0]
    length = len(reference)
    #find a suitable slice i:j
    for i in xrange(length):
        #only consider strings long at least len(substr) + 1
        for j in xrange(i + len(substr) + 1, length + 1):
            candidate = reference[i:j]  # ↓ is the slice recalculated every time?
            if all(candidate in text for text in strings):
                substr = candidate
    return substr

Disclaimer: this adds very little to jtjacques's answer. However, it should hopefully be more readable and faster, and it didn't fit in a comment, hence I'm posting it as an answer. I'm not satisfied with shortest_of, to be honest.

Upvotes: 2

Rodrigue ALAHASSA

Reputation: 15

You could use the SuffixTree module, a wrapper around an ANSI C implementation of generalised suffix trees. The module is easy to handle.

Take a look at: here

Upvotes: -1

jtjacques

Reputation: 871

These paired functions will find the longest common substring in any arbitrary array of strings:

def long_substr(data):
    substr = ''
    if len(data) > 1 and len(data[0]) > 0:
        for i in range(len(data[0])):
            for j in range(len(data[0])-i+1):
                if j > len(substr) and is_substr(data[0][i:i+j], data):
                    substr = data[0][i:i+j]
    return substr

def is_substr(find, data):
    if len(data) < 1 and len(find) < 1:
        return False
    for i in range(len(data)):
        if find not in data[i]:
            return False
    return True


print long_substr(['Oh, hello, my friend.',
                   'I prefer Jelly Belly beans.',
                   'When hell freezes over!'])

No doubt the algorithm could be improved and I've not had a lot of exposure to Python, so maybe it could be more efficient syntactically as well, but it should do the job.

EDIT: in-lined the second is_substr function as demonstrated by J.F. Sebastian. Usage remains the same. Note: no change to algorithm.

def long_substr(data):
    substr = ''
    if len(data) > 1 and len(data[0]) > 0:
        for i in range(len(data[0])):
            for j in range(len(data[0])-i+1):
                if j > len(substr) and all(data[0][i:i+j] in x for x in data):
                    substr = data[0][i:i+j]
    return substr

Hope this helps,

Jason.

Upvotes: 68

Ned Batchelder

Reputation: 376082

def common_prefix(strings):
    """ Find the longest string that is a prefix of all the strings.
    """
    if not strings:
        return ''
    prefix = strings[0]
    for s in strings:
        if len(s) < len(prefix):
            prefix = prefix[:len(s)]
        if not prefix:
            return ''
        for i in range(len(prefix)):
            if prefix[i] != s[i]:
                prefix = prefix[:i]
                break
    return prefix

From http://bitbucket.org/ned/cog/src/tip/cogapp/whiteutils.py
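Note that this finds the longest common prefix rather than an arbitrary common substring. For example, with the function above:

print(common_prefix(['interspecies', 'interstellar', 'interstate']))  # -> 'inters'
print(common_prefix(['hello world', 'yellow']))                       # -> '' (no shared prefix)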

Upvotes: 4
