Reputation: 195
Fancy title :) I have a file that contains the following:
>sequence_40
ABCDABDCABCDBACDBACDBACDBACDABDCDC
ACDCCDCABDCADCADBCACBDCABD
>sequence_41
DCBACDBACDADCDCDCABCDCACBDCBDACBDC
BCDBABABBABACDCDBCACDBACDBACDBACDC
BCDB
...
Then, I have a function that returns a dictionary (called dict) that returns the sequences as keys and the strings (combined on one line) as values for the keys. The sequences range from 40 to 59. I want to take a dictionary of sequences and return the longest common sub-sequence found in ALL the sequences. Managed to find some help here on stackoverflow and made a code that only compares the LAST TWO strings in that dictionary, not all of them :). This is the code
def longest_common_sequence(s1, s2):
m = [[0] * (1 + len(s2)) for i in range(1 + len(s1))]
longest, x_longest = 0, 0
for x in range(1, 1 + len(s1)):
for y in range(1, 1 + len(s2)):
if s1[x - 1] == s2[y - 1]:
m[x][y] = m[x - 1][y - 1] + 1
if m[x][y] > longest:
longest = m[x][y]
x_longest = x
else:
m[x][y] = 0
return s1[x_longest - longest: x_longest]
for i in range(40,59):
s1=str(dictionar['sequence_'+str(i)])
s2=str(dictionar['sequence_'+str(i+1)])
longest_common_sequence(s1,s2)
How can I modify it to get the common subsequence among ALL sequences in dictionary? Thanks!
Upvotes: 1
Views: 783
Reputation: 59516
EDIT: As @lmcarreiro pointed out, there is a relevant difference between substrings (or subarrays or sublists) and subsequences. To my understanding we are all talking about substrings here, so I will use this term in my answer.
Guillaumes answer can be improved:
def eachPossibleSubstring(string):
for size in range(len(string) + 1, 0, -1):
for start in range(len(string) - size + 1):
yield string[start:start+size]
def findLongestCommonSubstring(strings):
shortestString = min(strings, key=len)
for substring in eachPossibleSubstring(shortestString):
if all(substring in string
for string in strings if string != shortestString):
return substring
print findLongestCommonSubstring([
'ABCDABDCABCDBACDBACDBACDBACDABDCDCACDCCDCABDCADCADBCACBDCABD',
'DCBACDBACDADCDCDCABCDCACBDCBDACBDCBCDBABABBABACDCDBCACDBACDBACDBACDCBCDB',
])
This prints:
ACDBACDBACDBACD
This is faster because I return the first found and search from longest to shortest.
The basic idea is this: Take each possible substring of the shortest of your strings (in the order from the longest to the shortest) and see if this substring can be found in all other strings. If so, return it, otherwise try the next substring.
You need to understand generators. Try e. g. this:
for substring in eachPossibleSubstring('abcd'):
print substring
or
print list(eachPossibleSubstring('abcd'))
Upvotes: 2
Reputation: 6009
I'd start by defining a function to return all possible subsequences of a given sequence:
from itertools import combinations_with_replacement
def subsequences(sequence):
"returns all possible subquences of a given sequence"
for start, stop in combinations_with_replacement(range(len(sequence)), 2):
if start < stop:
yield sequence[start:stop]
then I'd make another method to check if a given subsequence in present in all given sequences:
def is_common_subsequence(sub, sequences):
"returns True if <sub> is a common subsequence in all <sequences>"
return all(sub in sequence for sequence in sequences)
then using the 2 methods above it is pretty easy to get all common subsequences in a given set of sequences:
def common_sequences(sequences):
"return all subsequences common in sequences"
shortest_seq = min(sequences, key=len)
return set(subsequence for subsequence in subsequences(shortest_seq) \
if is_common_subsequence(subsequence, sequences))
... and extracting the longuest sequence:
def longuest_common_subsequence(sequences):
"returns the longuest subsequence in sequences"
return max(common_sequences(sequences), key=len)
Result:
sequences = {
41: 'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
42: '123ABCDEFGHIJKLMNOPQRSTUVW',
43: '123456ABCDEFGHIJKLMNOPQRST'
}
sequences2 = {
0: 'ABCDEFGHIJ',
1: 'DHSABCDFKDDSA',
2: 'SGABCEIDEFJRNF'
}
print(longuest_common_subsequence(sequences.values()))
>>> ABCDEFGHIJKLMNOPQRST
print(longuest_common_subsequence(sequences2.values()))
>>> ABC
Upvotes: 1
Reputation: 4866
Here you have a possible approach. First let's define a function that returns the longest substring between two strings:
def longest_substring(s1, s2):
t = [[0]*(1+len(s2)) for i in range(1+len(s1))]
l, xl = 0, 0
for x in range(1,1+len(s1)):
for y in range(1,1+len(s2)):
if s1[x-1] == s2[y-1]:
t[x][y] = t[x-1][y-1] + 1
if t[x][y]>l:
l = t[x][y]
xl = x
else:
t[x][y] = 0
return s1[xl-l: xl]
Now I'll create a random dict
of sequences for the example:
import random
import string
d = {i : ''.join(random.choice(string.ascii_uppercase) for _ in range(50)) for i in range(10)}
print d
{0: 'ASCUCEVJNIGWVMWMBBQQBZYBBNGQAJRYXACGFEIFWHMBCNYRGL', 1: 'HKUKZOJJUCRTSBLNZXCIBARLPNAPAABRBZEVGVILJAFCGWGQVV', 2: 'MMHCYPKECRJFEWTGYITMHZSNHAFEZVFYDAVILRYRKIDDBEFRVX', 3: 'DGBULRFJINFZEELDASRFBIRSADWMRAYMGCDAOJDKQIMXIRLTEI', 4: 'VDUFWZSXLRGOIMAHOAMZAIWDPTHDVDXUACRBASJMCUHREDORRH', 5: 'RFGAVHOWNKRZMYMSFSSNUGCKEWUNVETCDWJXSPBJHKSTPFNSJO', 6: 'HFMLMHCFSOEXBXWFAROIRGJNPRTKRWCEPLFOKGMXNUPCPWREWX', 7: 'CNPGSHGVIRLDXAADXUVWCTJCXUHQLALBUOJMXQBKXWHKGSJHEH', 8: 'UWDXXTRCFNCBUBEYGYTDWTPLNTRHYQWKTHPRVCBAWIMNGHULDC', 9: 'OOCJRXBZKJIGHZEJOOIKWKMQKIEQVPEDTFPJQAUQKJQVLOMGJB'}
Finally, we need to find the longest subsequence between all sequences:
import itertools
max([longest_substring(i,j) for i,j in itertools.combinations(d.values(), 2)], key=len)
Output:
'VIL'
Upvotes: 0