Reputation: 13
I am writing a program to match two sets of sequences. I already have two lists of SeqRecord objects, one holding the records with the suffix F and the other those with the suffix R. For each sequence in list F, I would like to find the most similar one in list R, basing the search on seq_record.id, and then run a global alignment on the two matched sequences. This should be repeated for every sequence in list F.
Here are sample id names from the list f: BIE-1_ITS5 ; BIE-2_ITS5 ; BIE-3_ITS5 ; KAZ-5_ITS5
And here from list r: BIE-1_ITS4 ; BIE-2_ITS4 ; BIE-3_ITS4 ; KAZ-5_ITS4
For example, for the sequence with the id BIE-1_ITS5, the program should find the sequence BIE-1_ITS4 in list r and run a global sequence alignment on the two.
Matching first with first and second with second is not a good option, because some sequences may not have a pair.
Thanks for any answer
Upvotes: 0
Views: 225
Reputation: 1521
I used difflib.SequenceMatcher to compute a similarity score for each pair of ids and, for each id in f, picked the id in r with the highest score:
import difflib
import numpy as np

def getScore(item1, item2):
    return float(difflib.SequenceMatcher(None, item1, item2).ratio() * 100)

def getMostSimilar(f, r):
    result = {}
    for i in f:
        scores = [0] * len(r)
        for ind, j in enumerate(r):
            scores[ind] = getScore(i, j)
        print(scores)
        ind = np.argmax(scores)
        result[i] = r[ind]
    return result

f = ['BIE-1_ITS5', 'BIE-2_ITS5', 'BIE-3_ITS5', 'KAZ-5_ITS5']
r = ['BIE-1_ITS4', 'BIE-2_ITS4', 'BIE-3_ITS4', 'KAZ-5_ITS4']
print(getMostSimilar(f, r))
I got the following result:
[90.0, 80.0, 80.0, 50.0]
[80.0, 90.0, 80.0, 50.0]
[80.0, 80.0, 90.0, 50.0]
[50.0, 50.0, 50.0, 90.0]
{'BIE-1_ITS5': 'BIE-1_ITS4',
'BIE-2_ITS5': 'BIE-2_ITS4',
'BIE-3_ITS5': 'BIE-3_ITS4',
'KAZ-5_ITS5': 'KAZ-5_ITS4'}
The printed dictionary is a mapping of the most similar items.
Note: getMostSimilar does not guarantee a unique (one-to-one) mapping, since two ids in f can end up mapped to the same id in r. Enforcing uniqueness requires more information about how to map, e.g. first come, first served, or highest matching score (which would need tie-breaking rules).
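If the ids really differ only in the primer suffix, as in the samples from the question, an exact lookup on the shared prefix may be enough to get a unique mapping without any tie-breaking. The following is a minimal sketch under that assumption, not a replacement for the scoring approach: the function name pair_by_prefix, the default suffixes _ITS5 and _ITS4, and the extra id KAZ-9_ITS5 (added only to show an unpaired case) are illustrative choices and do not come from the question.

```python
def pair_by_prefix(f_ids, r_ids, f_suffix="_ITS5", r_suffix="_ITS4"):
    """Pair ids that are identical once the primer suffix is removed.

    Assumption: every forward id ends with f_suffix and every reverse id
    ends with r_suffix. Ids that do not follow this pattern are left unpaired.
    """
    # Index the reverse ids by their prefix (the id without the suffix).
    r_by_prefix = {
        rid[:-len(r_suffix)]: rid for rid in r_ids if rid.endswith(r_suffix)
    }

    pairs = {}     # forward id -> matching reverse id
    unpaired = []  # forward ids with no partner in r_ids
    for fid in f_ids:
        prefix = fid[:-len(f_suffix)] if fid.endswith(f_suffix) else None
        if prefix is not None and prefix in r_by_prefix:
            pairs[fid] = r_by_prefix[prefix]
        else:
            unpaired.append(fid)
    return pairs, unpaired


# 'KAZ-9_ITS5' is a made-up id with no reverse partner, for illustration only.
f = ['BIE-1_ITS5', 'BIE-2_ITS5', 'BIE-3_ITS5', 'KAZ-5_ITS5', 'KAZ-9_ITS5']
r = ['BIE-1_ITS4', 'BIE-2_ITS4', 'BIE-3_ITS4', 'KAZ-5_ITS4']

pairs, unpaired = pair_by_prefix(f, r)
print(pairs)
print(unpaired)
```

With lists of SeqRecord objects, the same idea should work by building the lookup from each record's id and storing the records themselves as values. Each resulting pair could then be handed to a global aligner, for example Biopython's Bio.Align.PairwiseAligner with mode set to "global". Forward ids that end up in unpaired are simply skipped rather than forced onto the nearest neighbour.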
Upvotes: 1