Reputation: 469
I have 2 files with around 100,000,000 lines that need to be compared to each other. As stated in the title I want to compare each line from the files to each other. I have the code below which works absolutely fine, however I wish to adapt it so that if a mismatch occurs during a long match then it's accepted with a mismatch level of 5%.
Below is the function I use for matching the lines of the files.
ret1 = []
merging = {}
def slide_merge(seq1, seq2):
for i in xrange(min(len(seq1), len(seq2))):
if seq1[i] == 'N':
ret1.append(seq1[i])
print (''.join(ret1))
elif seq2[i] == 'N':
ret1.append(seq1[i])
print (''.join(ret1))
elif seq1[i] != seq2[i]:
break
else:
ret1.append(seq1[i])
print (''.join(ret1))
print ("strings share a longest common prefix of length:", len(ret1), "out of:", len(seq1))
ret1len = len(ret1)
merging[''.join(ret1)] = ret1len # Adds details to dictionary
return merging
The below code is how the above function is used within the code and how I get the longest match.
while len(rc1u) >= 50: # So matches of 8 are included
slide_merge(rc1u, rc2wr) ### rc1u all cut up here so of no further use
rc1u = rc1u[1:]
merging
max(merging.iteritems(), key=operator.itemgetter(1))[0]
highest = max(merging.iteritems(), key=operator.itemgetter(1))[0]
highest
Incase it matters I am using HTSeq to input the files which are genetic sequencing.
So the question is how could I adapt this code or make another code which compares 2 strings and identifies the longest matching sequence from the start whilst allowing for 5% mismatches to occur so for example:
string1 = AAAAATTTTTCCCCCGGGGGTTTTT
string2 = AAAAATTTTTCCCCCGGGGATTTTT
The code should see that the 2 strings match entirely apart from 1 character but as that is less than 5% of the total the matched region should be stated as: matched 25
Upvotes: 1
Views: 902
Reputation: 3857
You can compute the Levenshtein distance between those words and then find the percentage of 'mismatch between those words'.
An example of implemenation is provided here.
Let's say that the function computing the distance between two strings is called dis_lev
, you could evaluate the percentage this way:
from __future__ import division
distance = dis_lev(string1, string2)
mismatch_ratio = distance / len(string1)
if mismatch_ratio > 0.05:
raise MyAwesomeException("Hey ! These things do not match at all !")
For example, using your example and the iterative implementation available in the links I've provided:
>>> distance = dis_lev("AAAAATTTTTCCCCCGGGGGTTTTT", "AAAAATTTTTCCCCCGGGGATTTTT")
>>> distance
1
>>> mismatch_ratio = distance / len("AAAAATTTTTCCCCCGGGGGTTTTT")
0.04
Edit: Depending on your case, you could use another metrics, some are listed here
Upvotes: 1