Tom
Tom

Reputation: 469

Matching 2 strings and allowing a 5% mismatch rate

I have 2 files with around 100,000,000 lines that need to be compared to each other. As stated in the title I want to compare each line from the files to each other. I have the code below which works absolutely fine, however I wish to adapt it so that if a mismatch occurs during a long match then it's accepted with a mismatch level of 5%.

Below is the function I use for matching the lines of the files.

ret1 = []
merging = {} 
def slide_merge(seq1, seq2):
    for i in xrange(min(len(seq1), len(seq2))):
        if seq1[i] == 'N':
            ret1.append(seq1[i])
            print (''.join(ret1))
        elif seq2[i] == 'N':
            ret1.append(seq1[i])
            print (''.join(ret1))
        elif seq1[i] != seq2[i]:
            break
        else:
            ret1.append(seq1[i])
            print (''.join(ret1))
    print ("strings share a longest common prefix of length:", len(ret1), "out of:", len(seq1))
    ret1len = len(ret1)
    merging[''.join(ret1)] = ret1len # Adds details to dictionary
    return merging

The below code is how the above function is used within the code and how I get the longest match.

while len(rc1u) >= 50: # So matches of 8 are included
    slide_merge(rc1u, rc2wr)      ### rc1u all cut up here so of no further use
    rc1u = rc1u[1:]
merging
max(merging.iteritems(), key=operator.itemgetter(1))[0]
highest = max(merging.iteritems(), key=operator.itemgetter(1))[0]
highest

Incase it matters I am using HTSeq to input the files which are genetic sequencing.

So the question is how could I adapt this code or make another code which compares 2 strings and identifies the longest matching sequence from the start whilst allowing for 5% mismatches to occur so for example:

string1 = AAAAATTTTTCCCCCGGGGGTTTTT
string2 = AAAAATTTTTCCCCCGGGGATTTTT

The code should see that the 2 strings match entirely apart from 1 character but as that is less than 5% of the total the matched region should be stated as: matched 25

Upvotes: 1

Views: 902

Answers (1)

Ketouem
Ketouem

Reputation: 3857

You can compute the Levenshtein distance between those words and then find the percentage of 'mismatch between those words'.

An example of implemenation is provided here.

Let's say that the function computing the distance between two strings is called dis_lev, you could evaluate the percentage this way:

from __future__ import division

distance = dis_lev(string1, string2)
mismatch_ratio = distance / len(string1)
if mismatch_ratio > 0.05:
    raise MyAwesomeException("Hey ! These things do not match at all !")

For example, using your example and the iterative implementation available in the links I've provided:

>>> distance = dis_lev("AAAAATTTTTCCCCCGGGGGTTTTT", "AAAAATTTTTCCCCCGGGGATTTTT")
>>> distance
1
>>> mismatch_ratio = distance / len("AAAAATTTTTCCCCCGGGGGTTTTT")
0.04 

Edit: Depending on your case, you could use another metrics, some are listed here

Upvotes: 1

Related Questions