Matching 2 strings and allowing a 5% mismatch rate

Question

I have 2 files with around 100,000,000 lines that need to be compared to each other. As stated in the title I want to compare each line from the files to each other. I have the code below which works absolutely fine, however I wish to adapt it so that if a mismatch occurs during a long match then it's accepted with a mismatch level of 5%.

Below is the function I use for matching the lines of the files.

ret1 = []
merging = {} 
def slide_merge(seq1, seq2):
    for i in xrange(min(len(seq1), len(seq2))):
        if seq1[i] == 'N':
            ret1.append(seq1[i])
            print (''.join(ret1))
        elif seq2[i] == 'N':
            ret1.append(seq1[i])
            print (''.join(ret1))
        elif seq1[i] != seq2[i]:
            break
        else:
            ret1.append(seq1[i])
            print (''.join(ret1))
    print ("strings share a longest common prefix of length:", len(ret1), "out of:", len(seq1))
    ret1len = len(ret1)
    merging[''.join(ret1)] = ret1len # Adds details to dictionary
    return merging

The below code is how the above function is used within the code and how I get the longest match.

while len(rc1u) >= 50: # So matches of 8 are included
    slide_merge(rc1u, rc2wr)      ### rc1u all cut up here so of no further use
    rc1u = rc1u[1:]
merging
max(merging.iteritems(), key=operator.itemgetter(1))[0]
highest = max(merging.iteritems(), key=operator.itemgetter(1))[0]
highest

Incase it matters I am using HTSeq to input the files which are genetic sequencing.

So the question is how could I adapt this code or make another code which compares 2 strings and identifies the longest matching sequence from the start whilst allowing for 5% mismatches to occur so for example:

string1 = AAAAATTTTTCCCCCGGGGGTTTTT
string2 = AAAAATTTTTCCCCCGGGGATTTTT

The code should see that the 2 strings match entirely apart from 1 character but as that is less than 5% of the total the matched region should be stated as: matched 25

Matching 2 strings and allowing a 5% mismatch rate

Answers (1)

Related Questions