Seino Jongkees

Reputation: 23

Fast way to find a substring with some mismatches allowed

I am looking for help in making an efficient way to process some high throughput DNA sequencing data. The data are in 5 files with a few hundred thousand sequences each, within which each sequence is formatted as follows:

@M01102:307:000000000-BCYH3:1:1102:19202:1786 1:N:0:TAGAGGCA+CTCTCTCT

TAATACGACTCACTATAGGGTTAACTTTAAGAGGGAGATATACATATGAGTCTTTTGGGTAAGAAGCCTTTTTGTCTGCTTTATGGTCCTATCTGCGGCAGGGCCAGCGGCAGCTAGGACGGGGGGCGGATAAGATCGGAAGAGCACTCGTCTGAACTCCAGTCACTAGAGGCAATCTCGT

+

AAABBFAABBBFGGGGFGGGGGAG5GHHHCH54BEEEEA5GGHDHHHH5BAE5DF5GGCEB33AF3313GHHHE255D55D55D53@5@B5DBD5@E/@//>/1??/?/E@///FDF0B?CC??CAAA;--./;/BBE?;AFFA./;/;.;AEA//BFFFF/BB/////;/..:.9999.;

What I am doing at the moment is iterating over the lines, checking whether the first and last characters are allowed DNA characters (A/C/G/T or N), then doing a fuzzy search for the two primer sequences that flank the coding sequence fragment I am interested in. This last step is the part where things are going wrong...

When I search for exact matches, I get usable data in a reasonable time frame. However, I know I am missing out on a lot of data that is being skipped because of a single mismatch in the primer sequences. This happens because read quality degrades with length, so more unreadable bases ('N') crop up. These aren't a problem in my analysis otherwise, but they are a problem for a simple direct string search: from a DNA perspective an N should be allowed to match any base, but from a string search perspective it matches nothing (I am less concerned about insertions or deletions). For this reason I am trying to implement some sort of fuzzy or more biologically informed search, but have yet to find an efficient way of doing it.
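In other words, the matching rule I want is essentially a sliding-window comparison that counts mismatches but treats 'N' as a wildcard. A naive pure-Python sketch of that rule (illustrative only, and itself too slow for my full dataset, but it shows the behaviour I am after; the name `fuzzy_find` is just for this example):

```python
def fuzzy_find(seq, primer, max_mismatch=2):
    """Return the index of the first window of seq that matches primer
    with at most max_mismatch mismatches, treating 'N' on either side
    as matching anything. Return -1 if no window qualifies."""
    n = len(primer)
    for i in range(len(seq) - n + 1):
        mismatches = 0
        for a, b in zip(seq[i:i + n], primer):
            if a != b and a != 'N' and b != 'N':
                mismatches += 1
                if mismatches > max_mismatch:
                    break
        else:
            # Inner loop finished without exceeding the mismatch budget
            return i
    return -1

# The 'N' in the read does not count as a mismatch:
print(fuzzy_find('AAGTANTACGAC', 'TAATACGAC'))  # -> 3
```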

What I have now does work on test datasets, but is much too slow to be useful on a full size real dataset. The relevant fragment of the code is:

from Bio import pairwise2

Sequence = 'NNNNNTAATACGACTCACTATAGGGTTAACTTTAAGAGGGAGATATACATATGAGTCTTTTGGGTAAGAAGCCTTTTTGTCTGCTTTATGGTCCTATCTGCGGCAGGGCCAGCGGCAGCTAGGACGGGGGGCGGATAAGATCGGAAGAGCACTCGTCTGAACTCCAGTCACTAGAGGCAATCTCGT'
fwdprimer = 'TAATACGACTCACTATAGGGTTAACTTTAAGAAGGAGATATACATATG'
revprimer = 'TAGGACGGGGGGCGGAAA'

# Only lines that start and end with a valid base (or N) hold sequence data
if Sequence.endswith(('N', 'A', 'C', 'G', 'T')) and Sequence.startswith(('N', 'A', 'C', 'G', 'T')):
    # Local alignment, gap open/extend penalties of -1
    fwdalign = pairwise2.align.localxs(Sequence, fwdprimer, -1, -1, one_alignment_only=1)
    revalign = pairwise2.align.localxs(Sequence, revprimer, -1, -1, one_alignment_only=1)
    if fwdalign[0][2] > 45 and revalign[0][2] > 15:  # element 2 is the score
        startIndex = fwdalign[0][3] + 45  # element 3 is the alignment start
        endIndex = revalign[0][3] + 3
        Sequence = Sequence[startIndex:endIndex]
        print(Sequence)

(obviously the first conditional is not needed in this example, but helps to filter out the other 3/4 of the lines that don't have DNA sequence and so don't need to be searched)

This approach uses the pairwise alignment method from biopython, which is designed for finding alignments of DNA sequences with mismatches allowed. That part it does well, but because it needs to do a sequence alignment for each sequence with both primers it takes way too long to be practical. All I need it to do is find the matching sequence, allowing for one or two mismatches. Is there another way of doing this that would serve my goals but be computationally more feasible? For comparison, the following code from a previous version works plenty fast with my full data sets:

if ('TAATACGACTCACTATAGGGTTAACTTTAAGAAGGAGATATACATATG' in Line) and ('TAGGACGGGGGGCGGAAA' in Line):
    startIndex = Line.find('TAATACGACTCACTATAGGGTTAACTTTAAGAAGGAGATATACATATG') + 45
    endIndex = Line.find('TAGGACGGGGGGCGGAAA') + 3
    Line = Line[startIndex:endIndex]
    print(Line)

This is not something I run frequently, so don't mind if it is a little inefficient, but don't want to have to leave it running for a whole day. I would like to get a result in seconds or minutes, not hours.

Upvotes: 2

Views: 2601

Answers (1)

jspcal

Reputation: 51904

The tre library provides fast approximate matching functions. You can specify the maximum number of mismatched characters with maxerr as in the example below:

https://github.com/laurikari/tre/blob/master/python/example.py

There is also the regex module, which supports fuzzy searching options: https://pypi.org/project/regex/#additional-features
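With the regex module, the number of allowed substitutions is written directly into the pattern. A sketch using the reverse primer from the question (assumes the third-party regex package is installed; `{s<=2}` means "at most two substituted characters"):

```python
import regex  # the third-party 'regex' package, not the stdlib 're'

primer = 'TAGGACGGGGGGCGGAAA'
# (?:...){s<=2}: match the primer with at most two substitutions
pattern = regex.compile('(?:%s){s<=2}' % primer)

# The 'N' in this read counts as one substitution, within budget
read = 'CCCTAGGACGGGGGGCGGANATTT'
m = pattern.search(read)
if m:
    print(m.start(), m.end())
```

Because only substitutions are allowed here, the match is the same length as the primer, so `m.start()` and `m.end()` can be used directly to slice the read.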

In addition, you can also use a simple regular expression to allow alternate characters as in:

import re

# Each position may also be an unreadable 'N'
pattern = re.compile('[TN][AN][AN][TN]')

if pattern.match('TANN'):
    print('found')
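Since the only deviation the question cares about is 'N', such a character-class pattern can be built from a primer automatically with the standard re module (a sketch; `n_tolerant_pattern` is just an illustrative name):

```python
import re

def n_tolerant_pattern(primer):
    # Each read position may also be an unreadable 'N';
    # an 'N' in the primer itself matches any base.
    parts = ('[ACGTN]' if base == 'N' else '[%sN]' % base
             for base in primer)
    return re.compile(''.join(parts))

pat = n_tolerant_pattern('TAGGACG')
m = pat.search('CCTAGGANGGG')
print(m.start())  # -> 2
```

This handles any number of N's at zero extra cost, though unlike the fuzzy options above it does not allow genuine base substitutions.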

Upvotes: 1
