mr.M
mr.M

Reputation: 899

Fuzzy pattern search in a string: the most frequent patterns with d-mismatches

I would like to find all patterns that are 1) the most frequent in a string 2) have at most d-mismatches.

For this given task, I have implemented function that counts for a given patter how many times it occurs in a string with d-mismatches. The idea of this algorithm is based on using convolution of a bit mask of a sub-pattern of a string and a bit mask of a given pattern. It produces the correct results. Here the code for this algorithm:

def create_bit_mask(letter, text):
buf_array=[]
for c in text:
    if c==letter:
        buf_array.append(1)
    else:
        buf_array.append(0)
return buf_array

def convolution(bit_mask1, bit_mask2):
return sum(b*q for b,q in zip(bit_mask1, bit_mask2))

def number_of_occurances_with_at_most_hamming_distance(genome,pattern,hamming_distance):
alphabet=["A","C","G","T"]
matches=0
matrix_of_bit_arrays_for_pattern=[]
matrix_of_bit_arrays_for_genome=[]
buf_output=0
buf=0
positions=""

for symbol in alphabet:
    matrix_of_bit_arrays_for_pattern.append(create_bit_mask(symbol,pattern))
    matrix_of_bit_arrays_for_genome.append(create_bit_mask(symbol, genome))

for i in xrange(len(genome)-len(pattern)+1):
    buf_debug=[]

    buf=sum(convolution(bit_mask_pattern,bit_mask_genome[i:i+len(pattern)]) for bit_mask_pattern, bit_mask_genome in zip(matrix_of_bit_arrays_for_pattern,matrix_of_bit_arrays_for_genome))
    hamming=len(pattern)-buf
    if hamming<=hamming_distance:
        buf_output+=1
        #print "current window: ", genome[i:i+len(pattern)], "pattern :", pattern,"number of mismatches : ", hamming, " @ position : ",i 

return buf_output

Given the function above, the solution for the task should be rather simple i.e.

def fuzzy_frequency():
genome="CACAGTAGGCGCCGGCACACACAGCCCCGGGCCCCGGGCCGCCCCGGGCCGGCGGCCGCCGGCGCCGGCACACCGGCACAGCCGTACCGGCACAGTAGTACCGGCCGGCCGGCACACCGGCACACCGGGTACACACCGGGGCGCACACACAGGCGGGCGCCGGGCCCCGGGCCGTACCGGGCCGCCGGCGGCCCACAGGCGCCGGCACAGTACCGGCACACACAGTAGCCCACACACAGGCGGGCGGTAGCCGGCGCACACACACACAGTAGGCGCACAGCCGCCCACACACACCGGCCGGCCGGCACAGGCGGGCGGGCGCACACACACCGGCACAGTAGTAGGCGGCCGGCGCACAGCC"
length=10
hamming_distance=2

for i in range(len(genome)-length):
    #print genome[i:i+10], " number of occurances: ", number_of_occurances_with_at_most_hamming_distance(genome, genome[i:i+10],hamming_distance)
    print (genome[i:i+length],number_of_occurances_with_at_most_hamming_distance(genome, genome[i:i+length],hamming_distance))

However, the code above doesn't find the following sub-string:

GCACACAGAC

Could you give me a hit, why? I don't want you to post the correct code, rather give me an idea, where is my mistake (I assume mistake might be in the second function).

P.S. I do realize that the following task I have to solve on Stepic on-line course, but given no feedback from that on-line society that is for the studying group, I have posted my code at StackOverflow.

Upvotes: 2

Views: 1075

Answers (2)

PaulMcG
PaulMcG

Reputation: 63802

Aaaaand, here is an re-based solution:

genome = "CACAGTAGGCGCCGGCACACACAGCCCCGGGCCCCGGGCCGCCCCGGGCCGGCGGCCGCCGGCGCCGGCACACCGGCACAGCCGTACCGGCACAGTAGTACCGGCCGGCCGGCACACCGGCACACCGGGTACACACCGGGGCGCACACACAGGCGGGCGCCGGGCCCCGGGCCGTACCGGGCCGCCGGCGGCCCACAGGCGCCGGCACAGTACCGGCACACACAGTAGCCCACACACAGGCGGGCGGTAGCCGGCGCACACACACACAGTAGGCGCACAGCCGCCCACACACACCGGCCGGCCGGCACAGGCGGGCGGGCGCACACACACCGGCACAGTAGTAGGCGGCCGGCGCACAGCC"
length=10
hamming_distance=2


import re
from itertools import product

charset = "ACGT"

# first walk genome, get all unique N-character patterns
patterns = set()
for i in range(len(genome)-length):
    patterns.add(genome[i:i+length])


# for each pattern, create re's with all possible alternates
allmatches = {}
for p in sorted(patterns):
    options = [[c,"["+charset+"]"] for c in p]
    proditer = product(*options)

    matches = set()
    for pr in proditer:
        # count how many elements in this product start with '['
        # only want up to hamming_distance number of alternatives
        numalts = sum(prpart.startswith('[') for prpart in pr)
        if numalts > hamming_distance:
            continue

        compiled_pattRE = re.compile(''.join(pr))
        for match in filter(None, (compiled_pattRE.match(genome,i) 
                                        for i in range(len(genome)-length+1))):
            matches.add((match.start(), match.group(0)))

    allmatches[p] = matches

for p,matches in sorted(allmatches.items(), key=lambda m: -len(m[1])):
    print p, len(matches)
    for m in sorted(matches):
        print m
    print
    break

prints:

GCGCACACAC 20
(12, 'CGGCACACAC')
(14, 'GCACACACAG')
(126, 'GGGTACACAC')
(138, 'GGGCGCACAC')
(140, 'GCGCACACAC')
(142, 'GCACACACAG')
(213, 'CGGCACACAC')
(215, 'GCACACACAG')
(227, 'GCCCACACAC')
(253, 'GCGCACACAC')
(255, 'GCACACACAC')
(257, 'ACACACACAC')
(272, 'GCGCACAGCC')
(280, 'CCGCCCACAC')
(282, 'GCCCACACAC')
(284, 'CCACACACAC')
(316, 'GGGCGCACAC')
(318, 'GCGCACACAC')
(320, 'GCACACACAC')
(351, 'GCGCACAGCC')

Upvotes: 1

PaulMcG
PaulMcG

Reputation: 63802

I've had similar genome parsing questions posed on the pyparsing list, and I came up with this CloseMatch parser class. It encapsulates much of your string walking and testing code within pyparsing's own string parsing framework, but this still might give you some insights into your own code:

genome = "CACAGTAGGCGCCGGCACACACAGCCCCGGGCCCCGGGCCGCCCCGGGCCGGCGGCCGCCGGCGCCGGCACACCGGCACAGCCGTACCGGCACAGTAGTACCGGCCGGCCGGCACACCGGCACACCGGGTACACACCGGGGCGCACACACAGGCGGGCGCCGGGCCCCGGGCCGTACCGGGCCGCCGGCGGCCCACAGGCGCCGGCACAGTACCGGCACACACAGTAGCCCACACACAGGCGGGCGGTAGCCGGCGCACACACACACAGTAGGCGCACAGCCGCCCACACACACCGGCCGGCCGGCACAGGCGGGCGGGCGCACACACACCGGCACAGTAGTAGGCGGCCGGCGCACAGCC"
length=10
hamming_distance=2

from pyparsing import Token, ParseException
# following from pyparsing.wikispaces.com Examples page
class CloseMatch(Token): 
    """A special subclass of Token that does *close* matches. For each
       close match of the given string, a tuple is returned giving the
       found close match, and a list of mismatch positions."""
    def __init__(self, seq, maxMismatches=1): 
        super(CloseMatch,self).__init__() 
        self.name = seq 
        self.sequence = seq 
        self.maxMismatches = maxMismatches 
        self.errmsg = "Expected " + self.sequence 
        self.mayIndexError = False 
        self.mayReturnEmpty = False 

    def parseImpl( self, instring, loc, doActions=True ): 
        start = loc 
        instrlen = len(instring) 
        maxloc = start + len(self.sequence) 

        if maxloc <= instrlen: 
            seq = self.sequence 
            seqloc = 0 
            mismatches = [] 
            throwException = False 
            done = False 
            while loc < maxloc and not done: 
                if instring[loc] != seq[seqloc]: 
                    mismatches.append(seqloc) 
                    if len(mismatches) > self.maxMismatches: 
                        throwException = True 
                        done = True 
                loc += 1 
                seqloc += 1 
        else: 
            throwException = True 

        if throwException: 
            #~ exc = self.myException 
            #~ exc.loc = loc 
            #~ exc.pstr = instring 
            #~ raise exc 
            raise ParseException(instring, loc, self.errmsg)

        return loc, (instring[start:loc],mismatches) 


# first walk genome, get all unique N-character patterns
patterns = set()
for i in range(len(genome)-length):
    patterns.add(genome[i:i+length])
print len(patterns)

# use pyparsing's CloseMatch to find close matches - each match
# returns the substring and the list of mismatch locations
matches = {}
for p in sorted(patterns):
    matcher = CloseMatch(p, hamming_distance)
    matches[p] = list(matcher.scanString(genome, overlap=True))

# Now list out all patterns and number of close matches - for the most
# commonly matched pattern, dump out all matches, where they occurred and
# an annotated match showing the mismatch locations
first = True
for p in sorted(matches, key=lambda m: -len(matches[m])):
    if first:
        first = False
        for matchdata in matches[p]:
            matchvalue, start, end = matchdata
            substring,mismatches = matchvalue[0]
            print ' ', substring, 'at', start
            if mismatches:
                print ' ', ''.join('^' if i in mismatches else ' ' for i in range(length))
            else:
                print ' ', "***EXACT***"
            print
    print p, len(matches[p])

There are 254 unique 10-character patterns in the given genome.

Here is the output for the most commonly matched pattern:

  CGGCACACAC at 12
  ^^        

  GCACACACAG at 14
    ^      ^

  GGGTACACAC at 126
   ^ ^      

  GGGCGCACAC at 138
   ^  ^     

  GCGCACACAC at 140
  ***EXACT***

  GCACACACAG at 142
    ^      ^

  CGGCACACAC at 213
  ^^        

  GCACACACAG at 215
    ^      ^

  GCCCACACAC at 227
    ^       

  GCGCACACAC at 253
  ***EXACT***

  GCACACACAC at 255
    ^       

  ACACACACAC at 257
  ^ ^       

  GCGCACAGCC at 272
         ^^ 

  CCGCCCACAC at 280
  ^   ^     

  GCCCACACAC at 282
    ^       

  CCACACACAC at 284
  ^ ^       

  GGGCGCACAC at 316
   ^  ^     

  GCGCACACAC at 318
  ***EXACT***

  GCACACACAC at 320
    ^       

  GCGCACAGCC at 351
         ^^ 

Upvotes: 1

Related Questions