dk09

Reputation: 87

Count occurrences of a fuzzy sequence within a larger DNA sequence

I am trying to count how many times a short DNA sequence (R) occurs within a larger sequence (F), but a few characters of R can vary. The easiest approach I could think of is to set a similarity ratio for R and count every hit in F above 80%, but the only functions that seem to do this (e.g. difflib's SequenceMatcher or get_close_matches) need lists to work, and I can't break F into any such lists. Any ideas?

EDIT 2: More info as requested.

A set number of repeats (R) exist in a DNA fragment (F). F is 353 characters long and a single repeat is 15 characters long. No overlaps should occur, as R is distinct enough not to overlap with itself. The problem is that R can vary: up to 2 of its 15 characters can change. I need to be able to detect these variations and any future variations that might occur, and I am trying to avoid keeping a separate database full of the variations of R. The variable characters may not always be in the same positions either, so using a regex like:

re.findall(pattern = "CTGCTTGGCGGG[TC]T[CG]", string = fragment)

can't work. Also, here is what I was using when trying it through difflib:

difflib.get_close_matches(repeat, fragment, cutoff = 0.85)

a repeat would be CTGCTTGGCGGGTTC and the DNA fragment would be AAAATTGCGGCATGTGGGCTGACTCTGAAAGCGATGCTCACGAAAAGGGAACGCGGCGCCGTCGGGCGCCGCGCGCCGCTTAGGACTGCTGGCCTGCGGCCGGCGCCTGCTTGGCGGGTTCCTGCTTGGCGGGCTCCTGCTTGGCGGGTTCCTGCTTGGCGGGTTCCTGCTTGGCGGGTTCCTGCTTGGCGGGCTGCTGCTTGGCGGGCTGCTGGGCCGGCGCCTGCTGGCCAGGAGCGGGCTGCTGGCCGGCAGGCGCCGCGCCCCCCTTGTTCCAGGGCGAAGCCTGCACCGGCGCCCCCGGACGGATCTTCTGGAAGCCTTCGACCACCACCACGTCTCCCGCCGCCAGG.

By repeat, I mean that R is repeated multiple times in the DNA fragment.

Thanks.

Upvotes: 1

Views: 455

Answers (1)

BioGeek

Reputation: 22857

Your question is a bit short on details, so I have made a few assumptions.

If you can rewrite R as a list of lists, then you can just calculate all possible variations of R and look for those in F:

import re
from itertools import product

R = [['CTGCTTGGCGGG'] , ['T', 'C'], ['T'], ['C', 'G']]

F = 'AAAATTGCGGCATGTGGGCTGACTCTGAAAGCGATGCTCACGAAAAGGGAACGCGGCGCC' +\
    'GTCGGGCGCCGCGCGCCGCTTAGGACTGCTGGCCTGCGGCCGGCGCCTGCTTGGCGGGTT' +\
    'CCTGCTTGGCGGGCTCCTGCTTGGCGGGTTCCTGCTTGGCGGGTTCCTGCTTGGCGGGTT' +\
    'CCTGCTTGGCGGGCTGCTGCTTGGCGGGCTGCTGGGCCGGCGCCTGCTGGCCAGGAGCGG' +\
    'GCTGCTGGCCGGCAGGCGCCGCGCCCCCCTTGTTCCAGGGCGAAGCCTGCACCGGCGCCC' +\
    'CCGGACGGATCTTCTGGAAGCCTTCGACCACCACCACGTCTCCCGCCGCCAGG'

for repeat in product(*R):
    repeat = ''.join(repeat)
    matches = re.findall(repeat, F)
    if matches:
        print("The repeat '{}' is found {} time(s)".format(repeat, len(matches)))

This gives the following result:

The repeat 'CTGCTTGGCGGGTTC' is found 4 time(s)
The repeat 'CTGCTTGGCGGGCTC' is found 1 time(s)
The repeat 'CTGCTTGGCGGGCTG' is found 2 time(s)
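
The edit to the question notes that the variable characters may not always sit at the same positions, which enumerating fixed variations does not cover. One possible way to handle that, sketched here under assumptions, is to slide a 15-character window along F and accept any window that differs from R in at most two positions (a Hamming distance check). The names hamming, find_fuzzy and max_mismatches are made up for this illustration, and jumping past each hit relies on the question's statement that repeats do not overlap.

```python
from collections import Counter

REPEAT = 'CTGCTTGGCGGGTTC'
FRAGMENT = (
    'AAAATTGCGGCATGTGGGCTGACTCTGAAAGCGATGCTCACGAAAAGGGAACGCGGCGCC'
    'GTCGGGCGCCGCGCGCCGCTTAGGACTGCTGGCCTGCGGCCGGCGCCTGCTTGGCGGGTT'
    'CCTGCTTGGCGGGCTCCTGCTTGGCGGGTTCCTGCTTGGCGGGTTCCTGCTTGGCGGGTT'
    'CCTGCTTGGCGGGCTGCTGCTTGGCGGGCTGCTGGGCCGGCGCCTGCTGGCCAGGAGCGG'
    'GCTGCTGGCCGGCAGGCGCCGCGCCCCCCTTGTTCCAGGGCGAAGCCTGCACCGGCGCCC'
    'CCGGACGGATCTTCTGGAAGCCTTCGACCACCACCACGTCTCCCGCCGCCAGG'
)


def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_fuzzy(repeat, fragment, max_mismatches=2):
    """Return (start, window, mismatches) for each non-overlapping fuzzy hit.

    Assumption: hits do not overlap, so after a hit the scan jumps
    past the whole matched window instead of advancing by one.
    """
    hits = []
    k = len(repeat)
    i = 0
    while i + k <= len(fragment):
        window = fragment[i:i + k]
        mismatches = hamming(repeat, window)
        if mismatches <= max_mismatches:
            hits.append((i, window, mismatches))
            i += k   # skip the matched window (no overlaps assumed)
        else:
            i += 1   # no hit here, try the next start position
    return hits


hits = find_fuzzy(REPEAT, FRAGMENT)
print("Total hits: {}".format(len(hits)))
for variant, count in Counter(window for _, window, _ in hits).most_common():
    print("The repeat '{}' is found {} time(s)".format(variant, count))
```

On the fragment from the question this reports seven hits in total: four of CTGCTTGGCGGGTTC, two of CTGCTTGGCGGGCTG and one of CTGCTTGGCGGGCTC. This only handles substitutions; insertions or deletions inside a repeat would need an edit-distance comparison instead.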

Upvotes: 0
