Count occurrences of a fuzzy sequence within a larger DNA sequence

Question

I am trying to count how many times a short DNA sequence (R) occurs within a larger sequence (F), but a few characters in R can vary. The easiest approach I could think of is to set a similarity ratio for R and count every hit in F that scores above 80%, but the only functions that seem to do this (e.g. difflib's SequenceMatcher or get_close_matches) need lists to work with. I can't break F into any such lists. Any ideas?

EDIT 2: More info as requested.

A set number of repeats (R) exist in a DNA fragment (F). F is 353 characters long and a single repeat is 15 characters long. No overlaps should occur, as R is distinct enough not to overlap with itself. The problem is that R is variable: up to 2 of its 15 characters can change. I need to be able to detect these variations and any future variations that might occur, and I am trying to avoid keeping a separate database of every variant of R. The variable characters may not always be in the same positions either, so using a regex like:

re.findall(pattern="CTGCTTGGCGGG[TC]T[CG]", string=fragment)

won't work. Also, here is what I tried with difflib:

difflib.get_close_matches(repeat, fragment, cutoff=0.85)

Here, repeat would be CTGCTTGGCGGGTTC and fragment would be AAAATTGCGGCATGTGGGCTGACTCTGAAAGCGATGCTCACGAAAAGGGAACGCGGCGCCGTCGGGCGCCGCGCGCCGCTTAGGACTGCTGGCCTGCGGCCGGCGCCTGCTTGGCGGGTTCCTGCTTGGCGGGCTCCTGCTTGGCGGGTTCCTGCTTGGCGGGTTCCTGCTTGGCGGGTTCCTGCTTGGCGGGCTGCTGCTTGGCGGGCTGCTGGGCCGGCGCCTGCTGGCCAGGAGCGGGCTGCTGGCCGGCAGGCGCCGCGCCCCCCTTGTTCCAGGGCGAAGCCTGCACCGGCGCCCCCGGACGGATCTTCTGGAAGCCTTCGACCACCACCACGTCTCCCGCCGCCAGG.
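To illustrate the difflib problem, here is a minimal sketch using a short contiguous excerpt of the fragment rather than the full 353 characters. The names `excerpt`, `windows` and `close` are only for illustration. When a plain string is passed as the possibilities argument, difflib iterates over it one character at a time, so nothing can reach the cutoff. Slicing the string into overlapping 15-character windows does give it a list, but the ratio also tolerates insertions and deletions, so windows shifted by one base score highly too and this does not give a clean count.

```python
import difflib

repeat = "CTGCTTGGCGGGTTC"
# Short excerpt of the fragment: 5 flanking bases, two repeats,
# then the first 5 bases of the next repeat.
excerpt = "GGCGC" + "CTGCTTGGCGGGTTC" + "CTGCTTGGCGGGCTC" + "CTGCT"

# A str is iterated character by character, so every candidate is a
# single base and none of them reaches the cutoff.
print(difflib.get_close_matches(repeat, excerpt, cutoff=0.85))  # []

# Overlapping windows of len(repeat) give difflib a real list to search.
k = len(repeat)
windows = [excerpt[i:i + k] for i in range(len(excerpt) - k + 1)]

# n defaults to 3, so raise it to see every window above the cutoff.
close = difflib.get_close_matches(repeat, windows, n=len(windows), cutoff=0.85)

# This list holds the true repeats plus windows shifted by one base,
# so its length over-counts the repeats.
print(len(close), close)
```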

By repeat, I mean that R is repeated multiple times in the DNA fragment.
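To make the requirement concrete, here is a minimal sketch of the kind of counting I am after, not a tested solution. It assumes the variation is substitutions only (no insertions or deletions) and that hits never overlap. The names `hamming`, `find_fuzzy_repeats` and `max_mismatches` are made up for this example, and the input is just the repeat region of the fragment with a few flanking bases on each side.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_fuzzy_repeats(fragment, repeat, max_mismatches=2):
    """Scan left to right and return (position, window) for every
    non-overlapping window within max_mismatches substitutions of repeat."""
    hits = []
    k = len(repeat)
    i = 0
    while i + k <= len(fragment):
        window = fragment[i:i + k]
        if hamming(window, repeat) <= max_mismatches:
            hits.append((i, window))
            i += k  # skip past this hit so hits cannot overlap
        else:
            i += 1
    return hits


repeat = "CTGCTTGGCGGGTTC"

# Repeat region taken from the fragment, with a little flanking sequence.
excerpt = (
    "GGCCGGCGC"
    "CTGCTTGGCGGGTTC" "CTGCTTGGCGGGCTC" "CTGCTTGGCGGGTTC"
    "CTGCTTGGCGGGTTC" "CTGCTTGGCGGGTTC" "CTGCTTGGCGGGCTG"
    "CTGCTTGGCGGGCTG"
    "CTGGGCCGGCGC"
)

hits = find_fuzzy_repeats(excerpt, repeat, max_mismatches=2)
print(len(hits))  # 7
for position, window in hits:
    print(position, window, hamming(window, repeat))
```

Allowing 2 mismatches out of 15 corresponds to a per-position identity of about 87%, which is roughly the cutoff I was aiming for with difflib.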

Thanks.