Reputation: 87
I am trying to count how many times a short DNA sequence (R) occurs within a larger sequence (F), but a few characters in R can vary. The easiest way I could think of would be to set a similarity ratio for R and count all hits above 80% in F, but the only functions that seem to do this (e.g. difflib's SequenceMatcher or get_close_matches) need lists to work, and I can't break F into any such lists. Any ideas?
EDIT 2: More info as requested.
A set number of repeats (R) exist in a DNA fragment (F). F is 353 characters long and a single repeat is 15 characters long. No overlaps should occur, as R is distinct enough not to overlap with itself. The problem is that R can be variable: up to 2 of the 15 characters can change. I need to be able to detect these variations and any future variations that might occur, and I am trying to avoid keeping a separate database full of these variations of R. The variable characters may not always be in the same positions either, so using a regex like:
re.findall(pattern = "CTGCTTGGCGGG[TC]T[CG]", string = fragment)
can't work. Also, here is what I was using when trying it through difflib:
difflib.get_close_matches(repeat, fragment, cutoff = 0.85)
A repeat would be CTGCTTGGCGGGTTC, and the DNA fragment would be:
AAAATTGCGGCATGTGGGCTGACTCTGAAAGCGATGCTCACGAAAAGGGAACGCGGCGCCGTCGGGCGCCGCGCGCCGCTTAGGACTGCTGGCCTGCGGCCGGCGCCTGCTTGGCGGGTTCCTGCTTGGCGGGCTCCTGCTTGGCGGGTTCCTGCTTGGCGGGTTCCTGCTTGGCGGGTTCCTGCTTGGCGGGCTGCTGCTTGGCGGGCTGCTGGGCCGGCGCCTGCTGGCCAGGAGCGGGCTGCTGGCCGGCAGGCGCCGCGCCCCCCTTGTTCCAGGGCGAAGCCTGCACCGGCGCCCCCGGACGGATCTTCTGGAAGCCTTCGACCACCACCACGTCTCCCGCCGCCAGG
By repeat, I mean that R is repeated multiple times in the DNA fragment.
Thanks.
Upvotes: 1
Views: 455
Reputation: 22857
Your question is a bit short on details, so I have made a few assumptions.
If you can rewrite R as a list of lists, then you can just calculate all possible variations of R and look for those in F:
import re
from itertools import product

# Each inner list holds the alternatives allowed at that part of the repeat.
R = [['CTGCTTGGCGGG'], ['T', 'C'], ['T'], ['C', 'G']]
F = 'AAAATTGCGGCATGTGGGCTGACTCTGAAAGCGATGCTCACGAAAAGGGAACGCGGCGCC' +\
    'GTCGGGCGCCGCGCGCCGCTTAGGACTGCTGGCCTGCGGCCGGCGCCTGCTTGGCGGGTT' +\
    'CCTGCTTGGCGGGCTCCTGCTTGGCGGGTTCCTGCTTGGCGGGTTCCTGCTTGGCGGGTT' +\
    'CCTGCTTGGCGGGCTGCTGCTTGGCGGGCTGCTGGGCCGGCGCCTGCTGGCCAGGAGCGG' +\
    'GCTGCTGGCCGGCAGGCGCCGCGCCCCCCTTGTTCCAGGGCGAAGCCTGCACCGGCGCCC' +\
    'CCGGACGGATCTTCTGGAAGCCTTCGACCACCACCACGTCTCCCGCCGCCAGG'

# Build every variant of R and count its non-overlapping occurrences in F.
for repeat in product(*R):
    repeat = ''.join(repeat)
    matches = re.findall(repeat, F)
    if matches:
        print("The repeat '{}' is found {} time(s)".format(repeat, len(matches)))
This gives the following result:
The repeat 'CTGCTTGGCGGGTTC' is found 4 time(s)
The repeat 'CTGCTTGGCGGGCTC' is found 1 time(s)
The repeat 'CTGCTTGGCGGGCTG' is found 2 time(s)
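If the variable positions are not known in advance, enumerating variants like this no longer works. One alternative, which is a different technique from the product-based search above, is to slide a 15-character window along F and count mismatches against R (the Hamming distance). The following is only a minimal sketch: the names hamming, find_fuzzy_repeats and max_mismatches are made up for illustration, and it assumes substitutions only (no insertions or deletions) and that hits do not overlap, as stated in the question. The toy fragment is just two repeats padded with A's and T's, not the real F.

```python
def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_fuzzy_repeats(repeat, fragment, max_mismatches=2):
    """Return (start, window) pairs that differ from `repeat` in at most
    `max_mismatches` positions. Assumes hits do not overlap."""
    hits = []
    n = len(repeat)
    i = 0
    while i + n <= len(fragment):
        window = fragment[i:i + n]
        if hamming(repeat, window) <= max_mismatches:
            hits.append((i, window))
            i += n  # jump past the hit, since repeats are assumed not to overlap
        else:
            i += 1
    return hits


# Toy input: one exact repeat followed by a variant with two substitutions.
toy_fragment = "AAAA" + "CTGCTTGGCGGGTTC" + "CTGCTTGGCGGGCTG" + "TTTT"
for start, window in find_fuzzy_repeats("CTGCTTGGCGGGTTC", toy_fragment):
    print("Hit at index {}: {}".format(start, window))
```

Calling find_fuzzy_repeats(repeat, F) on the real fragment works the same way. Whether 2 mismatches is the right threshold is a judgment call: a looser cutoff may start picking up unrelated GC-rich stretches of F.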
Upvotes: 0