Reputation: 10709
I am a complete Python noob so please forgive my simple question. I am trying to write a script that will find all sequences in a huge string that match ATxxxCA, ATxxxxCA, ATxxxxxCA, or ATxxxxxxCA where x can be any character. When the ATxxxCA pattern is matched, I would then like the script to then capture the previous 10 and next 10 characters surrounding the matched ATxxxCA. For example, the result might look like this: aaaaaaaaaaATxxxCAbbbbbbbbbb
I attempted to start the script like this:
SeqMatch = input("enter DNA sequence to search: ")
for s in re.findall(r'AT(.*?)CA', SeqMatch):
    if len(s) is < 10:
        print(s)
    else:
        print('no sequence matches')
I seem to be doing something wrong in my if loop? Can anyone help? Thanks in advance!
Upvotes: 1
Views: 341
Reputation: 27585
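Take care of overlaps. findall() consumes the 10 trailing characters of each match, so a motif that begins inside them is skipped:
import re

adn = ('TCGCGCCCCCCCCCCATCAAGACATGGTTTTTTTTTTATTTATCAGATTACAGATACA'
       'GTTATGGGGGGGGGGATATACAGATGCATAGCGATTAGCCTAGCTA')

# First attempt: one pattern with both flanks; it misses ATTTATCA
regx = re.compile('(.{10})(AT.{3,6}CA)(.{10})')
res = regx.findall(adn)
for u in res:
    print(u)
print()

# Second attempt: match only the leading flank, then slice the trailing one
pat = re.compile('(.{10})(AT.{3,6}CA)')
li = []
for mat in pat.finditer(adn):
    x = mat.end()
    li.append(mat.groups() + (adn[x:x+10],))
for u in li:
    print(u)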
Result:
('CCCCCCCCCC', 'ATCAAGACA', 'TGGTTTTTTT')
('GGGGGGGGGG', 'ATATACA', 'GATGCATAGC')

('CCCCCCCCCC', 'ATCAAGACA', 'TGGTTTTTTT')
('TTTTTTTTTT', 'ATTTATCA', 'GATTACAGAT')
('GGGGGGGGGG', 'ATATACA', 'GATGCATAGC')
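If the leading flank of one motif can also overlap an earlier match, a zero-width lookahead is one way to avoid consuming any characters at all. The following is only a sketch under that assumption: the helper name `find_with_flanks` and its parameters are made up for illustration, and it drops motifs that do not have a full flank on both sides.

```python
import re

def find_with_flanks(seq, flank=10, min_gap=3, max_gap=6):
    """Return (before, motif, after) tuples for every AT...CA motif.

    The lookahead (?=...) matches without consuming characters, so
    motifs and flanks that overlap an earlier hit are still reported.
    """
    hits = []
    for gap in range(min_gap, max_gap + 1):
        # One pattern per gap length, so every length is tried at every position.
        pattern = re.compile(r'(?=(AT.{%d}CA))' % gap)
        for m in pattern.finditer(seq):
            start, end = m.start(1), m.end(1)
            # Skip motifs too close to either end to have a full flank.
            if start < flank or end + flank > len(seq):
                continue
            hits.append((start, seq[start - flank:start], m.group(1),
                         seq[end:end + flank]))
    hits.sort()  # report in order of position
    return [h[1:] for h in hits]

# Two motifs back to back: each one sits inside the other's flank.
demo = 'a' * 10 + 'ATxxxCA' + 'ATyyyCA' + 'b' * 10
for hit in find_with_flanks(demo):
    print(hit)
```

Because every gap length is tried separately, this can report more hits than a single greedy pattern would, which may or may not be what you want.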
Upvotes: 1
Reputation: 65639
I seem to be doing something wrong in my if loop?
Python cannot parse "is <": "is" is the identity operator and cannot be combined with "<", so the line is a syntax error.
Remove "is" from your if check:
if len(s) < 10:
    print(s)
else:
    print('no sequence matches')
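For reference, here is a minimal complete version of the loop with that fix applied. It is a sketch, not the only way to write it: the function name `short_gaps` and the sample string are invented for the example, and it assumes "no sequence matches" should be printed once when nothing short enough is found, rather than once per long match.

```python
import re

def short_gaps(seq, limit=10):
    # Non-greedy .*? stops at the first CA after each AT.
    gaps = re.findall(r'AT(.*?)CA', seq)
    # '<' compares the lengths; no 'is' is needed (or allowed) here.
    return [g for g in gaps if len(g) < limit]

found = short_gaps("ggATxxxCAggATyyyyCAgg")
if found:
    for g in found:
        print(g)
else:
    print('no sequence matches')
```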
You also said:
When the ATxxxCA pattern is matched, I would then like the script to then capture the previous 10 and next 10 characters surrounding the matched ATxxxCA. For example, the result might look like this: aaaaaaaaaaATxxxCAbbbbbbbbbb
If you want to capture the preceding and following 10 characters, change your regex to
(.{10})AT(.{3,6})CA(.{10})
You'll get your 10 a's in the first group, followed by the stuff in between AT and CA, followed by your 10 b's. The bounded .{3,6} matters here: a plain .* is greedy and would run on to the last CA in the string.
Or you can capture all of it by using one set of parentheses around the whole thing
(.{10}AT.{3,6}CA.{10})
Regexpal is a godsend for creating/debugging regexes.
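To see how much the quantifier between AT and CA matters, here is a small comparison of an unbounded `.*` with a bounded `.{3,6}` on a made-up test string (the string and variable names are assumptions for illustration only):

```python
import re

# Two motifs, each with at least 10 filler characters on both sides.
s = "a" * 10 + "ATxxxCA" + "b" * 20 + "ATyyyyCA" + "c" * 10

# Greedy .* stretches from the first AT to the last CA in the string.
greedy = re.findall(r'(.{10})AT(.*)CA(.{10})', s)

# .{3,6} allows only 3 to 6 characters between AT and CA.
bounded = re.findall(r'(.{10})AT(.{3,6})CA(.{10})', s)

print(len(greedy), greedy[0][1])
print(len(bounded), [mid for _, mid, _ in bounded])
```

The greedy pattern returns a single match that swallows both motifs, while the bounded one returns each motif separately.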
Upvotes: 1
Reputation: 43034
Here's an example:
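import re

s = "a"*20 + "ATxxxxCA" + "b"*20
rec = re.compile(r'(AT.{3,6}CA)')
mo = rec.search(s)  # first match only; None if nothing matches
if mo:
    # Clamp the left edge so a match near the start cannot wrap around
    print(s[max(0, mo.start()-10):mo.end()+10])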
Upvotes: 0