drbunsen

Reputation: 10709

Python len function question

I am a complete Python noob, so please forgive my simple question. I am trying to write a script that will find all sequences in a huge string that match ATxxxCA, ATxxxxCA, ATxxxxxCA, or ATxxxxxxCA, where x can be any character. When the ATxxxCA pattern is matched, I would then like the script to capture the previous 10 and next 10 characters surrounding the matched ATxxxCA. For example, the result might look like this: aaaaaaaaaaATxxxCAbbbbbbbbbb

I attempted to start the script like this:

SeqMatch = input("enter DNA sequence to search: ")
for s in re.findall(r'AT(.*?)CA', SeqMatch):
    if len(s) is < 10:
        print(s)
    else:
        print('no sequence matches')

I seem to be doing something wrong in my if loop? Can anyone help? Thanks in advance!

Upvotes: 1

Views: 341

Answers (3)

eyquem

Reputation: 27585

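Take care of overlaps: if the pattern itself consumes the 10 characters after a match, another match that starts inside those 10 characters is skipped.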

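import re

adn = ('TCGCGCCCCCCCCCCATCAAGACATGGTTTTTTTTTTATTTATCAGATTACAGATACA'
       'GTTATGGGGGGGGGGATATACAGATGCATAGCGATTAGCCTAGCTA')

# First attempt: the pattern consumes the 10 trailing characters,
# so a match that starts inside them is missed.
regx = re.compile(r'(.{10})(AT.{3,6}CA)(.{10})')
res = regx.findall(adn)
for u in res:
    print(u)

print()

# Second attempt: match only the prefix and the motif, then slice
# the 10 following characters by hand so they are not consumed.
pat = re.compile(r'(.{10})(AT.{3,6}CA)')
li = []
for mat in pat.finditer(adn):
    x = mat.end()
    li.append(mat.groups() + (adn[x:x+10],))
for u in li:
    print(u)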

result

('CCCCCCCCCC', 'ATCAAGACA', 'TGGTTTTTTT')
('GGGGGGGGGG', 'ATATACA', 'GATGCATAGC')

('CCCCCCCCCC', 'ATCAAGACA', 'TGGTTTTTTT')
('TTTTTTTTTT', 'ATTTATCA', 'GATTACAGAT')
('GGGGGGGGGG', 'ATATACA', 'GATGCATAGC')
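
A possible alternative to slicing by hand is to put the trailing 10 characters inside a lookahead, which captures them without consuming them. This is only a sketch on a made-up test string (`s` below is hypothetical, not the sequence above). Unlike the slicing version, it requires a full 10 characters after the motif, so a match near the end of the string would be dropped.

```python
import re

# Hypothetical test string: two motifs separated by exactly 10 characters.
s = 'c' * 10 + 'ATgggCA' + 't' * 10 + 'ATgggCA' + 'c' * 10

# The trailing group is consumed, so the second motif has no prefix left.
consuming = re.findall(r'(.{10})(AT.{3,6}CA)(.{10})', s)

# The trailing group sits in a lookahead: captured, but not consumed.
lookahead = re.findall(r'(.{10})(AT.{3,6}CA)(?=(.{10}))', s)

print(len(consuming))  # 1
print(len(lookahead))  # 2
for hit in lookahead:
    print(hit)
```

The leading 10 characters are still consumed in both versions, so two motifs closer together than 10 characters would still yield only one hit.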

Upvotes: 1

Doug T.

Reputation: 65639

I seem to be doing something wrong in my if loop?

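Python doesn't know what the meaning of "is" is (in this context). "is" is the identity operator, so writing "is <" puts two operators in a row and raises a SyntaxError.

Remove "is" from your if check: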

if len(s) < 10:
    print(s)
else:
    print('no sequence matches')
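
Putting that fix into a complete script, here is a minimal sketch assuming Python 3. The names `short_matches` and `max_len` and the sample string are made up for illustration and are not from the question. It collects the matches first, so 'no sequence matches' is printed once when nothing qualifies rather than once per long match.

```python
import re

def short_matches(seq, max_len=10):
    """Return the text between each AT...CA pair that is shorter than max_len."""
    # Non-greedy (.*?) stops at the first CA after each AT.
    return [s for s in re.findall(r'AT(.*?)CA', seq) if len(s) < max_len]

# Hypothetical input: one short gap (3 chars) and one long gap (12 chars).
seq = "GGATxxxCAGGATxxxxxxxxxxxxCAGG"
found = short_matches(seq)
if found:
    for s in found:
        print(s)
else:
    print('no sequence matches')
```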

You also said:

When the ATxxxCA pattern is matched, I would then like the script to capture the previous 10 and next 10 characters surrounding the matched ATxxxCA. For example, the result might look like this: aaaaaaaaaaATxxxCAbbbbbbbbbb

If you want to capture the preceding and following 10 characters, change your regex to

 (.{10})AT(.*)CA(.{10})

You'll get your ten a's in one group, followed by the stuff in between AT and CA, followed by your ten b's.

Or you can capture all of it by using one set of parentheses around the whole thing

 (.{10}AT.*CA.{10})
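
One caveat worth checking before using these patterns as written: `.*` is greedy and unbounded, so on a string with several motifs it can run from the first AT to the last CA. The question asks for 3 to 6 characters between AT and CA, which `.{3,6}` expresses. A small sketch comparing the two, using a made-up test string (`s` is an assumption, not data from the question):

```python
import re

# Hypothetical test string: two motifs, with 20 characters between them.
s = "a" * 10 + "ATxxxCA" + "b" * 20 + "ATyyyyCA" + "c" * 10

# Greedy, unbounded: swallows everything up to the last CA.
greedy = re.findall(r'(.{10})AT(.*)CA(.{10})', s)

# Bounded to 3-6 characters, as the question requires.
bounded = re.findall(r'(.{10})AT(.{3,6})CA(.{10})', s)

print(len(greedy))   # 1
print(len(bounded))  # 2
for hit in bounded:
    print(hit)
```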

Regexpal is a godsend for creating and debugging regexes.

Upvotes: 1

Keith

Reputation: 43034

Here's an example:

import re

s = "a"*20 + "ATxxxxCA" + "b"*20
rec = re.compile(r'(AT.{3,6}CA)')
mo = rec.search(s)
# max() keeps the start index from going negative near the beginning of s
print(s[max(mo.start()-10, 0):mo.end()+10])

Upvotes: 0
