Irek
Irek

Reputation: 439

Python, find the length of the (n)n string by regex

i have a code looking like this:

import HTSeq
reference = open('genome.fa','r')
sequences = dict( (s.name, s) for s in HTSeq.FastaReader(reference))
out = open('homopolymers_in_ref','w')

def find_all(a_str,sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1: return
        yield start
        start += len(sub)
homa = 'AAAAAAAAAA'
homc = 'CCCCCCCCCC'
homg = 'GGGGGGGGGG'
homt = 'TTTTTTTTTT'
for key,line in sequences.items():
    seq = str(line)
    a= list(find_all(seq,homa))
    c = list(find_all(seq,homc))
    g = list(find_all(seq,homg))
    t = list(find_all(seq,homt))
    for i in a:
##        print i,key,'A'
        out.write(str(i)+'\t'+str(key)+'\t'+'A'+'\n')
    for i in c:
        out.write(str(i)+'\t'+str(key)+'\t'+'C'+'\n')
##        print i,key,'C'
    for i in g:
        out.write(str(i)+'\t'+str(key)+'\t'+'G'+'\n')
    for i in t:
        out.write(str(i)+'\t'+str(key)+'\t'+'T'+'\n')
out.close()

I used HTSeq to open the reference. What it does - it looks for simple homopolymers of length 10 and outputs start position, chromosome and type (A,C,T,G,).

THE sequence always looks like: ACCGCTACGATCGATCGAAAAAAAAAAAAAAAAAACGATCGAC sometimes it contains N

so our homopolymer that we are looking for is: AAAAAAAAAA (or other composed of only C,G,T)

Basically the help from you is about the find_all function only: Now what I would like to change is finding the length of each homopolymer. Since, right now if a homopolymer has a length 15, my script can't tell it. I was thinking of doing it by some sort of regex, namely: find at least 10 bp like now and compute length by adding +1 to it until the next base isn't like those in homopolymer.

Any suggestions how to use a regex to do it in python?

Upvotes: 2

Views: 965

Answers (1)

arshajii
arshajii

Reputation: 129537

If you want to do this with a regex you can try something like:

>>> import re
>>> seq = 'ACCGCTACGATCGATCGAAAAAAAAAAAAAAAAAACGATCGAC'
>>>
>>> [(m.group(), m.start())
...     for m in re.finditer(r'([ACGT])\1{9,}', seq)
...         if len(m.group()) >= 10]
[('AAAAAAAAAAAAAAAAAA', 17)]

This produces a list of (sequence, start_index) tuples.

Upvotes: 6

Related Questions