kish-an
kish-an

Reputation: 75

Regex - Counting greatest number of short tandem repeats

Im looping through a list of Short Tandem Repeats and trying to find the greatest amount of times they occur consecutively in a sequence of DNA.

Sequence:

AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGAGATCAGATCAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAGATCAGATC
STRs = ['AGATC', 'AATG', 'TATC']

for STR in STRs:
    max_repeats = len(re.findall(f'(?<={STR}){STR}', sequence))
    print(max_repeats)

The greatest amount of consecutive repeats in the sequence is 4 STR

My current regex is only returning 3 matches because of the positive lookbehind, I'm also not too sure how to modify my code to capture the greatest amount of repeats (4) without also including the other repeat groups, is it possible to retrieve a nested list of the matches which I could iterate over to find the greatest number like below.

[[AGATC, AGATC, AGATC, AGATC], [AGATAC,AGATC], [AGATC, AGATC]]

Upvotes: 2

Views: 1250

Answers (4)

Raymond Hettinger
Raymond Hettinger

Reputation: 226486

Here's one way to approach the problem:

>>> STRs = ['AGATC', 'AATG', 'TATC']
>>> pattern = '|'.join(f'({tgt})+' for tgt in STRs)
>>> for mo in re.finditer(pattern, seq):
         print(mo.group(0))

AGATCAGATCAGATCAGATC
TATCTATCTATCTATCTATC
AGATCAGATC
AATG
AGATCAGATC

Key ideas:

1) The core pattern used + for a group to allow for consecutive repeats:

(AGATC)+

2) The patterns are combined with | to allow any STR to match:

>>> pattern
'(AGATC)+|(AATG)+|(TATC)+'

3) The re.finditer() call gives the match objects one at a time.

4) If needed the match object can give other information like start and stop points to compute length or a tuple to show which STR matched:

>>> mo.group(0)
'AGATCAGATC'
>>> mo.span()
(171, 181)
>>> mo.groups()
('AGATC', None, None)

How to count the maximum repetitions:

>>> tracker = dict.fromkeys(STRs, 0)
>>> for mo in re.finditer(pattern, seq):
        start, end = mo.span()
        STR = next(filter(None, mo.groups()))
        reps = (end - start) // len(STR)
        tracker[STR] = max(tracker[STR], reps)

>>> tracker
{'AGATC': 4, 'AATG': 1, 'TATC': 5}

Upvotes: 1

Tomalak
Tomalak

Reputation: 338316

When there is nothing dynamic in the string that you're searching for, and especially when the string that you are searching in is large, I would try to find a solution that avoids regex, because basic string methods typically are significantly faster than regex.

The following is probably un-Pythonic in more ways than one (improvement suggestions welcome), but the underlying assumption is that str.split() massively outperforms regex, and the arithmetic to calculate string positions takes negligible time.

def find_tandem_repeats(sequence, search):
    """ searches through an DNA sequence and returns (position, repeats) tuples """
    if sequence == '' or search == '':
        return

    lengths = list(map(len, sequence.split(search)))
    pos = lengths[0]
    repeats = 0
    pending = False

    for l in lengths[1:]:
        if l == 0:
            pending = True
            repeats += 1
            continue
        repeats += 1
        yield (pos, repeats)
        pos += l + len(search) * repeats
        repeats = 0
        pending = False

    if pending:
        yield (pos, repeats)

usage:

data = "AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGAGATCAGATCAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAGATCAGATC"
positions = list(find_tandem_repeats(data, 'AGATC'))

for position in positions:
    print("at index %s, repeats %s times" % position)

max_position = max(positions, key=lambda x: x[1])
print("maximum at index %s, repeats %s times" % max_position)

Output

at index 55, repeats 4 times
at index 104, repeats 2 times
at index 171, repeats 2 times
maximum at index 55, repeats 4 times

Upvotes: 1

flappix
flappix

Reputation: 2227

Here is a functional programming approach. In each recursion step the first occurrences of a consecutive repeat is found and after that eliminated from the search string. You keep record of the maximum of repeats found so far and pass it to each recursive call until no consecutive repeats are found anymore.

def find_repeats (pattern, seq, max_r=0):
    g = re.search(f'{pattern}({pattern})+', seq)
    if g:
        max_repeats = len ( g.group() ) / len(pattern)
        return find_repeats (pattern, seq.replace (g.group(), '', 1), max (max_r, max_repeats) )
    else:
        return max_r

print (find_repeats ('AGATC', sequence))

Upvotes: 0

damon
damon

Reputation: 15128

To find all sequences of AGATC as long as it's not at the end of the sequence, you can use:

>>> re.findall(r'AGATC\B', sequence)
['AGATC', 'AGATC', 'AGATC', 'AGATC']

From the python documentation on \B:

Matches the empty string, but only when it is not at the beginning or end of a word. This means that r'py\B' matches 'python', 'py3', 'py2', but not 'py', 'py.', or 'py!'. \B is just the opposite of \b, so word characters in Unicode patterns are Unicode alphanumerics or the underscore, although this can be changed by using the ASCII flag. Word boundaries are determined by the current locale if the LOCALE flag is used.

Upvotes: 1

Related Questions