Huzaifa
Huzaifa

Reputation: 532

How to find repeated substring in a string using regular expressions in Python?

I am trying to find the longest consecutive chain of repeated DNA nucleotides in a DNA sequence. The DNA sequence is a string. So, for example, if I have "AGA", I would want to know the length of the longest consecutive chain of repeats of "AGA" in the chain.

I am thinking of using regular expressions to extract all the chains of repeats of the nucleotides and store them in a list (using re.findall()). Then simply find the longest chain out of them, take its length and divide it by the length of the sequence of nucleotides.

What regular expression can I write for this? I was thinking, for example [AGA]+, but it would identify substrings with A or G or A. I want something similar, so that it identifies "AGA" and its repeats.

Note: if the sequence is AATGAGAAGAAGATCCTAGAAGAAGAAGAAGACGAT, there are two chains of consecutive "AGA", one of length 3 and the other of length 5. The longest chain is therefore of length 5.

Upvotes: 1

Views: 2320

Answers (3)

Cary Swoveland
Cary Swoveland

Reputation: 110755

You could use the first match the following regular expression:

r'((?:AGA)+)(?!.*\1)'

Python code <¯\(ツ)> Start your engine!

Python's regex engine performs the following operations.

(          : begin capture group 1
  (?:AGA)  : match 'AGA' in a non-capture group
  +        : execute non-capture group 1+ times
)          : end capture group 1
(?!        : begin negative lookahead
  .*       : match any character other than line terminators 0+ times 
  \1       : match contents of capture group 1
)          : end negative lookahead

This rejects a candidate string of "AGA"'s if there is another string of "AGA"'s later in the string that is at least as long as the candidate string.

There may well be multiple matches. If, for example, the string were

AGAAGAAGATAGATAGAAGATAGA
^^^^^^^^^     ^^^^^^ ^^^

there would be, as I indicated by the party hats, three matches. As the matches are always non-decreasing in length from left to right, no match will be longer than the first match. We therefore may select the first match.

If one wanted to identify all longest matches (should there be more than one having the longest length), one could use the above regex to obtain a match of, say, four 'ABA‘s, and then match the string with the regex r'(?:ABA){4}'.

Upvotes: 1

pecey
pecey

Reputation: 691

This is another way to do find matching subsequences.

re.findall("(?:AGA)+", "AATGAGAAGAAGATCCTAGAAGAAGAAGAAGACGAT")

Upvotes: 0

Andrej Kesely
Andrej Kesely

Reputation: 195613

You can use expression ((AGA)\2*) (regex101):

For example:

s = 'AATGAGAAGAAGATCCTAGAAGAAGAAGAAGACGAT'

to_find = 'AGA'

m = max(re.findall(r'(({})\2*)'.format(to_find), s), key=lambda k: k[0])[0]
print(m, len(m) // len(to_find))

Prints:

AGAAGAAGAAGAAGA 5

Upvotes: 1

Related Questions