hamvee
hamvee

Reputation: 141

How to catch the longest sequence of a group

The task is to find the longest sequence of a group

for instance, given DNA sequence: "AGATCAGATCTTTTTTCTAATGTCTAGGATATATCAGATCAGATCAGATCAGATCAGATC" and it has 7 occurrences of AGATC. (AGATC) matches all occurrences. Is it possible to write a regular expression that catches only the longest sequence, i.e. AGATCAGATCAGATCAGATCAGATC in the given text? If this is not possible only with regex, how can I iterate through each sequence (i.e. 1st sequence is AGATCAGATC, 2nd - AGATCAGATCAGATCAGATCAGATC et cetera) in python?

Upvotes: 4

Views: 1781

Answers (3)

Cary Swoveland
Cary Swoveland

Reputation: 110725

The central question is, "Is it possible to write a regular expression that catches only the longest sequence?" The answer is "yes":

import re

s = 'AGATC_AGATCAGATC_AGATCAGATCAGATC_AGATC_AGATCAGATC'

m = re.search(r'((?:AGATC)+)(?!.*\1)', s)
print m.group() if m else ''
  #=> "AGATCAGATCAGATC"

Regex demo<¯\(ツ)>Python demo

Python's regex engine performs the following operations.

(            begin capture group 1
  (?:AGATC)  match 'AGATC' in a non-capture group
  +          execute the non-capture group 1+ times
)            end capture group 1
(?!          begin a negative lookahead
  .*         match 0+ characters
  \1         match the content of capture group 1
)            end the negative lookahead

For the string s above, AGATC would first be matched but the negative lookahead would find AGATC as the first part of AGATCAGATC, so the tentative match would be rejected. Then AGATCAGATC would be matched, but the negative lookahead would find AGATCAGATC as the first part of AGATCAGATCAGATC so that tentative match would also be rejected. Next, AGATCAGATCAGATC would be matched and accepted, as the negative lookahead would not find that match later in the string. (re.findall, unlike re.search, would also match AGATCAGATC at the end of the string.)

If re.findall were used there may be multiple matches after the longest one (see the last test string at the link to the regex demo), but the lengths of the matches are non-decreasing from the first to the last. Therefore, the first match, obtained using re.search is a longest match.

Upvotes: 2

Shubham Sharma
Shubham Sharma

Reputation: 71687

Use:

import re

sequence = "AGATCAGATCTTTTTTCTAATGTCTAGGATATATCAGATCAGATCAGATCAGATCAGATC"
matches = re.findall(r'(?:AGATC)+', sequence)

# To find the longest subsequence
longest = max(matches, key=len)

Explanation:

Non-capturing group (?:AGATC)+

  • + Quantifier — Matches between one and unlimited times, as many times as possible.
  • AGATC matches the characters AGATC literally (case sensitive)

Result:

# print(matches)
['AGATCAGATC', 'AGATCAGATCAGATCAGATCAGATC']

# print(longest)
'AGATCAGATCAGATCAGATCAGATC'

You can test the regex here.

Upvotes: 4

RootTwo
RootTwo

Reputation: 4418

Use re.finditer() to iterate over all matches. Then use max() with a key function to find the longest. Make it a function so you can use different groups.

import re

def find_longest(sequence, group):
    # build pattern
    pattern = fr"(?:{group})+"

    # iterate over all matches
    matches = (match[0] for match in re.finditer(pattern, sequence))

    # find the longest
    return max(matches, key=len)

seq = "AGATCAGATCTTTTTTCTAATGTCTAGGATATATCAGATCAGATCAGATCAGATCAGATC"

find_longest(seq, "AGATC")

Upvotes: 1

Related Questions