Ryguy
Ryguy

Reputation: 126

Looking for a python function to find the longest sequentially repeated substring in a string

I'm doing some coding with DNA sequences and I'm interested in a function to find sequential repeats (which could represent where primers could 'slip' AKA do bad stuff).

An example of what I'm interested in would be as follows:

longest_repeat('ATTTTCCATGATGATG')

which would output the repeat length and coordinates, in this case 9 long and 7:15. The function should have picked up the ATGATGATG at the end and since it is longer than the TTTT repeat and the TGATGA repeat, it would only report the ATGATGATG. In the case of ties, I'd like if it could report all the ties, or at least one of them.

It would also be nice to set a threshold to only report these sequential repeats if they're over a specific length.

I have some experience in python, but this specific question has me stumped, since if I code it inefficiently and put in a 50 character long string it could take forever. I appreciate all the help!

Upvotes: 3

Views: 772

Answers (2)

IoaTzimas
IoaTzimas

Reputation: 10624

The following will work pretty efficiently. It returns the longest sequence, its length, its starting index and its ending index. If there are multiple sequencies of max length, result will be a list of them. Second parameter in function longest(s, threshold) is the desired threshold-minimum length:

import numpy as np

def b(n): #it returns the factors of an integer. It will be used in next function
    r = np.arange(1, int(n ** 0.5) + 1)
    x = r[np.mod(n, r) == 0]
    return set(np.concatenate((x, n / x), axis=None))
   
def isseq(s): #it tests if a string is a sequence. Using the result from previous function it compares all smaller parts of the devided string to check if they are equal
    l=[int(p) for p in sorted(list(b(len(s))))[:-1]]
    for i in l:
        if len(set(s[k*i:i*(k+1)] for k in range(len(s)//i)))==1:
            return True
    return False

def longest(s, threshold): #the main function that returns the lenghtier sequense or a list of them if they are multiple, using a threshold as minimum length
    m=[]
    for i in range(len(s), threshold-1, -1):
        for k in range(len(s)-i+1):
            if isseq(s[k:k+i]):
                m.append([s[k:k+i], i, k, k+i-1])
        if len(m)>0:
            return m
    return False

Examples:

>>>s='ATTTTCCATGATGATGGST'
>>> longest(s, 1)
[['ATGATGATG', 9, 7, 15]]

>>> s='ATTTTCCATGATGATGGSTLWELWELWEGFRJGHIJH'
>>> longest(s, 1)
[['ATGATGATG', 9, 7, 15], ['LWELWELWE', 9, 19, 27]]


>>>s='ATTTTCCATGATGATGGSTWGTKWKWKWKWKWKWKWKWKWKWKWFRGWLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERFGTFRGFTRUFGFGRFGRGBHJ'
>>> longest(longs, 1)
[['LWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWER', 64, 48, 111]]

Upvotes: 2

Alex P
Alex P

Reputation: 1155

Here is a solution:

def longest_repeat(seq, threshold):
    results = []
    longest = threshold
    
    # starting position
    for i in range(len(seq)):
        
        # pattern period
        for p in range(1, (len(seq)-i)//2+1):
            # skip unecessary combinations
            if results != [] and results[-1][0] == i and results[-1][3] % p == 0: continue
            
            # max possible number of repetitions
            repetitions = len(seq)//p
            
            # position within the pattern's period
            for k in range(p):
                # get the max repetitions the k-th character in the period can support
                m = 1
                while i+k+m*p < len(seq) and seq[i+k] == seq[i+k+m*p]:
                    m += 1
                repetitions = min(m, repetitions)
                
                # check if we're already below the best result so far 
                if repetitions*p < longest:    break
            
            # save the result if it's good
            if repetitions > 1 and repetitions*p >= longest:
                # overwrite lesser results
                if repetitions*p > longest: results = []
                
                # store the current one (with ample information)
                results += [(i, seq[i:i+p], repetitions, repetitions*p)]
                longest = max(longest, repetitions*p)
    
    return results

The logic is that you run through each starting position in the sequence (i), you check every sensible pattern period (p) and for that combination you check if they result in a substring at least as good as the best one so far (or the threshold, if no result has been found yet).

The result is a list of tuples of the form (starting index, period string, repetitions, total length). Running your example

threshold = 5
seq = 'ATTTCCATGATGATG'

t = time.time()
results = longest_repeat(seq, threshold)
print("execution time :", time.time()-t)

for t in results:
    print(t)

we get

exec : 0.00010848045349121094
(6, 'ATG', 3, 9)

From there, it is trivial to get the full matched string (simply do period_string * repetitions)

For a random input of 700 characters, the execution time is ~6.8 seconds, compared to ~20.2 seconds using @IoaTzimas's answer.

Upvotes: 2

Related Questions