Reputation: 10809
I am looking for an algorithm (possibly implemented in Python) able to find the most REPETITIVE sequence in a string. Where for REPETITIVE, I mean any combination of chars that is repeated over and over without interruption (tandem repeat).
The algorithm I am looking for is not the same as the "find the most common word" one. In fact, the repetitive block doesn't need to be the most common word (substring) in the string.
For example:
s = 'asdfewfUBAUBAUBAUBAUBAasdkBAjnfBAenBAcs'
> f(s)
'UBAUBAUBAUBAUBA' #the "most common word" algo would return 'BA'
Unfortunately, I have no idea on how to tackle this. Any help is very welcome.
UPDATE
A little extra example to clarify that I want to be returned the sequence with the most number of repetition, whatever its basic building block is.
g = 'some noisy spacer'
s = g + 'AB'*5 + g + '_ABCDEF'*2 + g + 'AB'*3
> f(s)
'ABABABABAB' #the one with the most repetitions, not the max len
Examples from @rici:
s = 'aaabcabc'
> f(s)
'abcabc'
s = 'ababcababc'
> f(s)
'ababcababc' #'abab' would also be a solution here
# since it is repeated 2 times in a row as 'ababcababc'.
# The proper algorithm would return both solutions.
Upvotes: 12
Views: 2600
Reputation: 26056
Here is a brute force algorithm that I wrote. Maybe it will be useful:
def find_most_repetitive_substring(string):
max_counter = 1
position, substring_length, times = 0, 0, 0
for i in range(len(string)):
for j in range(len(string) - i):
counter = 1
if j == 0:
continue
while True:
if string[i + counter * j: i + (counter + 1) * j] != string[i: i + j] or i + (counter + 1) * j > len(string):
if counter > max_counter:
max_counter = counter
position, substring_length, times = i, j, counter
break
else:
counter += 1
return string[position: position + substring_length * times]
Upvotes: 1
Reputation: 140
What you are searching for is an algorithm to find the 'largest' primitive tandem repeat in a string. Here is a paper describing a linear time algorithm to find all tandem repeats in a string and by extension all primitive tandem repeats. Gusfield. Linear Time Algorithms for Finding and Representing all Tandem Repeats in a String
Upvotes: 4
Reputation: 186
Here is the solution based on ((\w+?)\2+)
regex but with additional improvements:
import re
from itertools import chain
def repetitive(sequence, rep_min_len=1):
"""Find the most repetitive sequence in a string.
:param str sequence: string for search
:param int rep_min_len: minimal length of repetitive substring
:return the most repetitive substring or None
"""
greedy, non_greedy = re.compile(r'((\w+)\2+)'), re.compile(r'((\w+?)\2+)')
all_rep_seach = lambda regex: \
(regex.search(sequence[shift:]) for shift in range(len(sequence)))
searched = list(
res.groups()
for res in chain(all_rep_seach(greedy), all_rep_seach(non_greedy))
if res)
if not sequence:
return None
cmp_key = lambda res: res[0].count(res[1]) if len(res[1]) >= rep_min_len else 0
return max(searched, key=cmp_key)[0]
You can test it like so:
def check(seq, expected, rep_min_len=1):
result = repetitive(seq, rep_min_len)
print('%s => %s' % (seq, result))
assert result == expected, expected
check('asdfewfUBAUBAUBAUBAUBAasdkBAjnfBAenBAcs', 'UBAUBAUBAUBAUBA')
check('some noisy spacerABABABABABsome noisy spacer_ABCDEF_ABCDEFsome noisy spacerABABAB', 'ABABABABAB')
check('aaabcabc', 'aaa')
check('aaabcabc', 'abcabc', rep_min_len=2)
check('ababcababc', 'ababcababc')
check('ababcababcababc', 'ababcababcababc')
Key features:
((\w+)\2+)
and non-greedy ((\w+)\2+?)
regex;Upvotes: 7
Reputation: 92854
With combination of re.findall()
(using specific regex patten) and max()
functions:
import re
# extended sample string
s = 'asdfewfUBAUBAUBAUBAUBAasdkjnfencsADADADAD sometext'
def find_longest_rep(s):
result = max(re.findall(r'((\w+?)\2+)', s), key=lambda t: len(t[0]))
return result[0]
print(find_longest_rep(s))
The output:
UBAUBAUBAUBAUBA
The crucial pattern:
((\w+?)\2+)
:
(....)
- the outermost captured group which is the 1st captured group(\w+?)
- any non-whitespace character sequence enclosed into the 2nd captured group; +?
- quantifier, matches between one and unlimited times, as few times as possible, expanding as needed\2+
- matches the same text as most recently matched by the 2nd capturing groupUpvotes: 13