Reputation: 51
I have users entering blocks of text and I'm trying to prevent them from repeating a phrase more than, say, 5 times. So this would be fine:
I like fish very much I like fish very much I like fish very much
so would this:
Marshmallows are yummy. Marshmallows are yummy. Marshmallows are yummy.
But this would not be:
I like fish very much I like fish very much I like fish very much I like fish very much I like fish very much I like fish very much I like fish very much I like fish very much
nor this:
Marshmallows are yummy. Marshmallows are yummy. Marshmallows are yummy. Marshmallows are yummy. Marshmallows are yummy. Marshmallows are yummy. Marshmallows are yummy. Marshmallows are yummy. Marshmallows are yummy. Marshmallows are yummy.
Ideally, it would also catch it even if it was entered like this:
I like fish very much
I like fish very much
I like fish very much
I like fish very much
I like fish very much
I like fish very much
I tried:
\b(\S.*\S)[ ,.]*\b(\1){5}
But it doesn't always work, depending on the phrase length, and it only seems to work if each sentence ends with a period.
Any ideas?
Upvotes: 3
Views: 264
Reputation: 371233
Here's one possibility:
(\b\w.{3,49})\1{4}
It captures between 4 and 50 characters (starting with a word character) in a group, and checks whether that group occurs at least 5 times in a row (the captured text followed by four more copies of it).
https://regex101.com/r/tS6kHF/2
If the regex matches, the input contains a repeated phrase.
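As a rough illustration, here is how that pattern could be applied in Python. The question doesn't say which language is in use, so Python is an assumption, and the names `REPEAT_RE` and `has_repeated_phrase` are made up for this sketch. The `re.DOTALL` flag is also an addition: without it, `.` does not match a newline, so the one-phrase-per-line input from the question would not be caught.

```python
import re

# The pattern from above: a group of 4 to 50 characters that starts at a
# word boundary, followed by four more copies of itself (5 in a row).
# re.DOTALL is an assumption added here so that "." also matches "\n",
# which lets the group span the line break in one-phrase-per-line input.
REPEAT_RE = re.compile(r"(\b\w.{3,49})\1{4}", re.DOTALL)


def has_repeated_phrase(text: str) -> bool:
    """Return True if some 4-50 character chunk occurs 5+ times in a row."""
    return REPEAT_RE.search(text) is not None


print(has_repeated_phrase("I like fish very much " * 3))   # False
print(has_repeated_phrase("I like fish very much " * 8))   # True
print(has_repeated_phrase("I like fish very much\n" * 6))  # True
```

Note that this flags 5 consecutive occurrences. If only "more than 5" should be rejected, changing `\1{4}` to `\1{5}` should do it.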
That said, this may not be a great idea, especially for large input strings. As you can see on the link, it takes a very large number of steps: at each position where a word starts (e.g., at the "h" of "hello"), the engine has to consider every candidate group of 4 to 50 characters ("hell", "hello", "hello ", and so on) and, for each one, check whether the text that follows repeats it. Then it moves on to the next starting position and does the same thing again. (You do need an upper limit, such as 50 characters. Without one, the computation time goes way up, e.g., from 8k steps to 74k steps.)
Depending on the situation, this may be computationally expensive, so it might be better to use another method to programmatically find repeating substrings.
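One possible non-regex approach, sketched in Python under some assumptions: phrases are compared word by word (case-insensitive, punctuation and line breaks ignored) rather than character by character, and a phrase is at most `max_phrase_words` words long. The function names and the default of 12 words are hypothetical, not something from the question. For each phrase length `n`, a single pass counts how long `words[j] == words[j + n]` keeps holding, so the work is roughly (number of words) x (maximum phrase length).

```python
import re


def max_consecutive_repeats(text: str, max_phrase_words: int = 12) -> int:
    """Largest number of back-to-back repetitions of any phrase of
    1..max_phrase_words words (case-insensitive, punctuation ignored)."""
    words = re.findall(r"\w+", text.lower())
    best = 1 if words else 0
    for n in range(1, max_phrase_words + 1):
        run = 0  # consecutive positions j where words[j] == words[j + n]
        for j in range(len(words) - n):
            if words[j] == words[j + n]:
                run += 1
                # `run` matches in a row mean n + run words repeat with
                # period n, i.e. an n-word phrase occurs (n + run) // n times.
                best = max(best, (n + run) // n)
            else:
                run = 0
    return best


def repeats_too_often(text: str, limit: int = 5) -> bool:
    """True if some phrase is repeated more than `limit` times in a row."""
    return max_consecutive_repeats(text) > limit


print(max_consecutive_repeats("I like fish very much " * 3))     # 3
print(repeats_too_often("Marshmallows are yummy. " * 10))        # True
print(repeats_too_often("I like fish very much\n" * 6))          # True
```

Because it works on words, this also treats a single repeated word ("very very very ...") as a repeated phrase, which may or may not be what you want.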
Upvotes: 2