Lisa K

Reputation: 51

REGEX for Phrase Repeated n Times?

I have users entering blocks of text and I'm trying to prevent them from repeating a phrase more than, say, 5 times. So this would be fine:

I like fish very much I like fish very much I like fish very much

so would this:

Marshmallows are yummy. Marshmallows are yummy. Marshmallows are yummy.

But this would not be:

I like fish very much I like fish very much I like fish very much I like fish very much I like fish very much I like fish very much I like fish very much I like fish very much

nor this:

Marshmallows are yummy. Marshmallows are yummy. Marshmallows are yummy. Marshmallows are yummy. Marshmallows are yummy. Marshmallows are yummy. Marshmallows are yummy. Marshmallows are yummy. Marshmallows are yummy. Marshmallows are yummy.

Ideally, it would also catch it even if it was entered like this:

I like fish very much
I like fish very much
I like fish very much
I like fish very much
I like fish very much
I like fish very much

I tried:

\b(\S.*\S)[ ,.]*\b(\1){5}

But it doesn't always work: the result depends on the phrase length, and it only seems to work if each sentence ends with a period.

Any ideas?

Upvotes: 3

Views: 264

Answers (1)

CertainPerformance

Reputation: 371233

Here's one possibility:

(\b\w.{3,49})\1{4}

It captures between 4 and 50 characters (starting with a word character at a word boundary) in a group, and checks whether that group is immediately followed by four more copies of itself, i.e. whether it occurs at least 5 times in a row.

https://regex101.com/r/tS6kHF/2

If the regex matches, the input contains a repeated phrase.
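As a rough illustration, here is how the pattern might be applied in Python. The function name `has_repeated_phrase` and the whitespace normalization step are assumptions added for this sketch, not part of the answer. The normalization is there because `.` does not match newlines by default, so repeats entered one per line would otherwise be missed.

```python
import re

# The pattern from the answer: a 4-50 character chunk that starts with a
# word character at a word boundary, followed by four more copies of itself.
REPEATED = re.compile(r"(\b\w.{3,49})\1{4}")


def has_repeated_phrase(text: str) -> bool:
    """Return True if the pattern finds a chunk repeated back to back.

    Assumption for this sketch: every run of whitespace (spaces, tabs,
    newlines) is collapsed to a single space first, so that line-separated
    repeats look the same as space-separated ones.
    """
    normalized = " ".join(text.split())
    return REPEATED.search(normalized) is not None


print(has_repeated_phrase("I like fish very much " * 3))   # False
print(has_repeated_phrase("I like fish very much " * 8))   # True
print(has_repeated_phrase("I like fish very much\n" * 6))  # True
```

Note that the captured chunk includes the separator after the phrase (the trailing space, or the period and space), which is what lets the backreference line up with the next copy.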

That said, this may not be a great idea, especially for large input strings. As you can see on the link, it takes a very large number of steps: for each starting position in the input (e.g., at the "h" of "hello"), the engine has to try a candidate chunk of every allowed length, up to the 50-character limit, and for each one check whether the text that follows repeats it. Then it moves on to the next character, "e", and does the same thing again, and so on. (You do need an upper limit, like 50 characters, or something; otherwise the computation time goes way up, e.g. from 8k steps to 74k steps.)

Depending on the situation, it may be computationally expensive, so it might be better to use another method to programmatically find repeating substrings.
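One possible non-regex approach, sketched here under some assumptions: treat the input as a list of words and look for stretches where each word equals the word a fixed number of positions earlier. The names `max_consecutive_repeats` and `is_spammy`, the lowercasing, and the `max_phrase_words` cap of 12 are all hypothetical choices for this example. The work done is roughly the number of words times `max_phrase_words`, with no backtracking.

```python
def max_consecutive_repeats(text: str, max_phrase_words: int = 12) -> int:
    """Return the largest number of back-to-back copies of any phrase
    that is 1 to max_phrase_words words long (0 for empty input)."""
    # Assumption: case-insensitive comparison, any whitespace separates words.
    words = text.lower().split()
    best = 1 if words else 0
    for size in range(1, max_phrase_words + 1):
        run = 0  # consecutive positions i where words[i] == words[i - size]
        for i in range(size, len(words)):
            if words[i] == words[i - size]:
                run += 1
                # A run of matches covers run + size words that repeat with
                # period `size`, i.e. (run + size) // size complete copies.
                best = max(best, (run + size) // size)
            else:
                run = 0
    return best


def is_spammy(text: str, limit: int = 5) -> bool:
    """True if some phrase occurs more than `limit` times in a row."""
    return max_consecutive_repeats(text) > limit


print(max_consecutive_repeats("I like fish very much " * 3))  # 3
print(is_spammy("I like fish very much\n" * 6))               # True
```

Because it compares whole words, punctuation stays attached to its word ("yummy." is one token), which is fine as long as the repeats are punctuated consistently.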

Upvotes: 2
