frumbert

Reputation: 2427

regex to match repeating strings?

I have a dataset of lines in which, due to a code bug, strings are duplicated one or more times. Each string starts with a capital letter and often contains multiple words, then the whole string repeats. Some lines are fine and don't contain repeated text. For instance, the data could be

The quick brown fox jumps over the lazy dogThe quick brown fox jumps over the lazy dog
ApplesApplesApples
IBM AT Computer
Lamp ShadeLamp Shade
OrangesOranges
I am a Potato

I have found multiple regular expressions for finding repeated words that stop at a predefined boundary such as \b or \w - that's pretty easy.

Finding repeating phrases of a fixed length (e.g. two words that repeat, as in i am i am a potato), where there is a built-in boundary condition such as \w, is also relatively easy. I have found examples of that, such as \b(\w+(?:\s*\w*))\s+\1\b (demo https://regex101.com/r/4UIrxu/2). It fails when there are three repeats, as in i am i am i am a potato: it only finds the first occurrence.
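To make the three-repeat failure concrete, here is a minimal Python sketch. The question doesn't name a regex engine, so Python's re module is an assumption, and collapse_once is a hypothetical helper name. It replaces each match of the phrase pattern with its captured phrase in a single pass.

```python
import re

# The fixed-length phrase matcher quoted in the question.
PHRASE = re.compile(r"\b(\w+(?:\s*\w*))\s+\1\b")

def collapse_once(text):
    """Replace each 'phrase phrase' match with a single copy (one pass only)."""
    return PHRASE.sub(r"\1", text)

# Two repeats collapse cleanly.
print(collapse_once("i am i am a potato"))       # i am a potato
# With three repeats, only the first pair is matched, so one duplicate remains.
print(collapse_once("i am i am i am a potato"))  # i am i am a potato
```

The pattern only ever matches the captured phrase followed by exactly one more copy, which is why the third copy is left behind.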

My phrases contain one or more words so the above phrase matcher won't work.

Is it possible to tell an expression that its boundary is a condition that I make up, like a lowercase letter followed by an uppercase letter (as in the T in dogThe)? I can match that with \B[a-z][A-Z]\B. Could that then be used as a marker to test whether the previous portion was repeated? I wasn't able to modify the repeating phrase pattern to use this boundary condition, but maybe it is still possible.

Upvotes: 0

Views: 961

Answers (1)

Chris Lear

Reputation: 6752

This is very simple, but might provide a start:

/([A-Z].*)\1{1,}

See https://regex101.com/r/ynfuCO/1
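As a rough illustration of how this first pattern behaves, here is a minimal sketch in Python (my choice of engine, not something the question specifies). The helper name repeated_unit is hypothetical, and the sample lines are modeled on the question's data.

```python
import re
from typing import Optional

# A capital letter plus anything after it, captured as group 1,
# followed by one or more exact copies of that captured text.
REPEAT = re.compile(r"([A-Z].*)\1{1,}")

def repeated_unit(line: str) -> Optional[str]:
    """Return the text that repeats in the line, or None if nothing repeats."""
    match = REPEAT.search(line)
    return match.group(1) if match else None

samples = [
    "ApplesApplesApples",
    "IBM AT Computer",
    "Lamp ShadeLamp Shade",
    "OrangesOranges",
    "I am a Potato",
]
for line in samples:
    print(repr(line), "->", repr(repeated_unit(line)))
```

The greedy .* first grabs the whole line and then backtracks until the captured text is immediately followed by a copy of itself, so group 1 ends up holding a single copy (Apples, Lamp Shade, Oranges) and the clean lines yield None.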

This introduces the boundary condition:

/(?:^|(?<=[a-z]))([A-Z].*)\1{1,}

I've included start-of-line as well as a lowercase/Uppercase boundary, because that seems to match your requirements. See https://regex101.com/r/PBFDPY/2

The (?<=[a-z]) part is a positive look-behind (see e.g. https://www.regular-expressions.info/lookaround.html), which checks that the preceding character is a lowercase letter. You might need to adapt the character classes: I've just used [a-z] and [A-Z] for simplicity, but they only cover unaccented ASCII letters, which is often not adequate.
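Here is a minimal sketch of using the bounded pattern to clean a block of lines, assuming Python's re module. The re.MULTILINE flag is my addition so that ^ matches at the start of every line rather than only at the start of the text, and dedupe is a hypothetical helper name.

```python
import re

# Same repeat test as before, but the repeated text must start either at the
# beginning of a line or directly after a lowercase letter.
BOUNDED = re.compile(r"(?:^|(?<=[a-z]))([A-Z].*)\1{1,}", re.MULTILINE)

def dedupe(text: str) -> str:
    """Collapse each run of repeated text down to a single copy."""
    return BOUNDED.sub(r"\1", text)

data = "\n".join([
    "ApplesApplesApples",
    "IBM AT Computer",
    "Lamp ShadeLamp Shade",
    "OrangesOranges",
    "I am a Potato",
])
print(dedupe(data))
```

The boundary makes a difference for input such as XYZ FooFoo: the unbounded pattern matches FooFoo there, while this version leaves the line alone because the F follows a space rather than a lowercase letter or the start of the line.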

Upvotes: 3
