Reputation: 1861
Given a string "A B C a b B"
I want to match words that are repeated (regardless of case). Expected result would be matching "a" and "b" (last occurrences of A and B) OR "A" and "B" (first occurrences)
EDIT: I want to match only the first or the last occurrence of the word
I know this question could be better answered by splitting the string and counting each token (after lowercasing it).
However, I'd like to try and formulate a regex to help me find those words, just for the sake of practice.
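For reference, the split-and-count baseline mentioned above could look roughly like the sketch below. It assumes words are separated by single spaces, and `RepeatedWords` is a hypothetical helper name chosen only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: returns the lowercased words that occur more than once,
// in order of first appearance. Assumes words are separated by spaces.
static List<string> RepeatedWords(string input)
{
    return input
        .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
        .GroupBy(word => word.ToLowerInvariant())   // case-insensitive grouping
        .Where(group => group.Count() > 1)          // keep only repeated words
        .Select(group => group.Key)
        .ToList();
}

Console.WriteLine(string.Join(", ", RepeatedWords("A B C a b B"))); // a, b
```

This reports which words repeat but not where they occur, which is the part the regex is meant to solve.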
My first attempt was: (?=\b(\w+)\b.*\b(\1)\b)(\1)
However it matches the first A, first B and second b (A B b).
I was thinking of somehow using a positive look-behind with a negative look-ahead to fetch the last instance of each repeated word: (?<=.*(?!.*(\w+).*)\1.*)\b\1\b
(In my head it translates to "a word that has been matched before and won't match again")
Well, it doesn't work for me unfortunately.
Is it possible to use positive look-behind and negative look-ahead this way?
Could my regex be fixed?
I've tried to solve it in C#.
This is not homework
Upvotes: 4
Views: 1107
Reputation: 51330
Interesting puzzle. Here's my solution:
(\b\w+\b)(?:(?=.*?\b\1\b)|(?<=\b\1\b.*?\1))
The reasoning is as follows:
Match a word: (\b\w+\b)
Then either: (?:...|...)
The word occurs again later: (?=.*?\b\1\b)
Or it already occurred before: (?<=\b\1\b.*?\1)
The second \1 in the lookbehind matches the word that was just matched. The first \1 is the real duplicate.
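A small C# sketch of how this pattern behaves on the string from the question. The pattern is copied from above; using RegexOptions.IgnoreCase is my assumption, taken from the "regardless of case" requirement, and the variable names are only illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Pattern copied from the answer. IgnoreCase is an assumption based on the
// question's "regardless of case" requirement; it also applies to \1.
var anyOccurrence = new Regex(
    @"(\b\w+\b)(?:(?=.*?\b\1\b)|(?<=\b\1\b.*?\1))",
    RegexOptions.IgnoreCase);

// Collect each match together with its index in the input.
var matches = anyOccurrence.Matches("A B C a b B")
    .Cast<Match>()
    .Select(m => $"{m.Value}@{m.Index}")
    .ToArray();

Console.WriteLine(string.Join(" ", matches)); // A@0 B@2 a@6 b@8 B@10
```

Every occurrence of a repeated word is matched, and "C" is skipped because it appears only once.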
Answer for the edited question:
If you only want to match the first occurrence of a duplicated word, we can change the above pattern a bit:
(\b\w+\b)(?=.*?\b\1\b)(?<!\b\1\b.*?\1)
Now the logic is:
Match a word: (\b\w+\b)
Make sure it occurs again later: (?=.*?\b\1\b)
And make sure it didn't occur before: (?<!\b\1\b.*?\1)
(the same as before, except with a negative lookbehind)
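A hedged C# sketch of the first-occurrence pattern on the question's input. RegexOptions.IgnoreCase is again an assumption. The `last` pattern is not part of the answer above: it is my own variation, obtained by swapping which lookaround is negated, to cover the "last occurrence" half of the question. `Values` is a hypothetical helper.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

const string input = "A B C a b B";

// First occurrence: pattern copied from the answer (IgnoreCase is an assumption).
var first = new Regex(
    @"(\b\w+\b)(?=.*?\b\1\b)(?<!\b\1\b.*?\1)",
    RegexOptions.IgnoreCase);

// Last occurrence: NOT from the answer. Assumed variation that requires an
// earlier occurrence and forbids a later one.
var last = new Regex(
    @"(\b\w+\b)(?!.*?\b\1\b)(?<=\b\1\b.*?\1)",
    RegexOptions.IgnoreCase);

// Hypothetical helper: returns the matched text of every match.
static string[] Values(Regex regex, string text) =>
    regex.Matches(text).Cast<Match>().Select(m => m.Value).ToArray();

Console.WriteLine(string.Join(" ", Values(first, input))); // A B
Console.WriteLine(string.Join(" ", Values(last, input)));  // a B
```

Both patterns depend on variable-length lookbehind, which the .NET regex engine supports but many other engines do not.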
Upvotes: 2