Reputation: 711
What is a suitable regular expression for finding all instances of all duplicate words in a whole text? I have seen many available solutions, but they do not work well in my case. For instance, the solutions presented here.
The solution that comes closest to my case is this regular expression:
(\b\w+\b)(?=[\s\S]*\b\1\b)
However, this regular expression does not match the last instance of each matched word. My requirement is to match all the instances in the whole text.
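To make the problem concrete, here is a minimal JavaScript sketch of the lookahead-only pattern run against a short made-up sentence (the sample string is an assumption for illustration, not part of my real text). A word matches only when the same word occurs again later, so the final occurrence of each duplicate is skipped.

```javascript
// Lookahead-only pattern from the question: a word is matched
// only if the same whole word appears again somewhere AFTER it.
const lookaheadOnly = /(\b\w+\b)(?=[\s\S]*\b\1\b)/g;

// Hypothetical sample text: "the" occurs three times, "and" twice.
const sample = "the cat and the dog and the bird";

const found = [...sample.matchAll(lookaheadOnly)].map(m => m[0]);
console.log(found); // [ 'the', 'and', 'the' ]
```

The last "and" and the last "the" are missing from the result, which is exactly the gap described above.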
Please note that my text may contain multiple strings separated by \n. There may be punctuation and special characters in the text as well. Also note that the duplicate words are not necessarily contiguous.
Consider the following text for testing. This text is extracted from this website.
RegExr was created by gskinner.com.
Edit the Expression & Text to see matches. Roll over matches or the expression for details. PCRE & JavaScript flavors of RegEx are supported. Validate your expression with Tests mode.
The side bar includes a Cheatsheet, full Reference, and Help. You can also Save & Share with the Community and view patterns you create or favorite in My Patterns. Explore results with the Tools below. Replace & List output custom results. Details lists capture groups. Explain describes your expression in plain English.
Upvotes: 1
Views: 76
Reputation: 1086
Try this:
/(?=\b(\w+)\b)(?:(?=[\s\S]+\b\1\b)|(?<=\b\1\b[\s\S]*))\w+/gi
(?=\b(\w+)\b) : a position that is followed by:
    \b : a word boundary.
    (\w+) : the first capturing group, \1:
        \w+ : one or more word characters.
    \b : a word boundary.
(?:(?=[\s\S]+\b\1\b)|(?<=\b\1\b[\s\S]*)) : a non-capturing group.
    (?=[\s\S]+\b\1\b) : a position that is followed by:
        [\s\S]+ : one or more characters.
        \b : a word boundary.
        \1 : the value from the first capturing group.
        \b : a word boundary.
    | : OR
    (?<=\b\1\b[\s\S]*) : a position that is preceded by:
        \b : a word boundary.
        \1 : the value from the first capturing group.
        \b : a word boundary.
        [\s\S]* : zero or more characters.
\w+ : one or more word characters.
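As a rough usage sketch, the pattern can be run with `String.prototype.matchAll`. This assumes a JavaScript engine with lookbehind support (ES2018 or later); the unbounded lookbehind `(?<=...[\s\S]*)` is not accepted by every regex flavor. The sample text below is made up for illustration and includes a newline, punctuation, and mixed case.

```javascript
// Pattern from this answer: a word matches if the same whole word
// occurs either later (lookahead) or earlier (lookbehind) in the text.
const allDuplicates = /(?=\b(\w+)\b)(?:(?=[\s\S]+\b\1\b)|(?<=\b\1\b[\s\S]*))\w+/gi;

// Hypothetical sample text with a newline and punctuation.
const text = "The cat and the dog.\nAnd the bird!";

// Collect each matched word together with its start offset.
const matches = [...text.matchAll(allDuplicates)].map(m => ({
  word: m[0],
  index: m.index,
}));

console.log(matches.map(m => m.word));  // [ 'The', 'and', 'the', 'And', 'the' ]
console.log(matches.map(m => m.index)); // [ 0, 8, 12, 21, 25 ]
```

Because of the `i` flag, the backreference `\1` compares case-insensitively, so "The"/"the" and "and"/"And" count as the same word. Drop the `i` flag if case should matter. Note also that `\w` includes digits and underscores, so tokens such as `foo_1` are treated as single words.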
See regex demo
Upvotes: 1