Reputation: 3098
Say I have a string, like:
where is mummy where is daddy
I want to replace any set of repeating substrings with empty strings - so in this case the "where" and "is" elements would be removed and the resulting string would be:
mummy daddy
I was wondering if there was any single regex that could achieve this. The regex I tried (which doesn't work) looked like the following:
/(\w+)(?=.*)\1/gi
Where the first capture group is any set of word characters, the second is a positive lookahead to any set of characters (in order to prevent those characters from being included in the result) and then the \1 is a backreference to the first matched substring.
Any help would be great. Thanks in advance!
Upvotes: 7
Views: 4903
Reputation: 626861
Your regex does not work because the \w+ is not restricted with word boundaries, and the \1 backreference has to match right after the "original" word (the (?=.*) lookahead consumes no characters), which is almost never the case.
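To see the failure concretely, here is a minimal sketch that runs the pattern from the question against the sample string. The function name tryOriginalPattern is made up for illustration. The only things the pattern finds are characters doubled inside a word, because that is the one place where the captured text is immediately followed by itself.

```javascript
// Hypothetical helper name, used only for this illustration.
function tryOriginalPattern(str) {
  // (?=.*) is zero-width, so \1 must match immediately after Group 1.
  return str.replace(/(\w+)(?=.*)\1/gi, "");
}

console.log(tryOriginalPattern("where is mummy where is daddy"));
// "where is muy where is day"  (only "mm" and "dd" were matched)
```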
You need to first get the words that are dupes, and then build a RegExp to match them all with optional whitespace (or punctuation, etc. - adjust the pattern later) and replace with an empty string:
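var re = /(\b\w+\b)(?=.*\b\1\b)/gi; // Get the repeated whole words
var str = 'where is mummy where is daddy';
var patts = str.match(re) || []; // Collect the repeated words (match() returns null when there are none)
var res = patts.length
  ? str.replace(new RegExp("\\s*\\b(?:" + patts.join("|") + ")\\b", "gi"), "") // Build the pattern for replacing all found words
  : str; // Nothing is repeated, keep the string as is
document.body.innerHTML = res.trim(); // Drop the leading space left behind when the first word is removed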
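If you want this as something reusable, one possible way to package the two steps is sketched below. The function name removeRepeatedWords is hypothetical. The de-duplication of the collected words, the final trim(), and the use of [\s\S] instead of the dot (so repeats on a later line are found too) are my additions, not requirements of the approach.

```javascript
// Hypothetical helper; a sketch of the two-step approach, not a definitive implementation.
function removeRepeatedWords(str) {
  // Step 1: collect every whole word that occurs again somewhere later.
  // [\s\S]* is used instead of .* so that the lookahead can cross newlines.
  const dupes = str.match(/\b(\w+)\b(?=[\s\S]*\b\1\b)/gi);
  if (dupes === null) return str; // match() returns null when nothing repeats

  // Step 2: build one alternation from the unique words.
  // The words consist of \w characters only, so no regex escaping is needed.
  const unique = [...new Set(dupes.map((w) => w.toLowerCase()))];
  const remover = new RegExp("\\s*\\b(?:" + unique.join("|") + ")\\b", "gi");

  // trim() removes the whitespace left over at the start of the string.
  return str.replace(remover, "").trim();
}

console.log(removeRepeatedWords("where is mummy where is daddy")); // "mummy daddy"
```

A string with no repeated words is returned unchanged, since the null check skips the second step entirely.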
The first pattern is (\b\w+\b)(?=.*\b\1\b):
(\b\w+\b) - match and capture into Group 1 a whole word consisting of [A-Za-z0-9_] characters
(?=.*\b\1\b) - make sure the value captured into Group 1 is repeated somewhere to the right of the current location (not necessarily right after the word). If the string is multiline, use [\s\S] instead of the dot.
To make sure we match original and dupe words as whole words, \b word boundaries should be used around both \w+ and \1.
The second pattern will look different each time, but in your current scenario, it will be /\s*\b(?:where|is)\b/gi:
\s* - zero or more whitespace characters
\b(?:where|is)\b - a whole word from the alternation group (?:...|...): either where or is (case-insensitive due to the /i modifier).
Upvotes: 11