Reputation: 158
https://regex101.com/r/kBxa7R/2
I have the following regex: \b(\w+) \b(?=.*\b\1)
I need to remove all duplicates in a string. So for instance:
Mike Tyson 1. Street 1234 Vietnam ML(12534/97632) Mike Tyson 1234 1. Street Vietnam ML(12534/97632)
should result in:
Mike Tyson 1. Street 1234 Vietnam ML(12534/97632)
I already know why it fails, but I do not know how to fix it. I only look for \w+ and therefore "1." or "ML(156746/615893)" is not being found. But when I add these missing characters manually or replace the whole statement with .+ weird stuff happens.
Can someone help?
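A minimal sketch to reproduce the problem, assuming Python's re module as the engine (the regex101 link may use a different flavor). The names ORIGINAL and strip_word_duplicates are made up for illustration:

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(r"\b(\w+) \b(?=.*\b\1)")

text = ("Mike Tyson 1. Street 1234 Vietnam ML(12534/97632) "
        "Mike Tyson 1234 1. Street Vietnam ML(12534/97632)")

def strip_word_duplicates(s):
    # Delete every "word + space" that occurs again later in the string.
    # Tokens such as "1." or "ML(12534/97632)" are never matched, because
    # \w+ must be followed directly by a space.
    return ORIGINAL.sub("", s)

print(strip_word_duplicates(text))
```

In this sketch the plain words in the first half are removed, but "1." and "ML(12534/97632)" stay behind, so both still appear twice in the result.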
Upvotes: 2
Views: 78
Reputation: 785128
You may use this regex:
(?<!\S)(\S+)\h+(?=(?:\S+\h+)*?\1(?!\S))
RegEx Details:
(?<!\S) : Lookbehind to assert that the previous position is not a non-whitespace character
(\S+) : Match 1+ non-whitespace characters and capture them in group #1
\h+ : Match 1+ horizontal whitespace characters
(?= : Start positive lookahead
(?:\S+\h+)*? : Lazily match 0 or more groups, where each group consists of 1+ non-whitespace characters followed by 1+ whitespace
\1 : Backreference to group #1
(?!\S) : Must not be followed by a non-whitespace character, to avoid partial matches
) : End positive lookahead
Casimir has made a very good suggestion in the comments: using the verb (*SKIP) for PCRE flavors as well. This appears to be more efficient according to the regex101 website:
~(\S+) \h+ (*SKIP) (?= (?>\S+\h+)*? \1 (?!\S) )~x
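For anyone who wants to try the first pattern outside regex101, here is a rough sketch in Python. It rests on two assumptions: Python's built-in re module does not support \h, so [ \t] is substituted as an approximation, and the (*SKIP) variant is left out because re does not support that verb. The names DEDUP and remove_earlier_duplicates are invented for this example:

```python
import re

# First pattern from the answer, with \h replaced by [ \t] because
# Python's re has no \h escape (an approximation, not an exact equivalent).
DEDUP = re.compile(r"(?<!\S)(\S+)[ \t]+(?=(?:\S+[ \t]+)*?\1(?!\S))")

def remove_earlier_duplicates(s):
    # Delete each whitespace-delimited token (plus its trailing whitespace)
    # when the same whole token appears again later in the string.
    return DEDUP.sub("", s)

text = ("Mike Tyson 1. Street 1234 Vietnam ML(12534/97632) "
        "Mike Tyson 1234 1. Street Vietnam ML(12534/97632)")

print(remove_earlier_duplicates(text))
```

Because the lookahead removes a token whenever it reappears later, it is the last occurrence of each token that is kept. For the sample string the surviving tokens therefore follow the order of the second half, not the first.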
Upvotes: 3