jor

Reputation: 158

Regex: Finding duplicates in string

https://regex101.com/r/kBxa7R/2

I have the following regex: \b(\w+) \b(?=.*\b\1)

I need to remove all duplicate tokens in a string. So for instance:

Mike Tyson 1. Street 1234 Vietnam ML(12534/97632) Mike Tyson 1234 1. Street Vietnam ML(12534/97632)

should result in:

Mike Tyson 1. Street 1234 Vietnam ML(12534/97632)

I already know why it fails, but I do not know how to fix it. I only look for \w+, and therefore "1." or "ML(156746/615893)" is not being found. But when I add these missing characters manually, or replace the whole capture with .+, the matches are no longer what I expect.
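To make the failure concrete, here is a minimal Python sketch (the helper names `is_word_only` and `is_non_space_run` are made up for illustration). It only checks which tokens can be captured whole by \w+ versus \S+; it does not attempt the deduplication itself.

```python
import re

# Tokens taken from the example string above.
tokens = ["Mike", "1.", "ML(12534/97632)"]

def is_word_only(token: str) -> bool:
    """True if the whole token consists of \\w characters only."""
    return re.fullmatch(r"\w+", token) is not None

def is_non_space_run(token: str) -> bool:
    """True if the whole token is a run of non-whitespace characters."""
    return re.fullmatch(r"\S+", token) is not None

for token in tokens:
    # "1." and "ML(12534/97632)" contain '.', '(', '/' and ')',
    # none of which \w can match, so \w+ cannot capture them whole.
    print(token, is_word_only(token), is_non_space_run(token))
```

Only "Mike" passes the \w+ check, while all three tokens are plain runs of non-whitespace.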

Can someone help?

Upvotes: 2

Views: 78

Answers (1)

anubhava

Reputation: 785128

You may use this regex:

(?<!\S)(\S+)\h+(?=(?:\S+\h+)*?\1(?!\S))

Updated RegEx Demo

RegEx Details:

  • (?<!\S): Negative lookbehind to assert that there is no non-whitespace character at the previous position
  • (\S+): Match 1+ non-whitespace characters and capture them in group #1
  • \h+: Match 1+ horizontal whitespace characters
  • (?=: Start positive lookahead
    • (?:\S+\h+)*?: Lazily match 0 or more groups, where each group consists of 1+ non-whitespace characters followed by 1+ horizontal whitespace characters
    • \1: Back reference for group #1
    • (?!\S): Must not be followed by a non-whitespace to avoid partial matches
  • ): End positive lookahead
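
As a rough way to try the pattern outside regex101, here is a Python sketch. Python's `re` module does not support \h, so `[ \t]` is substituted for it here; that substitution, and the function name `drop_earlier_duplicates`, are assumptions of this sketch and not part of the regex above.

```python
import re

# Same structure as the regex above, with \h replaced by [ \t]
# because Python's re module has no \h escape.
DUPLICATE_TOKEN = re.compile(
    r"(?<!\S)(\S+)[ \t]+(?=(?:\S+[ \t]+)*?\1(?!\S))"
)

def drop_earlier_duplicates(text: str) -> str:
    """Remove every token that occurs again later in the string."""
    return DUPLICATE_TOKEN.sub("", text)

sample = (
    "Mike Tyson 1. Street 1234 Vietnam ML(12534/97632) "
    "Mike Tyson 1234 1. Street Vietnam ML(12534/97632)"
)
print(drop_earlier_duplicates(sample))
# → Mike Tyson 1234 1. Street Vietnam ML(12534/97632)
```

Because the lookahead only checks for a later occurrence, the last copy of each token is the one that survives, so the result follows the token order of the second half of the input.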

Casimir has made a very good suggestion in the comments: using the (*SKIP) verb for PCRE flavors. According to regex101, this appears to be more efficient:

~(\S+) \h+ (*SKIP) (?= (?>\S+\h+)*? \1 (?!\S) )~x

Upvotes: 3
