Reputation: 81

Regex, removing duplicate non interrupted strings

I recently tried to write a regex that collapses runs of identical strings, i.e. strings that follow one another without being interrupted by a different string, so that only one copy of each run remains. My work so far: https://regex101.com/r/Cs0bmY/7. It should work with all possible URLs, including ones that have no www. prefix or that end in a different TLD such as .com or .nl. The strings (a list of URLs) look like this:

operator.livrareflori.md
operator.livrareflori.md
operator.livrareflori.md
operator.livrareflori.md
operator.livrareflori.md
operator.livrareflori.md
operator.livrareflori.md
operator.livrareflori.md
operator.livrareflori.md
operator.livrareflori.md
operator.livrareflori.md
operator.livrareflori.md
amazon.de
fonts.gstatic.com
fonts.gstatic.com
fonts.gstatic.com
erovoyeurism.net
tugtechnologyandbusiness.com

The end result should look like this:

operator.livrareflori.md
amazon.de
fonts.gstatic.com
erovoyeurism.net
tugtechnologyandbusiness.com

You can see that the duplicate strings which are not interrupted by another string are gone and only one copy of each stays.

Upvotes: 1

Answers (4)

vks

Reputation: 67968

((?:https?://)?(?:www\.)?\S+\.\S+)\s(?=[\s\S]*\1)

You can try this. See demo.

https://regex101.com/r/Cs0bmY/11

Upvotes: 1

jaytea

Reputation: 1949

The trick is to capture the line and use a lookahead to verify that it exists later in the subject. This expression matches every line that occurs again further down, so substituting each match with "" keeps only the last occurrence of each line:

(?s)^((?:https?://)?(?:www\.)?\S+\.\S+)\n(?=.*^\1$)

https://regex101.com/r/Cs0bmY/10
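
A minimal Python sketch of this approach, assuming LF line endings. The name `keep_last` is hypothetical, and multiline mode is switched on explicitly because `^` and `$` only match at line boundaries when it is set. Note that the lookahead scans the whole rest of the input, so a line is also removed when its repeat comes after other lines, which goes beyond the uninterrupted runs asked about in the question:

```python
import re

# The answer's pattern. DOTALL corresponds to (?s); MULTILINE is added here
# so that ^ and $ match at every line boundary.
DUPLICATE_LINE = re.compile(
    r"^((?:https?://)?(?:www\.)?\S+\.\S+)\n(?=.*^\1$)",
    re.DOTALL | re.MULTILINE,
)


def keep_last(text: str) -> str:
    """Delete every line that occurs again later in the text."""
    return DUPLICATE_LINE.sub("", text)


# Adjacent repeat: the first copy is deleted, the last one stays.
print(keep_last("fonts.gstatic.com\nfonts.gstatic.com\namazon.de"))

# Separated repeat: the first amazon.de is deleted as well.
print(keep_last("amazon.de\nfonts.gstatic.com\namazon.de"))
```

If only uninterrupted runs should be collapsed, a pattern that requires the repeat on the very next line is the safer choice.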

Upvotes: 1

Toto

Reputation: 91415

Using Notepad++, you can do:

Ctrl+H
Find what: ^(.+)$(?:\R\1)+
Replace with: $1
check Wrap around
check Regular expression
DO NOT CHECK . matches newline
Replace all

Explanation:

^(.+)$      : group 1, a whole line
(?:         : non capture group
    \R      : any kind of line break
    \1      : backreference to group 1
)+          : group must appear 1 or more times

Replacement:

$1          : content of group 1

Result for given example:

operator.livrareflori.md
amazon.de
fonts.gstatic.com
erovoyeurism.net
tugtechnologyandbusiness.com
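
For use outside Notepad++, the annotated pattern above can be written almost verbatim as a verbose regex in Python. This is a sketch under assumptions: Python's `re` module has no `\R`, so `\n` stands in for it (LF line endings assumed), the replacement is spelled `\1` rather than `$1`, and the name `collapse_runs` is hypothetical:

```python
import re

RUN_OF_LINES = re.compile(
    r"""
    ^(.+)$      # group 1, a whole line
    (?:         # non-capturing group
        \n      # a line break (LF only; an assumption of this sketch)
        \1      # backreference to group 1
    )+          # group must appear 1 or more times
    """,
    re.MULTILINE | re.VERBOSE,
)


def collapse_runs(text: str) -> str:
    """Replace each run of identical consecutive lines with one copy."""
    return RUN_OF_LINES.sub(r"\1", text)


SAMPLE = (
    "operator.livrareflori.md\n" * 12
    + "amazon.de\n"
    + "fonts.gstatic.com\n" * 3
    + "erovoyeurism.net\n"
    + "tugtechnologyandbusiness.com"
)

print(collapse_runs(SAMPLE))
```

`re.MULTILINE` plays the role of Notepad++'s per-line `^` and `$`, and leaving out `re.DOTALL` matches the "do not check . matches newline" step.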

Upvotes: 1

CertainPerformance

Reputation: 370769

You can match

^(.+)$(?:\n\1)+

which captures the first line and matches the duplicate lines that immediately follow it, and then replace everything matched with the first capture group:

\1

(or the equivalent keyword for the first group in whatever environment you're in)

https://regex101.com/r/Cs0bmY/8
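
A short Python sketch of this substitution, assuming LF line endings. It also illustrates an edge case: because the backreference is not anchored at the end of the line, a following line that merely starts with the captured text (for example `a.com` followed by `a.com.au`) is partly consumed. The `ANCHORED` variant with an extra `$` is an assumption of this sketch, not part of the answer:

```python
import re

# Pattern as posted; MULTILINE makes ^ and $ match at line boundaries.
AS_POSTED = re.compile(r"^(.+)$(?:\n\1)+", re.MULTILINE)

# Assumed variant: the extra $ forces every repeat to be a complete line.
ANCHORED = re.compile(r"^(.+)$(?:\n\1$)+", re.MULTILINE)

sample = "a.com\na.com\na.com.au\nb.nl"

# The match swallows the "a.com" prefix of the third line, so the
# standalone a.com line disappears from the result.
print(AS_POSTED.sub(r"\1", sample))

# With the anchor, only the exact repeat is collapsed.
print(ANCHORED.sub(r"\1", sample))
```

On input where no line is a prefix of the line after it, such as the list in the question, both patterns give the same result.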

Upvotes: 1
