Regex to remove similar words by pattern

Question

I have a list like this in Notepad++

V - Visitors  2009 - S01e11-12.torrent
V - Visitors (2009) S02e04.torrent
V - Visitors (2009) S01e01-12.torrent
V S02e02.torrent
V S02e05.torrent
Valentina S01e01-13.torrent
Valeria Medico Legale S01-02e01-16.torrent
Veep - Season 1 BDMux.torrent
Veep - Season 2 BDMux.torrent
Veep - Season 3.torrent
Veep - Season 4.torrent
Vegas S01e01-21.torrent
Velvet S01e13.torrent
Velvet S01e15.torrent
Vikings.S03E03.torrent
Vikings.S03E04.torrent
Vikings.S03E05.torrent
Velvet_S03e02.torrent
Velvet_S03e03.torrent
Velvet_S03e04.torrent

I want a regex that deletes lines whose first one or two words repeat those of the preceding line (e.g. Veep - Veep), so that the final list looks like this:

V - Visitors  2009 - S01e11-12.torrent
V S02e02.torrent
Valentina S01e01-13.torrent
Valeria Medico Legale S01-02e01-16.torrent
Veep - Season 1 BDMux.torrent
Vegas S01e01-21.torrent
Velvet S01e13.torrent

So if I have

Veep - Season 1 BDMux.torrent
Veep - Season 2 BDMux.torrent

I want to keep only the first line:

Veep - Season 1 BDMux.torrent

Lars Fischer · Accepted Answer

Do a regular expression find/replace like this:

Open the Replace dialog
Find What: ^([^ _.-]+[ _.-]+([^ _.-]++)?)(.*?\R)(\1.*?\R)+
Replace With: \1\3
Select the Regular expression search mode
Click Replace or Replace All
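
To try the same find/replace outside Notepad++, here is a minimal Python sketch. It is an approximation, not the Notepad++ engine: Python's re module has no \R, so \n stands in for the line break, and the possessive [^ _.-]++ is emulated with a lookahead plus backreference so that it also runs on Python versions before 3.11. The function name keep_first_of_each_run and the shortened sample list are made up for illustration.

```python
import re

# Port of the Notepad++ pattern, under two assumptions:
#  - \R is replaced by \n (Python's re has no \R)
#  - [^ _.-]++ is emulated as (?=([^ _.-]+))\2, which keeps the
#    group numbers \1, \2, \3 the same as in the original pattern
PATTERN = re.compile(
    r"^([^ _.-]+[ _.-]+(?:(?=([^ _.-]+))\2)?)(.*?\n)(\1.*?\n)+",
    re.MULTILINE,
)


def keep_first_of_each_run(text: str) -> str:
    """Collapse each run of lines sharing a prefix down to its first line."""
    # Same replacement as in the dialog: the prefix (\1) followed by
    # the rest of the first line including its line break (\3).
    return PATTERN.sub(r"\1\3", text)


# Sorted excerpt of the list from the question; every line ends in \n.
sample = (
    "V - Visitors  2009 - S01e11-12.torrent\n"
    "V - Visitors (2009) S02e04.torrent\n"
    "V S02e02.torrent\n"
    "V S02e05.torrent\n"
    "Valentina S01e01-13.torrent\n"
    "Veep - Season 1 BDMux.torrent\n"
    "Veep - Season 2 BDMux.torrent\n"
    "Velvet S01e13.torrent\n"
    "Velvet S01e15.torrent\n"
)

print(keep_first_of_each_run(sample), end="")
```

Note that the pattern requires a line break after every line, so a final line without a trailing line break cannot be part of a match.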

Explanation

The precondition is that the file is sorted, so that lines sharing a prefix are adjacent.
The first part, ^([^ _.-]+[ _.-]+([^ _.-]++)?), matches the first word on a line, followed by one or more separators (" ", "_", "." or "-") and an optional second word.
- the first word is everything up to the first separator
- the second word, ([^ _.-]++)?, is optional to accommodate the Velvet example
- because of the parentheses, the first word, the separator and the optional second word are captured into \1, and what follows up to and including the line break is captured into \3 for later reuse
The (.*?\R) captures the rest of the line up to and including the line break (\R).
The last part, (\1.*?\R)+, matches all following lines that begin with whatever was captured in \1.
The match spans all of these lines; they are replaced with \1\3, which reconstructs only the first line and thus deletes the following lines.
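
To see what the groups capture, the following Python sketch lists \1, \3 and the number of lines spanned for each match. It rests on the same assumptions as any port to Python's re module: \n replaces \R, and the possessive second word is emulated with a lookahead and a backreference. The helper name describe_matches is hypothetical.

```python
import re

# Approximate port of the pattern: \n instead of \R, and
# (?=([^ _.-]+))\2 standing in for the possessive [^ _.-]++.
PATTERN = re.compile(
    r"^([^ _.-]+[ _.-]+(?:(?=([^ _.-]+))\2)?)(.*?\n)(\1.*?\n)+",
    re.MULTILINE,
)


def describe_matches(text: str) -> list[tuple[str, str, int]]:
    """Return (group 1, group 3, number of lines spanned) for every match."""
    return [
        (m.group(1), m.group(3), m.group(0).count("\n"))
        for m in PATTERN.finditer(text)
    ]


sample = (
    "Veep - Season 1 BDMux.torrent\n"
    "Veep - Season 2 BDMux.torrent\n"
    "Velvet S01e13.torrent\n"
    "Velvet S01e15.torrent\n"
)

for prefix, rest, lines in describe_matches(sample):
    print(repr(prefix), repr(rest), lines)
```

For the Veep lines the prefix includes the second word ("Veep - Season"). For the Velvet lines the attempt with the second word ("Velvet S01e13") finds no following line with that prefix, so the engine falls back to the prefix without it ("Velvet "), which is why the second word has to be optional.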
