Super Sonic

Reputation: 116

Regex to remove similar words by pattern

I have a list like this in Notepad++

V - Visitors  2009 - S01e11-12.torrent
V - Visitors (2009) S02e04.torrent
V - Visitors (2009) S01e01-12.torrent
V S02e02.torrent
V S02e05.torrent
Valentina S01e01-13.torrent
Valeria Medico Legale S01-02e01-16.torrent
Veep - Season 1 BDMux.torrent
Veep - Season 2 BDMux.torrent
Veep - Season 3.torrent
Veep - Season 4.torrent
Vegas S01e01-21.torrent
Velvet S01e13.torrent
Velvet S01e15.torrent
Vikings.S03E03.torrent
Vikings.S03E04.torrent
Vikings.S03E05.torrent
Velvet_S03e02.torrent
Velvet_S03e03.torrent
Velvet_S03e04.torrent

I want a regex that deletes every line whose first and second words repeat those of the line before it (e.g. Veep - Veep), so that the final list looks like this

V - Visitors  2009 - S01e11-12.torrent
V S02e02.torrent
Valentina S01e01-13.torrent
Valeria Medico Legale S01-02e01-16.torrent
Veep - Season 1 BDMux.torrent
Vegas S01e01-21.torrent
Velvet S01e13.torrent

So if I have

Veep - Season 1 BDMux.torrent
Veep - Season 2 BDMux.torrent

I want to keep only the first line

Veep - Season 1 BDMux.torrent

Upvotes: 1

Views: 177

Answers (1)

Lars Fischer

Reputation: 10149

Do a regular expression find/replace like this:

  • Open Replace Dialog
  • Find What: ^([^ _.-]+[ _.-]+([^ _.-]++)?)(.*?\R)(\1.*?\R)+
  • Replace With: \1\3
  • Select Regular expression as the search mode
  • Click Replace or Replace All

Explanation

  • precondition is that the file is sorted
  • the first part ^([^ _.-]+[ _.-]+([^ _.-]++)?) matches the first word on a line, followed by one or more of the separator characters " ", "_", "." or "-"
    • the first word is everything that is not a separator
    • the second word (([^ _.-]++)?) is optional to accommodate the Velvet example, where the second words differ (S01e13 vs. S01e15) and only the first word plus separator is shared
    • because of the parentheses, the first word, the separator and the optional second word are captured into \1, and what follows up to and including the line break is captured into \3 for later reuse
  • the (.*?\R) captures the rest of the first line, up to and including the line break (\R)
  • the last part (\1.*?\R)+ matches all following lines that begin with whatever was captured in \1
  • the match spans all of these lines; they are replaced with \1\3, which reconstructs only the first line and thus deletes the following lines
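
For anyone who wants to try the same idea outside Notepad++, here is a rough Python translation of the pattern explained above. It is a sketch, not an exact port: Python's re module has no \R, so the line breaks are spelled out, and the possessive ++ is approximated with a greedy + followed by a negative lookahead (possessive quantifiers were only added in Python 3.11). The names NL, PATTERN and drop_repeated_prefix_lines are made up for this example.

```python
import re

# Python's re has no \R, so the possible line breaks are listed explicitly.
NL = r"(?:\r\n|\n|\r)"

# Approximate translation of the Notepad++ pattern.
# [^ _.\-]+(?![^ _.\-]) stands in for the possessive [^ _.-]++ :
# the optional second word is either taken whole or dropped entirely.
PATTERN = re.compile(
    r"^([^ _.\-]+[ _.\-]+([^ _.\-]+(?![^ _.\-]))?)"  # \1: first word, separator(s), optional second word
    r"(.*?" + NL + r")"                               # \3: rest of the first line incl. line break
    r"(\1.*?" + NL + r")+",                           # following lines that start with \1
    re.MULTILINE,
)


def drop_repeated_prefix_lines(text: str) -> str:
    """Keep only the first line of each run of lines sharing the prefix in \\1."""
    return PATTERN.sub(r"\1\3", text)


if __name__ == "__main__":
    sample = (
        "Veep - Season 1 BDMux.torrent\n"
        "Veep - Season 2 BDMux.torrent\n"
        "Vegas S01e01-21.torrent\n"
        "Velvet S01e13.torrent\n"
        "Velvet S01e15.torrent\n"
    )
    print(drop_repeated_prefix_lines(sample), end="")
```

As with the Notepad++ version, this assumes the list is sorted. Because every matched line must end with a line break, a final line without a trailing line break will not be recognised as a duplicate.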

Upvotes: 1
