merkaba

Reputation: 31

Remove Similar Rows in Notepad++

I have a file with 225000 rows that contains many similar lines. I want to remove the similar lines, keeping only the first one of each "type". An example is below.

I'd like a file that looks like this:

./ACT_HERE_REPORT_MEMO_APPROVED_20180510_083000.log.gz
./ACT_HERE_REPORT_MEMO_APPROVED_20180512_083000.log.gz
./ACT_HERE_REPORT_MEMO_APPROVED_20180513_083000.log.gz
./ACT_HERE_REPORT_MEMO_APPROVED_20180515_083000.log.gz
./ACT_HERE_SOMETHING_MEMO_APPROVED_20180326.xls
./ACT_HERE_SOMETHING_MEMO_APPROVED_20180327.xls
./ACT_HERE_SOMETHING_MEMO_APPROVED_20180328.xls
./ACT_HERE_SOMETHING_MEMO_APPROVED_20180329.xls
./ACT_HERE_SOMETHING_MEMO_APPROVED_20180331.xls
./Archive/20150919-084501.SOMETHING
./Archive/20150922-084501.SOMETHING
./Archive/20150923-084500.SOMETHING
./Archive/20150924-084500.SOMETHING
./TEST/TEST.20170310.20170310-181017.txt.gz
./TEST/TEST.20170310.20170310-201023.txt.gz
./TEST/TEST.20170313.20170313-011035.txt.gz
./TEST/TEST.20170313.20170313-024006.txt.gz
./TEST/TEST.20170313.20170313-041018.txt.gz
./TEST/TEST.20180402-011024.log.gz
./TEST/TEST.20180402-011200.log.gz
./TEST/TEST.20180402-061113.log.gz
./TEST/TEST.20180402-081013.log.gz
./TEST/TEST.20180402-101012.log.gz

To end up like this:

./ACT_HERE_REPORT_MEMO_APPROVED_20180510_083000.log.gz
./ACT_HERE_SOMETHING_MEMO_APPROVED_20180326.xls
./Archive/20150919-084501.SOMETHING
./TEST/TEST.20170310.20170310-181017.txt.gz
./TEST/TEST.20180402-011024.log.gz

Upvotes: 3

Views: 49

Answers (1)

Toto

Reputation: 91518

  • Ctrl+H
  • Find what: ((^.+?)[-_.\d]+(\..+\R))(?:\2[-_.\d]+\3)+
  • Replace with: $1
  • check Wrap around
  • check Regular expression
  • UNCHECK . matches newline
  • Replace all

Explanation:

(                   # start group 1
  (                 # start group 2
    ^               # beginning of line
    .+?             # 1 or more any character but newline, not greedy
  )                 # end group 2
  [-_.\d]+          # 1 or more hyphen, underscore, dot or digit
  (                 # start group 3
    \.              # a dot
    .+              # 1 or more any character but newline
    \R              # any kind of linebreak
  )                 # end group 3
)                   # end group 1
(?:                 # non capture group
  \2                # backreference to group 2
  [-_.\d]+          # 1 or more hyphen, underscore, dot or digit
  \3                # backreference to group 3
)+                  # end group, must appear 1 or more times

Result for given example:

./ACT_HERE_REPORT_MEMO_APPROVED_20180510_083000.log.gz
./ACT_HERE_SOMETHING_MEMO_APPROVED_20180326.xls
./Archive/20150919-084501.SOMETHING
./TEST/TEST.20170310.20170310-181017.txt.gz
./TEST/TEST.20180402-011024.log.gz
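
For anyone who would rather script this than use the Replace dialog, here is a minimal sketch of the same idea in Python. It is an adaptation, not the Notepad++ behaviour itself: Python's re module has no \R, so \r?\n is used in its place, and re.MULTILINE is needed so that ^ matches at the start of every line. The function name keep_first_of_each_type and the shortened sample are only for illustration.

```python
import re

# Same pattern as above, with \R swapped for \r?\n (Python's re has no \R).
# re.MULTILINE makes ^ match at the start of each line, not just the text.
PATTERN = re.compile(
    r"((^.+?)[-_.\d]+(\..+\r?\n))(?:\2[-_.\d]+\3)+",
    re.MULTILINE,
)


def keep_first_of_each_type(text: str) -> str:
    """Collapse each run of consecutive similar lines to its first line."""
    return PATTERN.sub(r"\1", text)


# Shortened version of the example from the question.
sample = "\n".join([
    "./ACT_HERE_REPORT_MEMO_APPROVED_20180510_083000.log.gz",
    "./ACT_HERE_REPORT_MEMO_APPROVED_20180512_083000.log.gz",
    "./ACT_HERE_SOMETHING_MEMO_APPROVED_20180326.xls",
    "./ACT_HERE_SOMETHING_MEMO_APPROVED_20180327.xls",
    "./Archive/20150919-084501.SOMETHING",
    "./Archive/20150922-084501.SOMETHING",
    "./TEST/TEST.20170310.20170310-181017.txt.gz",
    "./TEST/TEST.20170310.20170310-201023.txt.gz",
    "./TEST/TEST.20180402-011024.log.gz",
    "./TEST/TEST.20180402-011200.log.gz",
]) + "\n"  # every line, including the last, ends with a linebreak

print(keep_first_of_each_type(sample), end="")
```

Two things to keep in mind, in the script and in Notepad++ alike: similar lines must be adjacent (sort the file first if they are not), and the last line needs a trailing linebreak, because group 3 includes the linebreak and would not match without it.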


Upvotes: 5
