user3026965

Reputation: 703

Find and KEEP all DUPLICATE lines (instead of unique lines) in a text file

I want to identify and keep the DUPLICATE, TRIPLICATE, etc. lines, i.e., all lines that occur more than once, in Notepad++. In other words, how can I delete only the unique lines?

For example, here are seven (7) separate lists and the desired duplicate lines of each list. They are shown as 7 columns; regard each column as an individual list or file. (The lists are shown side by side only to save space. In real life, each of the 7 lists stands alone, independent of the others, as a separate file.)

list1  list2  list3  list4  list5  list6  list7
1      0      0      0      0      0      0
2      1      1      1      1      1      1
3      2      2      2      2      2      2
4      3      3      3      3      3      3
4      4      4      4      4      4      4
4      4      4      4      4      4      4
5      4      4      4      4      4      4
6      5      5      5      5      5      5
7      5      5      5      5      5      5
8      6      6      6      6      6      6
9      6      6      6      6      6      6
abc    7      7      7      7      7      7
abd    8      8      8      8      8      8
abd    9      9      9      9      9      9
abe           <CR>   9      9      9      9
                            <CR>   99     99
                                          <CR>

[Lines that occur more than once in the above lists:]
4      4      4      4      4      4      4
4      4      4      4      4      4      4
4      4      4      4      4      4      4
abd    5      5      5      5      5      5
abd    5      5      5      5      5      5
       6      6      6      6      6      6
       6      6      6      6      6      6
                     9      9      9      9
                     9      9      9      9

There are many solutions for eliminating duplicates (e.g., TextFX; notepad++ delete duplicate and original lines to keep unique lines), but I cannot find a solution that keeps only the duplicates.

((.*)\R(\2\R)+)*\K.+\R @Lars Fischer: This expression works nearly OK, except that the last entry of the (presorted) list needs to be a unique line followed by an empty line <CR>. One (suboptimal) workaround is to insert an artificial (helper) unique line (e.g., zzz) followed by an empty line <CR> as the last two lines.

(END OF QUESTION)


UPDATE 3: This question is reposted per the Stack Overflow "ask a new question" instruction. (@AdrianHHH, @B. Desai, @Paolo Forgia, @greg-449, @Erik von Asmuth drew the incorrect conclusion that this question is a duplicate of notepad++ delete duplicate and original lines to keep unique lines. This question is definitely not a duplicate of the one @AdrianHHH et al. quote.)

UPDATE 2: @AdrianHHH This question is neither broader (in fact, it can hardly be more specific) nor less researched than other Notepad++ questions, including the one (https://stackoverflow.com/questions/29303148) wrongly cited by @AdrianHHH et al. as the same question.

UPDATE: @AdrianHHH, @B. Desai, @Paolo Forgia, @greg-449, @Erik von Asmuth This question is different from https://stackoverflow.com/questions/29303148 because Q 29303148 (i) does not ask how to identify and keep only the lines of multiple occurrence, and (ii) has no solution for that in its answers. Q 29303148 asks "...I just need the unique lines."

Upvotes: 10

Views: 11156

Answers (1)

Lars Fischer

Reputation: 10149

Here is a solution based on regular expressions and bookmarks. It works for a sorted file (i.e., each duplicated line is immediately followed by its duplicates):

  • Open the Mark dialog (Search -> Mark...)
  • Click Clear all marks on the right
  • Check Bookmark line
  • Check Wrap around
  • Find What: ((.*)\R(\2\R?)+)*\K.*
  • Check regular expression and uncheck . matches newline
  • Mark All
  • Click Close
  • Search -> Bookmark -> Remove Bookmarked Lines
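For comparison outside Notepad++, the net effect of the steps above on a presorted file can be sketched in Python. This is only an illustration under the same assumption as the answer (equal lines are adjacent); the function name keep_adjacent_duplicates is hypothetical and not part of any tool mentioned here.

```python
from itertools import groupby


def keep_adjacent_duplicates(lines):
    """Return only the lines that belong to a run of two or more equal lines.

    Assumes the input is sorted (or at least that equal lines are adjacent),
    mirroring the precondition of the Notepad++ procedure.
    """
    kept = []
    for _, run in groupby(lines):
        run = list(run)
        if len(run) > 1:  # a run of length 1 is a unique line: drop it
            kept.extend(run)
    return kept


# list1 from the question
list1 = ["1", "2", "3", "4", "4", "4", "5", "6", "7", "8", "9",
         "abc", "abd", "abd", "abe"]
print(keep_adjacent_duplicates(list1))
# prints ['4', '4', '4', 'abd', 'abd']
```

Because whole lines are compared rather than matched by a pattern, the last line of the file needs no special treatment in this sketch.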

Explanation

The regular expression is made up of three parts:

  • ((.*)\R(\2\R?)+)* : this is an optional sequence of one or more blocks of duplicated lines

    • the outer ( ... )* matches zero or more such blocks of duplicated lines (if, in your example, the three 4s were followed by two 5s, we would need this notion of a sequence of duplicate blocks)
    • (.*)\R(\2\R?)+: \2 references the content of (.*), so this matches one line and all of its duplicates
    • the second \R is an optional (due to the ?) line break. This makes it possible to match a duplicate in the last line of the file even if that line does not end with a line break

    If there is a block of duplicated lines after the cursor position from which you start, this will match it.

  • now \K discards what we have matched so far (the duplicates) and "puts the cursor" before the first unique line

  • .* matches the next (unique) line and bookmarks it

Using Mark All, we bookmark all such unique lines, so that we can remove them using the Remove Bookmarked Lines entry in the Search -> Bookmark menu.
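The backreference part of the expression can be tried in isolation with Python's standard re module. This is a rough sketch, not a port of the Notepad++ expression: re supports neither \R nor \K, so the text is first normalized to "\n" line endings with a trailing line break, and the sketch returns the duplicate blocks directly instead of discarding them and marking the unique lines. The names DUPLICATE_BLOCK and duplicate_blocks are hypothetical.

```python
import re

# ^(.*)\n     one whole line, captured as group 1
# (?:\1\n)+   one or more following lines identical to group 1
DUPLICATE_BLOCK = re.compile(r"^(.*)\n(?:\1\n)+", re.MULTILINE)


def duplicate_blocks(text):
    """Return each block of adjacent identical lines as one string."""
    # Normalize line endings and make sure the last line ends with "\n",
    # so a duplicate on the last line is matched like any other line.
    normalized = "\n".join(text.splitlines()) + "\n"
    return [m.group(0) for m in DUPLICATE_BLOCK.finditer(normalized)]


print(duplicate_blocks("3\n4\n4\n4\n5\n5\n6"))
# prints ['4\n4\n4\n', '5\n5\n']
```

Requiring a line break after every copy, together with the ^ anchor, means only whole lines are compared; normalizing the end of the text takes the place of the optional \R? in the Notepad++ expression.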

Upvotes: 16
