Find and KEEP all DUPLICATE lines (instead of unique lines) in a text file

Question

I am aiming to identify and keep DUPLICATE, TRIPLICATE, etc. lines, i.e., all lines that occur more than once in Notepad++? In other words, how can I delete all unique lines only?

For example, here are seven (7) separate lists and the desired true duplicate lines of each lists (shown as 7 columns, regard each column as an individual list or file!). (The lists here are shown side by side only to save space, in real life, each of the 7 lists occurs alone and independently from the others and are separate files!)

list1  list2  list3  list4  list5  list6  list7
1      0      0      0      0      0      0
2      1      1      1      1      1      1
3      2      2      2      2      2      2
4      3      3      3      3      3      3
4      4      4      4      4      4      4
4      4      4      4      4      4      4
5      4      4      4      4      4      4
6      5      5      5      5      5      5
7      5      5      5      5      5      5
8      6      6      6      6      6      6
9      6      6      6      6      6      6
abc    7      7      7      7      7      7
abd    8      8      8      8      8      8
abd    9      9      9      9      9      9
abe              9      9      9      9
                               99     99
                                          

[Lines of multiple occurence of above lists:]         
4      4      4      4      4      4      4
4      4      4      4      4      4      4
4      4      4      4      4      4      4
abd    5      5      5      5      5      5
abd    5      5      5      5      5      5
       6      6      6      6      6      6
       6      6      6      6      6      6
                     9      9      9      9
                     9      9      9      9

There are many solutions to eliminate duplicates (e.g., TextFX; notepad++ delete duplicate and original lines to keep unique lines), I can not find solutions to keep duplicates only.

((.*)\R(\2\R)+)*\K.+\R @Lars Fischer: This script works nearly OK, except the last entry of the (presorted) list needs to be unique line followed by a empty line. One (suboptimal) workaround is to insert an artificial (helper) unique line (e.g., zzz) followed by an empty line as the last two lines.

(END OF QUESTION)

UPDATE 3: This question is reposted per stackoverflow "ask a new question" instruction. (@AdrianHHH, @B. Desai, @Paolo Forgia, @greg-449, @Erik von Asmuth draw the incorrect conclusion that this question is a duplicate of notepad++ delete duplicate and original lines to keep unique lines. This question is definitely not a duplicate of the one @AdrianHHH et al quotes.

UPDATE 2: @AdrianHHH This question is not less "broad" (in fact, one can hardly be more specific) or less researched than other Notepad++ questions, including the one https://stackoverflow.com/questions/29303148 cited (wrongly) by @AdrianHHH et al. as the same question.

UPDATE: @AdrianHHH, @B. Desai, @Paolo Forgia, @greg-449, @Erik von Asmuth This questions is different from: https://stackoverflow.com/questions/29303148 beacuse Q 29303148 is (i) neither asking how to identify and keep only the lines of multiple occurrence, (ii) neither there is a solution provided in the answers for that. Q 29303148 asks "...I just need the unique lines."

Lars Fischer · Accepted Answer

Here is a solution based on regular Expressions and bookmarks, it works for a sorted file (i.e. each duplicated line is followed by its duplicates):

Open the Mark Dialog (Search -> Mark ....)
click Clear all Marks on the right
check Bookmark line
check Wrap aound
Find What: ((.*)\R(\2\R?)+)*\K.*
Check regular expression and uncheck . matches newline
Mark All
Click Close
Search -> Bookmark -> Remove Bookmarked Lines

Explanation

The regular expression is made up of three parts:

((.*)\R(\2\R?)+)* : this is an optional block of duplicates consisting of one ore more line blocks
- the outher ( ... )* matches zero or more such blocks of duplicated lines (if in your example the three 4 would be followed by two 5 we will need a concept of sequences of duplicate blocks)
- (.*)\R(\2\R?)+: \2 references the content of (.*): this are all duplicates of one line
- the second \R is an optional ( due to the ?) linebreak. Thus it is possible to match a duplicate in the last line of the file if that line does not end with a linebreak
If there is a block of duplicated lines after the cursor position from which you start, this will match it.
now \K discards what we have matched so far (the duplicates) and "puts the cursor" before the first unique line
.* matches the next (unique) line and bookmarks it

Using Mark All we bookmark all such unique lines, so that we can remove them using the Entry from the Search -> Bookmark menu.

Find and KEEP all DUPLICATE lines (instead of unique lines) in a text file

Answers (1)

Related Questions