Agos FS

Reputation: 127

Extract All Unique Lines

I have text files with repeated exact lines of text, but I only want one of each. Imagine this text file:

AAAAA
AAAAA
AAAAA
BB
BBBBB
BBBBB
CCC
CCC
CCC

I would only need the following four lines from it:

AAAAA
BB
BBBBB
CCC

I'm using a text editor (EmEditor or Notepad++) that supports regex, not a programming language, so I need a pure regular-expression solution.

Any help?

EDIT: I checked the other thread that hsz mentioned and I'd like to make it clear that this one is not the same. Although both need to remove duplicate lines, the way to achieve it is different. I need pure regex, but the best answer in the other thread relies on a specific Notepad++ plug-in (which no longer ships with it), so it's not a regex solution at all. The second answer there is a regex and it does work in Notepad++, but not in EmEditor, which I also need. So I don't think my question is a duplicate of that one, although the link is useful, and so I thank hsz for it.

Upvotes: 12

Views: 8881

Answers (4)

zx81

Reputation: 41838

Two nearly identical options:

Match All Lines That Are Not Repeated

(?sm)(^[^\r\n]+$)(?!.*^\1$)

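This matches every line that has no identical copy later in the file, i.e. the last occurrence of each distinct line. Matching alone does not extract anything in an editor, though, so in practice you want to delete the other lines instead.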
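For anyone who wants to check the pattern outside an editor, here is a minimal Python sketch of it. The function name `last_occurrences` and the sample string are only for illustration, and it assumes the text uses plain \n line endings; editor regex engines may treat \r\n differently.

```python
import re

# The sample text from the question (assumed to use "\n" line endings).
TEXT = "AAAAA\nAAAAA\nAAAAA\nBB\nBBBBB\nBBBBB\nCCC\nCCC\nCCC"

# Same pattern as above: a non-empty line with no identical line after it.
LAST_OCCURRENCE = re.compile(r"(?sm)(^[^\r\n]+$)(?!.*^\1$)")


def last_occurrences(text):
    """Return each line that has no identical copy later in the text."""
    return [m.group(1) for m in LAST_OCCURRENCE.finditer(text)]


print(last_occurrences(TEXT))  # -> ['AAAAA', 'BB', 'BBBBB', 'CCC']
```

Note that when duplicates are interleaved, the result follows the order of the last occurrences, not the first ones.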

Replace All Repeated Lines

This will work better in Notepad++:

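Search: (?sm)(^[^\r\n]*)[\r\n](?=.*^\1$)

Replace: empty string

Each match is an earlier copy of a line that appears again further down, so the last copy of every line is the one that survives.

  • (?s) activates DOTALL mode, allowing the dot to match across lines
  • (?m) turns on multi-line mode, allowing ^ and $ to match on each line
  • (^[^\r\n]*) captures a line to Group 1, i.e.:
      • The ^ anchor asserts that we are at the beginning of a line
      • [^\r\n]* matches any chars that are not newline chars
  • [\r\n] matches the newline char that ends the line
  • The lookahead (?=.*^\1$) asserts that we can match any number of characters .*, then...
  • ^\1$ the same line as Group 1. The closing $ keeps a short line such as BB from being treated as a duplicate of a longer line that starts with it, such as BBBBB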
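A quick way to sanity-check the replace approach is to run it through Python's `re.sub`. This is only a sketch: `drop_earlier_duplicates` is a made-up name, the input is assumed to use \n line endings, and the lookahead is written with a closing `$` so that BB is not mistaken for a duplicate of BBBBB.

```python
import re

# The sample text from the question (assumed to use "\n" line endings).
SAMPLE = "AAAAA\nAAAAA\nAAAAA\nBB\nBBBBB\nBBBBB\nCCC\nCCC\nCCC"

# A line plus its newline, but only if the same whole line occurs again later.
EARLIER_DUPLICATE = re.compile(r"(?sm)(^[^\r\n]*)[\r\n](?=.*^\1$)")


def drop_earlier_duplicates(text):
    """Delete every line that is repeated further down; the last copy stays."""
    return EARLIER_DUPLICATE.sub("", text)


print(drop_earlier_duplicates(SAMPLE))
# AAAAA
# BB
# BBBBB
# CCC
```

Because the lookahead rescans the rest of the text for every line, this gets slow on very large files; for those, a small script with a set of seen lines is likely the better tool.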

Upvotes: 14

hwnd

Reputation: 70732

You can use the following regular expression to collapse each run of consecutive identical lines, including a run of empty lines, into a single line.

Find: ^(.*)(\r?\n\1)+$
Replace: \1
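As a rough check of what this find/replace pair does, here is a small Python sketch. The name `collapse_adjacent` is hypothetical, and `re.MULTILINE` stands in for the editor's line-by-line handling of ^ and $, which is an assumption about how the editor applies the pattern.

```python
import re

# A line followed by one or more identical lines directly below it.
ADJACENT_RUN = re.compile(r"^(.*)(\r?\n\1)+$", re.MULTILINE)


def collapse_adjacent(text):
    """Replace each run of identical consecutive lines with one copy."""
    return ADJACENT_RUN.sub(r"\1", text)


sample = "AAAAA\nAAAAA\nAAAAA\nBB\nBBBBB\nBBBBB\nCCC\nCCC\nCCC"
print(collapse_adjacent(sample))         # AAAAA, BB, BBBBB, CCC on separate lines
print(repr(collapse_adjacent("A\n\n\n\nB")))  # -> 'A\n\nB'
```

The second call shows that a run of blank lines shrinks to one blank line rather than disappearing.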

Upvotes: 4

Braj

Reputation: 46861

I don't know whether it will work in Notepad++ or EmEditor, but it works fine in PHP/JavaScript/Python with substitution.

^(.+)(\n(\1))*$

Here is Demo

Simply paste your text there and copy the final result from the link I shared.

Upvotes: -1

Alexander Gelbukh

Reputation: 2240

Provided that equal lines are grouped together (that is, AAAA AAAA BBBB BBBB and not AAAA BBBB AAAA BBBB), the following works, in Perl notation:

s/(^.*$)(\r?\n\1$)*/$1/gm;

which means: substitute /(^.*$)(\r?\n\1$)*/ with $1, globally and in multiline mode (so ^ and $ also match at internal \n).

This expression means that any complete line followed by any number of equal lines is substituted by a single occurrence.
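To illustrate the grouping condition, here is a hedged Python translation of the Perl substitution. `collapse_grouped` is just an illustrative name, and `re.MULTILINE` plays the role of the /m flag; `re.sub` already replaces globally, like /g.

```python
import re

# A complete line followed by any number of identical lines.
GROUPED = re.compile(r"(^.*$)(\r?\n\1$)*", re.MULTILINE)


def collapse_grouped(text):
    """Keep one copy of each block of equal consecutive lines."""
    return GROUPED.sub(r"\1", text)


# Equal lines grouped together: each group shrinks to one line.
print(repr(collapse_grouped("AAAA\nAAAA\nBBBB\nBBBB")))  # -> 'AAAA\nBBBB'

# Equal lines interleaved: nothing is adjacent, so nothing is removed.
print(repr(collapse_grouped("AAAA\nBBBB\nAAAA\nBBBB")))  # -> 'AAAA\nBBBB\nAAAA\nBBBB'
```

If the duplicates in a file are not already adjacent, sorting the lines first (where the order does not matter) puts them into groups so that this kind of pattern applies.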

See help on your particular editor for how to apply such a regex.

Upvotes: 0
