n34_panda

Reputation: 2677

Notepad++ Regex to find string on a line and remove duplicates of the exact string

Does anyone know how to match an arbitrary string and then remove any re-occurrences of the same string on each line of a file?

Essentially I have a file:

00101  blah 0000202 thisisasentencethisisasentence 99929
00102  blah 0000202 thisisasentenc1thisisasentenc1 999292

I want to remove the duplicate sentence so it returns:

00101  blah 0000202 thisisasentence 99929
00102  blah 0000202 thisisasentenc1 999292

The width isn't fixed or anything like that.

I think the answer to "Removing duplicate strings/words (not lines) using RegEx (notepad++)" is close, but I don't understand regex well: it highlights everything in the file except the last line, correctly finding the duplicate but only once.

Note that I can also use the following to identify which part of each line is duplicated. It highlights the duplicated values (thisisasentencethisisasentence), but I don't know how to split them:

(.{5,})\1

Any help would be appreciated, thanks.

EDIT: I can reformat the file to be comma-delimited (to some extent). Note that with this there is a chance that a comma exists in the duplicated string, but don't worry about that:

00101,blah,0000202,thisisasentencethisisasentence,99929
00102,blah,0000202,thisisasentenc1thisisasentenc1,999292

Upvotes: 2

Views: 1152

Answers (2)

Casimir et Hippolyte

Reputation: 89557

You can use this pattern in notepad++ with an empty string as replacement:

^(?>\S+[^\S\n]+){3,}?(\S+?)\K\1(?!\S)


pattern details:

^        # anchors for the start of the line (by default in notepad++)
(?>            # atomic group: a column and the following space
    \S+          # all that is not a white-space character 
    [^\S\n]+     # white-spaces except newlines
){3,}?         # repeat 3 or more times (non-greedy) until you find
(\S+?)\K\1(?!\S)  # a column with a duplicate

details of the last subpattern:

(\S+?)   # capture one or more non-white characters
         # (non-greedy: until \1(?!\S) succeeds)
\K       # discard all previous characters from whole match result
\1       # back-reference to the capture group 1
(?!\S)   # ensure that the end of the column is reached

Note: using {5,} instead of + in \S+? (so \S{5,}?) is a good idea, if you are sure that columns contain at least five characters.
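For anyone who wants to try the same idea outside Notepad++, here is a minimal Python sketch. It is an adaptation, not the exact pattern above: Python's built-in `re` module has no `\K`, so the leading columns are captured in group 1 and written back in the replacement, and a plain non-capturing group stands in for the atomic group (which should not change what matches on this kind of input). The names `DUP_COLUMN` and `dedupe_column` are made up for illustration.

```python
import re

# Adaptation of the Notepad++ pattern for Python's `re` module:
#   group 1 = the leading columns (at least three) with their trailing blanks
#   group 2 = the first half of a column made of the same text twice
# `\2(?!\S)` requires the repeat to end exactly where the column ends.
DUP_COLUMN = re.compile(
    r"^((?:\S+[^\S\n]+){3,}?)(\S+?)\2(?!\S)",
    re.MULTILINE,
)


def dedupe_column(text: str) -> str:
    """Keep the leading columns and one copy of the doubled column."""
    return DUP_COLUMN.sub(r"\1\2", text)


sample = (
    "00101  blah 0000202 thisisasentencethisisasentence 99929\n"
    "00102  blah 0000202 thisisasentenc1thisisasentenc1 999292"
)
print(dedupe_column(sample))
```

Because the repeat has to fill a whole column, a line such as `a b c hello hello x` (two separate identical columns) is left alone by this version.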

Upvotes: 1

elixenide

Reputation: 44833

You say you are happy with what (.{5,})\1 matches. So, just use $1 as the replacement value. It will automatically replace the repeated part and its copy with a single copy of the text.
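The same find-and-replace can be sketched in Python to check it against the sample lines from the question. This is only an illustration under assumptions: the helper name `collapse_repeats` is made up, and Python writes the replacement back-reference as `\1` where Notepad++ accepts `$1`.

```python
import re

# Any run of 5+ characters that is immediately followed by an identical copy.
DUPLICATE = re.compile(r"(.{5,})\1")


def collapse_repeats(line: str) -> str:
    """Replace each 'XX' with a single 'X' (X being at least 5 characters)."""
    return DUPLICATE.sub(r"\1", line)


lines = [
    "00101  blah 0000202 thisisasentencethisisasentence 99929",
    "00102  blah 0000202 thisisasentenc1thisisasentenc1 999292",
]
for line in lines:
    print(collapse_repeats(line))
```

Note that this pattern is not column-aware: the repeated text may include spaces, so two identical adjacent columns would also be collapsed into one.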

Upvotes: 0
