Reputation: 2677
Does anyone know how to match an arbitrary string and then remove any re-occurrences of that same string on each line of a file?
Essentially I have a file:
00101 blah 0000202 thisisasentencethisisasentence 99929
00102 blah 0000202 thisisasentenc1thisisasentenc1 999292
I want to remove the duplicate sentence so it returns:
00101 blah 0000202 thisisasentence 99929
00102 blah 0000202 thisisasentenc1 999292
The width isn't fixed or anything like that.
I think the question "Removing duplicate strings/words (not lines) using RegEx (notepad++)" is close, but I don't understand regex well: the pattern from it highlights everything in the file except the last line, so it correctly finds the duplicate but only once.
Note that I can also use the following pattern to identify which part of each line is duplicated. It highlights the duplicated value (thisisasentencethisisasentence), but I don't know how to split it:
(.{5,})\1
Any help would be appreciated, thanks.
EDIT: I can reformat the file to be comma-delimited, to some extent (note that with this there is a chance a comma exists in the duplicated string, but don't worry about that):
00101,blah,0000202,thisisasentencethisisasentence,99929
00102,blah,0000202,thisisasentenc1thisisasentenc1,999292
Upvotes: 2
Views: 1152
Reputation: 89557
You can use this pattern in notepad++ with an empty string as replacement:
^(?>\S+[^\S\n]+){3,}?(\S+?)\K\1(?!\S)
pattern details:
^ # anchors for the start of the line (by default in notepad++)
(?> # atomic group: a column and the following space
\S+ # all that is not a white-space character
[^\S\n]+ # white-spaces except newlines
){3,}? # repeat 3 or more times (non-greedy) until you find
(\S+?)\K\1(?!\S) # a column with a duplicate
details of the last subpattern:
(\S+?) # capture one or more non-white characters
# (non-greedy: until \1(?!\S) succeeds)
\K # discard all previous characters from whole match result
\1 # back-reference to the capture group 1
(?!\S) # ensure that the end of the column is reached
Note: using {5,} instead of + in \S+? (so \S{5,}?) is a good idea if you are sure that columns contain at least five characters.
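For anyone who wants to try the same idea outside Notepad++, here is a rough Python sketch. Python's re module has no \K, so this version captures the leading columns in a group and writes them back in the replacement instead, and it uses a plain non-capturing group rather than an atomic one. That is a substitution on my part, not the pattern above verbatim, and the function name is made up for illustration.

```python
import re

# Assumption: Python's `re` lacks \K, so the leading columns are captured
# in group 1 and re-emitted in the replacement instead of being discarded.
DUP_COLUMN = re.compile(
    r'^((?:\S+[^\S\n]+){3,}?)'  # group 1: three or more leading columns (lazy)
    r'(\S+?)\2(?!\S)',          # group 2: a column made of the same text twice
    re.MULTILINE,
)

def dedupe_column(text: str) -> str:
    """Hypothetical helper: keep one copy of a doubled column on each line."""
    return DUP_COLUMN.sub(r'\1\2', text)

sample = (
    "00101 blah 0000202 thisisasentencethisisasentence 99929\n"
    "00102 blah 0000202 thisisasentenc1thisisasentenc1 999292\n"
)
print(dedupe_column(sample), end="")
```

As with the Notepad++ pattern, only the first doubled column found after the third column on each line is touched.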
Upvotes: 1
Reputation: 44833
You say you are happy with what (.{5,})\1 matches. So, just use $1 as the replacement value. It will replace the repeated part and its copy with a single copy of the text.
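If it helps to check the behaviour before running it on the real file, the same replacement can be sketched in Python, where the backreference in the replacement is written \1 rather than $1. The function name below is only for illustration.

```python
import re

# Same pattern as in the question: a run of 5+ characters immediately repeated.
REPEATED_RUN = re.compile(r'(.{5,})\1')

def collapse_repeats(line: str) -> str:
    """Hypothetical helper: replace 'XX' with 'X' for any run X of 5+ chars."""
    return REPEATED_RUN.sub(r'\1', line)

for line in (
    "00101 blah 0000202 thisisasentencethisisasentence 99929",
    "00102 blah 0000202 thisisasentenc1thisisasentenc1 999292",
):
    print(collapse_repeats(line))
```

One thing to keep in mind: the pattern is not tied to a column, so any run of five or more characters that is immediately repeated anywhere on the line gets collapsed, including runs that contain spaces.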
Upvotes: 0