Reputation: 41
^(.*)(\r?\n\1)+$
replace with \1
The above is a great way to remove duplicate lines using regex, but it requires the entire line to be a duplicate.
However, what would I use if I want to detect and remove dups when the line as a whole is not a dup, but just the first X characters are?
Example: Original File
12345 Dennis Yancey University of Miami
12345 Dennis Yancey University of Milan
12345 Dennis Yancey University of Rome
12344 Ryan Gardner University of Spain
12347 Smith John University of Canada
Dups Removed
12345 Dennis Yancey University of Miami
12344 Ryan Gardner University of Spain
12347 Smith John University of Canada
Upvotes: 4
Views: 356
Reputation: 18490
How about using a second group to check, e.g., the first 10 characters:
^((.{10}).*)(?:\r?\n\2.*)+
Here {n} specifies the number of characters from the start of the line that should be checked for duplicates. The whole first line is captured in $1, which is also used as the replacement.
Another idea would be to use a lookahead and replace with an empty string:
^(.{10}).*\r?\n(?=\1)
This one will just drop the current line if the captured $1 is ahead at the start of the next line, so the last line of each run of duplicates is the one that remains.
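To compare the two patterns above, here is a minimal Python sketch run against the sample data from the question. The function names keep_first and keep_last are made up for illustration, and it assumes Python's re module with re.MULTILINE so that ^ matches at every line start; other regex flavors may need a different flag or replacement syntax.

```python
import re

# Sample data copied from the question.
SAMPLE = (
    "12345 Dennis Yancey University of Miami\n"
    "12345 Dennis Yancey University of Milan\n"
    "12345 Dennis Yancey University of Rome\n"
    "12344 Ryan Gardner University of Spain\n"
    "12347 Smith John University of Canada\n"
)


def keep_first(text: str, n: int = 10) -> str:
    """Collapse each run of lines sharing the first n characters to its first line."""
    # Group 1 = the whole first line, group 2 = its first n characters.
    pattern = re.compile(r"^((.{%d}).*)(?:\r?\n\2.*)+" % n, re.MULTILINE)
    return pattern.sub(r"\1", text)


def keep_last(text: str, n: int = 10) -> str:
    """Drop every line whose first n characters reappear at the start of the next line."""
    pattern = re.compile(r"^(.{%d}).*\r?\n(?=\1)" % n, re.MULTILINE)
    return pattern.sub("", text)


if __name__ == "__main__":
    print(keep_first(SAMPLE))  # the Miami line survives for 12345
    print(keep_last(SAMPLE))   # the Rome line survives for 12345
```

The group-based pattern reproduces the "Dups Removed" output from the question; the lookahead variant keeps the last line of each run instead, which may or may not matter for your data.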
To also remove duplicate lines that are shorter than 10 characters, a PCRE idea using conditionals:
^(?:(.{10})|(.{0,9}$)).*+\r?\n(?(1)(?=\1)|(?=\2$))
and replace with an empty string.
If your regex flavor supports possessive quantifiers, using .*+ will improve performance.
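The conditional pattern can also be tried outside PCRE. The following is only a sketch in Python, whose re module supports (?(1)...|...) conditionals; the possessive .*+ is swapped for a plain .* here, since possessive quantifiers need Python 3.11 or later. The name drop_prefix_dups and the test string are invented for the example.

```python
import re

# Same idea as the PCRE pattern, with ".*+" replaced by ".*" (assumption:
# this keeps the sketch working on Python versions before 3.11).
SHORT_AWARE = re.compile(
    r"^(?:(.{10})|(.{0,9}$)).*\r?\n(?(1)(?=\1)|(?=\2$))",
    re.MULTILINE,
)


def drop_prefix_dups(text: str) -> str:
    """Drop a line if the next line repeats its first 10 characters,
    or, for lines shorter than 10 characters, repeats the whole line."""
    return SHORT_AWARE.sub("", text)


if __name__ == "__main__":
    text = "abc\nabc\n12345 Dennis A\n12345 Dennis B\nxyz\n"
    print(drop_prefix_dups(text))
```

Group 1 only participates for lines with at least 10 characters, so the conditional picks the prefix lookahead for long lines and the whole-line lookahead for short ones. Like the plain lookahead pattern, this keeps the last line of each run.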
Be aware that all these patterns (and your current regex) only target consecutive duplicate lines.
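To illustrate that limitation, here is a small Python sketch in which the duplicates are not adjacent. The regex leaves the text unchanged, while a plain loop with a set of already seen prefixes removes the later duplicate. The loop is not one of the regex patterns above, just one possible alternative; collapse_adjacent and dedupe_by_prefix are hypothetical names.

```python
import re


def collapse_adjacent(text: str, n: int = 10) -> str:
    """Regex approach: only merges lines that directly follow each other."""
    pattern = re.compile(r"^((.{%d}).*)(?:\r?\n\2.*)+" % n, re.MULTILINE)
    return pattern.sub(r"\1", text)


def dedupe_by_prefix(lines, n=10):
    """Non-regex alternative: keep the first line seen for each n-character prefix."""
    seen = set()
    kept = []
    for line in lines:
        key = line[:n]
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


if __name__ == "__main__":
    text = (
        "12345 Dennis Yancey University of Miami\n"
        "12344 Ryan Gardner University of Spain\n"
        "12345 Dennis Yancey University of Rome\n"
    )
    print(collapse_adjacent(text) == text)          # duplicates are not adjacent
    print(dedupe_by_prefix(text.splitlines()))      # Rome line is dropped
```

Sorting the file first would also make the duplicates consecutive, at the cost of losing the original line order.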
Upvotes: 3