John Kens
John Kens

Reputation: 1679

Remove duplicate lines containing same starting text

So I have a massive list of numbers where all lines contain the same format.

#976B4B|B|0|0
#970000|B|0|1
#974B00|B|0|2
#979700|B|0|3
#4B9700|B|0|4
#009700|B|0|5
#00974B|B|0|6
#009797|B|0|7
#004B97|B|0|8
#000097|B|0|9
#4B0097|B|0|10
#970097|B|0|11
#97004B|B|0|12
#970000|B|0|13
#974B00|B|0|14
#979700|B|0|15
#4B9700|B|0|16
#009700|B|0|17
#00974B|B|0|18
#009797|B|0|19
#004B97|B|0|20
#000097|B|0|21
#4B0097|B|0|22
#970097|B|0|23
#97004B|B|0|24
#2C2C2C|B|0|25
#979797|B|0|26
#676767|B|0|27
#97694A|B|0|28
#020202|B|0|29
#6894B4|B|0|30
#976B4B|B|0|31
#808080|B|1|0
#800000|B|1|1
#803F00|B|1|2
#808000|B|1|3

What I am trying to do is remove all duplicate lines that contain the same hex codes, regardless of the text after it.

Example, in the first line #976B4B|B|0|0 the hex #976B4B shows up in line 32 as #976B4B|B|0|31. I want all lines EXCEPT the first occurrence to be removed.

I have been attempting to use regex to solve this, and found ^(.*)(\r?\n\1)+$ $1 can remove duplicate lines but obviously not what I need. Looking for some guidance and maybe a possibility to learn from this.

Upvotes: 1

Views: 626

Answers (1)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 627468

You can use the following regex replacement, make sure you click Replace All as many times as necessary, until no match is found:

Find What: ^((#[[:xdigit:]]+)\|.*(?:\R.+)*?)\R\2\|.*
Replace With: $1

See the regex demo and the demo screenshot:

enter image description here

Details:

  • ^ - start of a line
  • ((#[[:xdigit:]]+)\|.*(?:\R.+)*?) - Group 1 ($1, it will be kept):
    • (#[[:xdigit:]]+) - Group 2: # and one or more hex chars
    • \| - a | char
    • .* - the rest of the line
    • (?:\R.+)*? - any zero or more non-empty lines (if they can be empty, replace .+ with .*)
  • \R\2\|.* - a line break, Group 2 value, | and the rest of the line.

Upvotes: 2

Related Questions