Delete repeated value containing lines after keeping the first line

Question

I have a large tab-separated file like this:

input.txt

a   b   c
s   t   e
a   b   c
f   q   y
r   e   x

To delete the repeated lines (rows) in this file, I use:

my %seen;
my @lines;

while (<>) {
    my @cols = split /\s+/;
    unless ($seen{$cols[0]}++) {
        push @lines, $_;
    }
}

print @lines;

The output here is:

a   b   c
s   t   e
f   q   y
r   e   x

Now I also want to delete any line that contains a repeated value (meaning a value that has already appeared anywhere in the rows above, in any column; here "e") and keep only the first line containing that value. Please suggest the preferred approach, keeping in mind that my input file is very large, with many columns and rows.

The output that I want for the above input.txt would be:

a   b   c
s   t   e
f   q   y

Thank you

Borodin · Accepted Answer

As I wrote in my comments, split /\s+/ is very rarely correct
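
A small illustration of the difference, using a made-up line that starts with whitespace (the sample line is an assumption for the demonstration, not taken from the question's data): split /\s+/ returns an empty leading field, whereas split ' ' (which is also what a bare split does) skips leading whitespace

```perl
use strict;
use warnings;

# Hypothetical input line with leading whitespace
my $line = "  a\tb\tc\n";

my @by_regex = split /\s+/, $line;   # leading empty field is kept
my @by_space = split ' ',   $line;   # leading whitespace is skipped

print scalar(@by_regex), "\n";   # 4 fields: '', 'a', 'b', 'c'
print scalar(@by_space), "\n";   # 3 fields: 'a', 'b', 'c'
```

With the /\s+/ form, $cols[0] would be the empty string for every such line, so all of them would look like duplicates of each other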

And the solution you have mishandles lines with duplicate fields

It's also more efficient to replace grep with any from the core List::Util module

I suggest that you store the fields of each line in a hash %cols, like this

use strict;
use warnings 'all';

use List::Util 'any';

my ( @lines, %seen );

while ( <DATA> ) {

    my %cols = map { $_ => 1 } split;

    push @lines, $_ unless any { $seen{$_}++ } keys %cols;
}

print for @lines;

__DATA__
a   b   c
p   p   p
p   q   r
s   t   e
a   b   c
f   q   y
r   e   x

output

a   b   c
p   p   p
s   t   e

Even this may not be what you want, as the line f q y is omitted because q has already been "seen" in the omitted line p q r. You will have to clarify the required behaviour in this situation
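
If the intended behaviour is that only values from lines that are actually kept count as "seen", one possible sketch is shown here. This interpretation is an assumption, although it does reproduce the output requested in the question. The helper name filter_stream is hypothetical. The lookup and the recording are done in separate steps, partly because any stops at the first true result, so a block with a side effect such as $seen{$_}++ may not reach every field of a line, and the order of keys %cols is not guaranteed. Lines are printed as they are read rather than collected in an array, which may help with a very large file

```perl
use strict;
use warnings;

use List::Util 'any';

# Hypothetical helper: copy lines from $in to $out, dropping any line
# that shares a field with a line that has already been kept.
sub filter_stream {
    my ( $in, $out ) = @_;
    my %seen;

    while ( my $line = <$in> ) {
        my @fields = split ' ', $line;
        next unless @fields;                  # ignore blank lines
        next if any { $seen{$_} } @fields;    # lookup only, no side effects
        $seen{$_} = 1 for @fields;            # remember fields of kept lines only
        print {$out} $line;                   # print at once instead of buffering
    }
    return;
}

# Demo on the sample data from the answer, read through an in-memory filehandle
my $sample = "a\tb\tc\n" . "p\tp\tp\n" . "p\tq\tr\n" . "s\tt\te\n"
           . "a\tb\tc\n" . "f\tq\ty\n" . "r\te\tx\n";

open my $in, '<', \$sample or die "Cannot open in-memory handle: $!";
filter_stream( $in, \*STDOUT );
close $in;

# Prints the lines: a b c / p p p / s t e / f q y
```

For a real file, filter_stream( \*STDIN, \*STDOUT ) with the input redirected from input.txt should work the same way. Note that %seen still grows with the number of distinct values in the kept lines, so memory use depends on the data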
