J.Carter

Reputation: 331

Delete lines containing repeated values, keeping only the first such line

I have a large tab-separated file like this:

input.txt

a   b   c
s   t   e
a   b   c
f   q   y
r   e   x

To delete the repeated lines (rows) in this file, I use:

my %seen;
my @lines;

while (<>) {
    my @cols = split /\s+/;
    unless ($seen{$cols[0]}++) {
        push @lines, $_;
    }
}

print @lines;

The output here is:

a   b   c
s   t   e
f   q   y
r   e   x

Now I also want to delete any line that contains a repeated value, meaning a value that has already appeared anywhere in an earlier row or column (here "e"), and keep only the uppermost line containing that value. Please suggest the preferred approach, keeping in mind that my input file is very large, with many columns and rows.

The output that I want for the above input.txt would be:

a   b   c
s   t   e
f   q   y

Thank you

Upvotes: 1

Views: 81

Answers (2)

Borodin

Reputation: 126722

As I wrote in my comments, split /\s+/ is very rarely correct.
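To illustrate the point about split, here is a minimal sketch comparing split /\s+/ with the awk-style split ' ' (which is what a bare split does on $_). The sample line with leading whitespace is a hypothetical input chosen for illustration, not taken from the question.

```perl
use strict;
use warnings;

# Hypothetical sample line with leading whitespace, for illustration only.
my $line = "  a\tb\tc\n";

# A regex separator that matches at the start of the string
# produces an empty leading field.
my @with_regex = split /\s+/, $line;

# The special pattern ' ' skips leading whitespace first,
# exactly as a bare "split" does when applied to $_.
my @awk_style = split ' ', $line;

printf "split /\\s+/ gives %d fields\n", scalar @with_regex;   # 4 fields: '', a, b, c
printf "split ' '   gives %d fields\n", scalar @awk_style;     # 3 fields: a, b, c
```

With split /\s+/ the empty first field would itself be counted as a "value", so every later line that starts with whitespace would look like a repeat.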

The solution you have also mishandles lines with duplicate fields.
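Assuming "the solution you have" refers to the grep-based test over all fields, the problem can be reproduced with a short sketch. The helper name keep_with_grep is hypothetical and exists only for this demonstration.

```perl
use strict;
use warnings;

my %seen;

# Hypothetical helper: returns true if the line would be kept
# by a test of the form grep { $seen{$_}++ } @cols.
sub keep_with_grep {
    my ($line) = @_;
    my @cols = split ' ', $line;

    # Count the fields that were already seen, marking each field as we go.
    # A field repeated within the same line is counted on its second visit.
    my $hits = grep { $seen{$_}++ } @cols;
    return $hits == 0;
}

# "p" has never appeared before, yet the line is rejected because
# the second "p" finds the mark left by the first one.
print keep_with_grep("p p p\n") ? "kept\n" : "dropped\n";   # prints "dropped"
```

Reducing each line to its distinct fields before testing avoids this.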

It's also more efficient to replace grep with any from the core List::Util module, because any stops at the first match.

I suggest that you store the fields of each line in a hash %cols, like this:

use strict;
use warnings 'all';

use List::Util 'any';

my ( @lines, %seen );

while ( <DATA> ) {

    my %cols = map { $_ => 1 } split;

    push @lines, $_ unless any { $seen{$_}++ } keys %cols;
}

print for @lines;

__DATA__
a   b   c
p   p   p
p   q   r
s   t   e
a   b   c
f   q   y
r   e   x

output

a   b   c
p   p   p
s   t   e

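Even this may not be what you want: the line f q y is omitted when q has already been "seen" in the omitted line p q r. Note that any stops at the first match, so whether q gets marked there depends on the order in which keys %cols returns the fields, and the output above is not guaranteed. You will have to clarify the required behaviour in this situation.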
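Until that is clarified, here is a hedged sketch that makes the choice explicit. It tests the fields without modifying %seen and marks them in a separate step, so the result does not depend on hash key order. The helper filter_lines and its $mark_rejected flag are assumptions introduced for illustration; they are not part of the code above.

```perl
use strict;
use warnings;

use List::Util 'any';

# Hypothetical helper: keep a line only if none of its distinct fields
# has been seen before. $mark_rejected (an assumed flag) decides whether
# the fields of dropped lines also count as "seen".
sub filter_lines {
    my ($lines, $mark_rejected) = @_;
    my (%seen, @kept);

    for my $line (@$lines) {
        # Distinct fields of this line, so "p p p" is not a self-duplicate.
        my %cols   = map { $_ => 1 } split ' ', $line;
        my @fields = keys %cols;

        # Test only; %seen is not changed here, so key order is irrelevant.
        my $dup = any { $seen{$_} } @fields;

        push @kept, $line unless $dup;

        # Mark the fields of kept lines, and of dropped lines if requested.
        if (!$dup || $mark_rejected) {
            $seen{$_} = 1 for @fields;
        }
    }
    return @kept;
}

my @input = map { "$_\n" }
    'a b c', 'p p p', 'p q r', 's t e', 'a b c', 'f q y', 'r e x';

print "Fields of dropped lines count as seen:\n", filter_lines(\@input, 1);
print "Fields of dropped lines are ignored:\n",   filter_lines(\@input, 0);
```

With the flag set, f q y is dropped because q was marked by the dropped line p q r. With the flag clear, f q y is kept.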

Upvotes: 2

PerlDuck

Reputation: 5730

You also need to iterate over the @cols and examine every item instead of just the first one, $cols[0]. You need something like

unless ($seen{$cols[0]}++ || $seen{$cols[1]}++ || $seen{$cols[2]}++ ...) {
    push @lines, $_;
}

Of course that would be bad style and impossible if you don't know the number of columns in advance.

I would do it with grep:

my %seen;
my @lines;

while (<DATA>) {
    my @cols = split /\s+/;
    unless ( grep { $seen{$_}++ } @cols ) {
        push @lines, $_;
    }
}

print @lines;


__DATA__
a   b   c
s   t   e
a   b   c
f   q   y
r   e   x

Output:

a   b   c
s   t   e
f   q   y

grep processes the code between the curlies { $seen{$_}++ } for each element in the list @cols and returns (in scalar context) the number of items that evaluated to true.

It's not the fastest approach because it always iterates over the whole array (even if the first evaluation would be sufficient for your particular test). But give it a try; perhaps it's fast enough for you.
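Since the input file is very large, one possible variation is to print each accepted line immediately instead of collecting everything in @lines. The sketch below keeps the same grep test and only changes where the output goes. The helper name filter_stream and the in-memory sample are assumptions for illustration. Note that %seen still grows with the number of distinct values.

```perl
use strict;
use warnings;

# Hypothetical helper: copy accepted lines from $in to $out as they are read,
# so only %seen (not the whole file) is held in memory.
sub filter_stream {
    my ($in, $out) = @_;
    my %seen;

    while (my $line = <$in>) {
        my @cols = split ' ', $line;

        # Same test as above: count already-seen fields, marking every field.
        my $repeats = grep { $seen{$_}++ } @cols;

        print {$out} $line unless $repeats;
    }
    return;
}

# Usage sketch: read the sample data from an in-memory filehandle.
my $data = "a b c\ns t e\na b c\nf q y\nr e x\n";
open my $in, '<', \$data or die "Cannot open in-memory file: $!";
filter_stream($in, \*STDOUT);
close $in;
```

For a real file you would open input.txt for reading and pass that handle instead of the in-memory one.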

Upvotes: 4
