Reputation: 331
I have a large tab-separated file like this:
input.txt
a b c
s t e
a b c
f q y
r e x
To delete the repeated lines (rows) in this file, I use:
my %seen;
my @lines;
while (<>) {
    my @cols = split /\s+/;
    unless ($seen{$cols[0]}++) {
        push @lines, $_;
    }
}
print @lines;
The output here is:
a b c
s t e
f q y
r e x
Now I also want to delete lines that contain a repeated value, meaning a value that has already appeared anywhere in the rows above, in any column (here "e"), and keep only the uppermost line containing that value. What would be the preferred approach, keeping in mind that my input file is very large, with many columns and rows?
The desired output for the above input.txt would be:
a b c
s t e
f q y
Thank you
Upvotes: 1
Views: 81
Reputation: 126722
As I wrote in my comments, split /\s+/ is very rarely correct, and the solution you have mishandles lines with duplicate fields.
It's also more efficient to replace grep with any from the core List::Util module.
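To illustrate the point about split, here is a minimal sketch comparing the two forms on a made-up line that starts with whitespace. The sample string and variable names are assumptions for the demonstration, not part of the original data.

```perl
use strict;
use warnings;

# Hypothetical sample line with leading whitespace and a trailing newline
my $line = "  a b\tc\n";

# split /\s+/ produces an empty leading field when the line starts with whitespace
my @regex_fields = split /\s+/, $line;

# split ' ' (which is what a bare split does on $_) skips leading whitespace
my @plain_fields = split ' ', $line;

printf "%d fields with /\\s+/, %d fields with ' '\n",
    scalar @regex_fields, scalar @plain_fields;
```

With split /\s+/ the first element is an empty string, so code that looks at $cols[0] would be testing the wrong value for such a line.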
I suggest that you store the fields of each line in a hash %cols, like this:
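use strict;
use warnings 'all';
use List::Util 'any';

my ( @lines, %seen );

while ( <DATA> ) {
    my %cols = map { $_ => 1 } split;
    # Test first, then mark every field: any() stops at the first match,
    # so incrementing inside its block would leave later fields unrecorded
    my $dup = any { $seen{$_} } keys %cols;
    $seen{$_}++ for keys %cols;
    push @lines, $_ unless $dup;
}

print for @lines;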
__DATA__
a b c
p p p
p q r
s t e
a b c
f q y
r e x
a b c
p p p
s t e
Even this may not be what you want, as the line f q y is omitted because q has already been "seen" in the omitted line p q r. You will have to clarify the required behaviour in this situation.
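If the intended rule turns out to be that only values from lines that were actually kept should count, one possible variant looks like the sketch below. The helper name filter_kept_only and the choice to pass lines as a list are assumptions made for illustration; the asker has not confirmed that this is the required behaviour.

```perl
use strict;
use warnings;
use List::Util 'any';

# Hypothetical helper: keep a line only if none of its fields occurred in an
# earlier *kept* line. Fields of rejected lines are deliberately not recorded.
sub filter_kept_only {
    my @input = @_;
    my ( @kept, %seen );
    for my $line (@input) {
        # Collapse duplicate fields within the line before testing
        my %cols = map { $_ => 1 } split ' ', $line;
        next if any { $seen{$_} } keys %cols;
        $seen{$_} = 1 for keys %cols;
        push @kept, $line;
    }
    return @kept;
}

print filter_kept_only(
    "a b c\n", "p p p\n", "p q r\n", "s t e\n", "f q y\n", "r e x\n",
);
```

Under this rule f q y survives, because q only ever appeared in the rejected line p q r.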
Upvotes: 2
Reputation: 5730
You also need to iterate over @cols and examine every item instead of just the first one, $cols[0].
You need something like
unless ($seen{$cols[0]}++ || $seen{$cols[1]}++ || $seen{$cols[2]}++ ...) {
    push @lines, $_;
}
Of course, that would be bad style, and impossible if you don't know the number of columns in advance.
I would do it with grep:
my %seen;
my @lines;
while (<DATA>) {
    my @cols = split /\s+/;
    unless ( grep { $seen{$_}++ } @cols ) {
        push @lines, $_;
    }
}
print @lines;
__DATA__
a b c
s t e
a b c
f q y
r e x
Output:
a b c
s t e
f q y
grep processes the code between the curlies, { $seen{$_}++ }, for each element in the list @cols and returns (in scalar context) the number of items that evaluated to true.
It's not the fastest approach because it always iterates over the whole array (even if the first evaluation would be sufficient for your particular test). But give it a try; perhaps it's fast enough for you.
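Since the question mentions a very large input, here is a rough sketch of the same grep idea that prints each kept line as soon as it is read, rather than collecting everything in @lines first. The helper name dedupe_stream and its filehandle arguments are assumptions for illustration. Note that %seen still grows with the number of distinct values, so this only saves the memory used for the stored lines.

```perl
use strict;
use warnings;

# Hypothetical helper: read lines from $in and print each kept line to $out
# immediately instead of accumulating them in an array.
sub dedupe_stream {
    my ( $in, $out ) = @_;
    my %seen;
    while ( my $line = <$in> ) {
        # Collapse duplicate fields within the line so that a line such as
        # "p p p" is not rejected because of its own repeated value
        my %cols = map { $_ => 1 } split ' ', $line;
        # grep visits every field, so all of them are marked as seen
        my $dups = grep { $seen{$_}++ } keys %cols;
        print {$out} $line unless $dups;
    }
    return;
}
```

It could be called as dedupe_stream( \*STDIN, \*STDOUT ) to filter standard input to standard output.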
Upvotes: 4