Reputation: 2157
I'm doing some filtering on a text file with multiple columns in perl
The file has the following format:
C1 C2 C3 C4
1 .. .. ..
2 .. .. ..
3 .. .. ..
3 .. .. ..
3 .. .. ..
I want to delete all the lines that have a unique value in column 1. So the output should be this:
C1 C2 C3 C4
3 .. .. ..
3 .. .. ..
3 .. .. ..
I'm doing different filtering steps with this file. This is the script with which I'm working
my $ DATA
my $filename = $ARGV[0];
unless ($filename) {
print "Enter filename:\n";
$filename = <STDIN>;
chomp $filename;
}
open($DATA,'<',$filename) or die "Could not open file $filename $!";
open($OUT,'+>',"processed.txt") or die "Can't write new file: $!";
while(<$DATA>){
next if /^\s*#/;
print $OUT $_;
}
close $OUT;
As you can see, I'm working in a while loop in which I already used the next command to remove the comment lines from the file. Now I want to add to this loop the command to remove all the lines with a unique value in column 1.
Could someone help me with this?
Upvotes: 0
Views: 2359
Reputation: 38745
Mostly stolen from ikegami and mattan:
print "header: ", scalar(<>);
print "multis: \n";
my %seen;
while (<>) {
next if /^\s*#/;
my ($id) = /^(\S+)/;
++$seen{$id}{count};
if (1 == $seen{$id}{count}) {
# store first occurrence
$seen{$id}{line} = $_;
} elsif (2 == $seen{$id}{count}) {
# print first & second occurrence
print $seen{$id}{line};
print $_;
} else {
# print Third ... occurrence
print $_;
}
}
but keeps order and uses just one loop.
Later:
After thinking twice about
yes, they [the lines] should stay the same as they are now, which is in numerical order [of ids]
I can give back the solen goods:
print "header: ", scalar(<>);
print "multis: \n";
my $ol = scalar(<>); # first/old line
my $oi = 0 + (split(" ", $ol, 2))[0]; # first/old id
my $bf = -1; # assume old line must be printed
do {
my $cl = scalar(<>); # current line
my $ci = 0 + (split(" ", $cl, 2))[0]; # current id
if ($oi != $ci) { # old and current id differ
$oi = $ci; # remember current/first line of current id
$ol = $cl; # current id becomes old
$bf = -1; # assume first/old line must be printed
} else { # old and current id are equal
if ($bf) { # first/old line of current id must be printed
print $ol; # do it
$bf = 0; # but not again
}
print $cl; # print current line for same id
}
} while (! eof());
Upvotes: 2
Reputation: 753
my %id_count;
while(my $line = <$DATA>){
next if $line =~ /^\s*#/;
my ($id) = split(/\s+/,$line,1);
$id_count{$id}{lines} .= $line;
$id_count{$id}{counter}++;
}
print $OUT join("",map { $id_count{$_}{lines} } grep { $id_count{$_}{counter} ne "1" } keys %id_count);
Edit:
If you want to keep the lines sorted, just add a sort
before the grep
in the last line.
Upvotes: 1
Reputation: 126722
This is done neatly with Tie::File
, which allows you to map an array to a text file so that removing elements from the array also removes lines from the file.
This program takes two passes through the file: the first one to count the number of occcurrences of each value of the first field, and the second to delete the lines where that field is unique in the file.
use strict;
use warnings;
use Tie::File;
tie my @file, 'Tie::File', 'textfile.txt' or die $!;
my %index;
for (@file) {
$index{$1}++ if /^(\d+)/;
}
for (my $i = 1; $i < @file; ++$i) {
if ( $file[$i] =~ /^(\d+)/ and $index{$1} == 1 ) {
splice @file, $i, 1;
--$i;
}
}
Upvotes: 2
Reputation: 385857
First, let's get rid of the extraneous stuff from your program.
while (<>) {
next if /^\s*#/;
print;
}
Ok, it looks like you didn't even extra the value of the first column.
my ($id) = /^(\S+)/;
We don't know if there's going to be duplicates before reading on, so we need to store lines for later use.
push @{ $by_id{$id} }, $_;
Once we've read through the file, we print out the lines for ids with more than one line.
for my $id (keys(%by_id)) {
print @{ $by_id{$id} } if @{ $by_id{$id} } > 1;
}
Finally, you failed to handle the header, which can be done using
print scalar(<>);
Altogether, we get
print scalar(<>);
my %by_id;
while (<>) {
next if /^\s*#/;
my ($id) = /^(\S+)/;
push @{ $by_id{$id} }, $_;
}
for my $id (sort { $a <=> $b } keys(%by_id)) {
print @{ $by_id{$id} } if @{ $by_id{$id} } > 1;
}
Usage:
script.pl file.in >processed.txt
Upvotes: 0