user1987607

Reputation: 2157

remove unique lines from text file using perl

I'm doing some filtering on a text file with multiple columns in Perl.

The file has the following format:

C1  C2  C3  C4 
1   ..  ..  ..
2   ..  ..  ..
3   ..  ..  ..
3   ..  ..  ..
3   ..  ..  ..

I want to delete all the lines that have a unique value in column 1. So the output should be this:

C1  C2  C3  C4
3   ..  ..  ..
3   ..  ..  ..
3   ..  ..  ..

I'm doing several filtering steps on this file. This is the script I'm working with:

my $filename = $ARGV[0];
unless ($filename) {
    print "Enter filename:\n";
    $filename = <STDIN>;
    chomp $filename;
}
open(my $DATA, '<', $filename) or die "Could not open file $filename: $!";
open(my $OUT, '>', "processed.txt") or die "Can't write new file: $!";

while (<$DATA>) {
    next if /^\s*#/;    # skip comment lines
    print $OUT $_;
}

close $OUT;

As you can see, I'm working in a while loop in which I've already used next to skip the comment lines in the file. Now I want to add to this loop the logic that removes all the lines with a unique value in column 1.

Could someone help me with this?

Upvotes: 0

Views: 2359

Answers (4)

Ekkehard.Horner

Reputation: 38745

Mostly stolen from ikegami and mattan:

print "header: ", scalar(<>);
print "multis: \n";

my %seen;
while (<>) {
   next if /^\s*#/;
   my ($id) = /^(\S+)/;
   ++$seen{$id}{count};
   if (1 == $seen{$id}{count}) {
      # store first occurrence
      $seen{$id}{line} = $_;
   } elsif (2 == $seen{$id}{count}) {
      # print first & second occurrence
      print $seen{$id}{line};
      print $_;
   } else {
      # print Third ... occurrence
      print $_;
   }
}

but keeps order and uses just one loop.
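For a quick check, the code above can be dropped into a script and run on the question's sample data (the file names multis.pl and sample.txt are just placeholders):

```shell
cat > multis.pl <<'EOF'
print "header: ", scalar(<>);
print "multis: \n";

my %seen;
while (<>) {
   next if /^\s*#/;
   my ($id) = /^(\S+)/;
   ++$seen{$id}{count};
   if (1 == $seen{$id}{count}) {
      $seen{$id}{line} = $_;       # store first occurrence
   } elsif (2 == $seen{$id}{count}) {
      print $seen{$id}{line};      # print first & second occurrence
      print $_;
   } else {
      print $_;                    # print third ... occurrence
   }
}
EOF

cat > sample.txt <<'EOF'
C1  C2  C3  C4
1   a   b   c
3   d   e   f
3   g   h   i
EOF

perl multis.pl sample.txt
```

The unique line for id 1 is stored but never printed, so only the header label and the two id-3 lines come out.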

Later:

After thinking twice about

yes, they [the lines] should stay the same as they are now, which is in numerical order [of ids]

I can give back the stolen goods:

print "header: ", scalar(<>);
print "multis: \n";

my $ol = scalar(<>);                      # first/old line
my $oi = 0 + (split(" ", $ol, 2))[0];     # first/old id
my $bf = -1;                              # assume old line must be printed
do {
   my $cl = scalar(<>);                   # current line
   my $ci = 0 + (split(" ", $cl, 2))[0];  # current id
   if ($oi != $ci) {                      # old and current id differ
      $oi = $ci;                          #   remember current/first line of current id
      $ol = $cl;                          #   current id becomes old
      $bf = -1;                           #   assume first/old line must be printed
   } else {                               # old and current id are equal
      if ($bf) {                          #    first/old line of current id must be printed
        print $ol;                        #      do it
        $bf = 0;                          #      but not again
      }
      print $cl;                          #    print current line for same id
   }
} while (! eof());

Upvotes: 2

Mattan

Reputation: 753

my %id_count;
while(my $line = <$DATA>){
    next if $line =~ /^\s*#/; 
    my ($id) = split(/\s+/,$line,2);
    $id_count{$id}{lines} .= $line;
    $id_count{$id}{counter}++;
}

print $OUT join("",
    map  { $id_count{$_}{lines} }
    grep { $id_count{$_}{counter} > 1 } keys %id_count);

Edit: If you want to keep the lines sorted, just add a sort before the grep in the last line.
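Assuming numeric ids, a self-contained sketch of this approach with the sort in place might look like this (file names sample.txt and multis_sorted.pl are assumptions; the split limit is 2 so the id really is the first whitespace-separated field):

```shell
cat > sample.txt <<'EOF'
C1  C2  C3  C4
3   d   e   f
1   a   b   c
3   g   h   i
2   x   y   z
2   q   r   s
EOF

cat > multis_sorted.pl <<'EOF'
use strict;
use warnings;

my %id_count;
while (my $line = <>) {
    next if $. == 1;            # skip the header
    next if $line =~ /^\s*#/;   # skip comments
    my ($id) = split(/\s+/, $line, 2);
    $id_count{$id}{lines} .= $line;
    $id_count{$id}{counter}++;
}

# sort sits between map and grep, so surviving ids come out in numeric order
print join("", map  { $id_count{$_}{lines} }
               sort { $a <=> $b }
               grep { $id_count{$_}{counter} > 1 } keys %id_count);
EOF

perl multis_sorted.pl sample.txt
```

With two duplicated ids (2 and 3) interleaved in the input, the id-2 lines print before the id-3 lines, showing the effect of the sort.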

Upvotes: 1

Borodin

Reputation: 126722

This is done neatly with Tie::File, which allows you to map an array to a text file so that removing elements from the array also removes lines from the file.

This program takes two passes through the file: the first to count the number of occurrences of each value in the first field, and the second to delete the lines where that field's value is unique in the file.

use strict;
use warnings;

use Tie::File;

tie my @file, 'Tie::File', 'textfile.txt' or die $!;

my %index;

for (@file) {
  $index{$1}++ if /^(\d+)/;
}

for (my $i = 1; $i < @file; ++$i) {
  if ( $file[$i] =~ /^(\d+)/ and $index{$1} == 1 ) {
    splice @file, $i, 1;
    --$i;
  }
}

Upvotes: 2

ikegami

Reputation: 385857

First, let's get rid of the extraneous stuff from your program.

while (<>) {
   next if /^\s*#/; 
   print;
}

OK, it looks like you haven't even extracted the value of the first column yet.

my ($id) = /^(\S+)/;

We don't know whether there will be duplicates until we've read further on, so we need to store the lines for later use.

push @{ $by_id{$id} }, $_;

Once we've read through the file, we print out the lines for ids with more than one line.

for my $id (keys(%by_id)) {
    print @{ $by_id{$id} } if @{ $by_id{$id} } > 1;
}

Finally, you failed to handle the header, which can be done using

print scalar(<>);

Altogether, we get

print scalar(<>);

my %by_id;
while (<>) {
   next if /^\s*#/; 
   my ($id) = /^(\S+)/;
   push @{ $by_id{$id} }, $_;
}

for my $id (sort { $a <=> $b } keys(%by_id)) {
    print @{ $by_id{$id} } if @{ $by_id{$id} } > 1;
}

Usage:

script.pl file.in >processed.txt
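Putting the pieces above into a file and running them on the question's sample data (using the file names from the usage line) gives a self-contained check:

```shell
cat > script.pl <<'EOF'
print scalar(<>);    # pass the header through

my %by_id;
while (<>) {
   next if /^\s*#/;
   my ($id) = /^(\S+)/;
   push @{ $by_id{$id} }, $_;
}

for my $id (sort { $a <=> $b } keys(%by_id)) {
    print @{ $by_id{$id} } if @{ $by_id{$id} } > 1;
}
EOF

cat > file.in <<'EOF'
C1  C2  C3  C4
1   a   b   c
2   a   b   c
3   a   b   c
3   d   e   f
3   g   h   i
EOF

perl script.pl file.in >processed.txt
cat processed.txt
```

processed.txt ends up holding the header plus the three id-3 lines, matching the desired output in the question.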

Upvotes: 0
