Noor

Reputation: 19

How to remove duplicate lines using a Perl script

How to remove duplicate lines?

My current code:

use strict;
use warnings;
my $input = "input.txt";
my $output = "output.txt";
my %seen;

open("OP",">$output") or die;
open("IP","<$input") or die;

while(my $string = <IP>) {
    my @arr1 = join("",$string);
    my @arr2 = grep { !$seen{$_}++ } @arr1;
    print "@arr2\n";
    print OP "@arr2\n";
}

close("IP");
close("OP");

Input:

india
australia
america
singapore
india
america

Expected output:

india
australia
america
singapore

Upvotes: 0

Views: 652

Answers (4)

Timur Shtatland

Reputation: 12347

Use this Perl one-liner to delete all duplicates, whether adjacent or not:

perl -ne 'print unless $seen{$_}++;' input.txt > output.txt

To delete only adjacent duplicates (as the UNIX uniq command does):

perl -ne 'print unless $_ eq $prev; $prev = $_; ' input.txt > output.txt

The Perl one-liners use these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.

When a line is seen for the first time, $seen{$_} is evaluated before the increment and is false, so the line is printed. The post-increment then raises $seen{$_} to 1, which makes it true every time the same line is seen again, so that line is not printed any more.

The first one-liner avoids reading the entire file into memory all at once, which could be important for inputs with lots of long duplicated lines. Only the first occurrence of every line is stored in memory, together with its number of occurrences.
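The post-increment behaviour described above can be checked in a small script. This is only a minimal sketch: the helper name unique_lines is hypothetical, and it takes a list in memory rather than reading a file, so that the counts kept in the hash can be inspected afterwards.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first occurrence of each line, in input order,
# plus a reference to the %seen hash so the per-line counts can be inspected.
sub unique_lines {
    my @lines = @_;
    my %seen;
    # $seen{$_}++ yields the old count (false on first sight), then increments it;
    # negating that value keeps only the first occurrence of each line.
    my @unique = grep { !$seen{$_}++ } @lines;
    return (\@unique, \%seen);
}

my ($unique, $seen) = unique_lines(qw(india australia america singapore india america));
print "$_\n" for @$unique;
print "india was seen $seen->{india} times\n";
```

With the sample input from the question, this prints the four distinct country names followed by "india was seen 2 times".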

Upvotes: 4

Dave Cross

Reputation: 69244

You are making this all far too complicated. The main section of your code can be simplified to:

while (<IP>) {
  print unless $seen{$_}++;
}

Or even:

print grep { ! $seen{$_}++ } <IP>;
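For completeness, here is one way the whole script might look with lexical filehandles and three-argument open. It is a sketch rather than a definitive version: the helper name dedupe_handle is hypothetical, and the demonstration uses in-memory filehandles so that it runs without any files on disk.

```perl
use strict;
use warnings;

# Hypothetical helper: copy lines from $in to $out, skipping any line already seen.
# It reads one line at a time, so the whole input is never held in memory.
sub dedupe_handle {
    my ($in, $out) = @_;
    my %seen;
    while (my $line = <$in>) {
        print {$out} $line unless $seen{$line}++;
    }
    return;
}

# Demonstration with in-memory filehandles; for real files, open 'input.txt'
# with '<' and 'output.txt' with '>' in the same three-argument form.
my $input_text = "india\naustralia\namerica\nsingapore\nindia\namerica\n";
open(my $in,  '<', \$input_text)     or die "Cannot open input: $!";
open(my $out, '>', \my $output_text) or die "Cannot open output: $!";
dedupe_handle($in, $out);
close($in);
close($out);
print $output_text;
```

Because the lines are compared with their newline still attached, a final line that lacks a trailing newline would not match an earlier copy of the same text; chomp the lines first if that matters for your input.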

Upvotes: 1

vkk05

Reputation: 3222

I removed the unneeded lines of code from your script.

Here is the updated script:

use strict; use warnings;
use Data::Dumper;

my %seen;

my @lines = <DATA>;
chomp @lines;

my @countries = grep { !$seen{$_}++ } @lines;
print Dumper(\@countries);

__DATA__
india
australia
america
singapore
india
america

Result:

$VAR1 = [
          'india',
          'australia',
          'america',
          'singapore'
        ];

Upvotes: 2

Polar Bear

Reputation: 6798

Please look at the following code snippet; you were very close to using the %seen hash correctly.

use strict;
use warnings;
use feature 'say';

my %seen;
my @uniq;

while( <DATA> ) {
    chomp;
    push @uniq, $_ unless $seen{$_};
    $seen{$_} = 1;
}

say for @uniq;

__DATA__
india
australia
america
singapore
india
america

Output:

india
australia
america
singapore

Upvotes: 2
