Cool_Coder

Reputation: 5073

Suggestions for improvement in Perl script?

I have created a Perl script that reads a file containing some numbers, one per line. I want to eliminate the duplicates and save the new list to a file. Here is my script:

use strict;

my $arg = "<abs path to>\\list.txt";
open (FH, "$arg") or die "\nError trying to open the file $arg : $!";
print "Opened File : $arg\n";
my $line = "";
my @lines = <FH>;
close FH;
my $temp;
my $count = 0;
my $check = 0;
my @list;
my $flag;

for $line (@lines)
{
    $count += 1;
    $check = $count;
    $flag = 1;
    for my $next (@lines)
    {
        $check -= 1;
        if($check < 0)
        {
            if ($line == $next)
            {
                $flag = 0;
            }
        }
    }

    if($flag == 1)
    {
        push (@list, $line);
    }
}

my $newarg = "<abs path to>\\new_list.txt";
open (FWH, ">>$newarg") or die "\nError trying to open the file $newarg for writing : $!";
my $size = @list;
print FWH "\n\n*** Size = $size ***\n\n";
for my $line (@list)
{
    print FWH "$line";
}

I am a C++ guy trying to learn Perl, so can you please suggest some Perl API that might reduce the size of the script? I want the script to be readable and simple to understand quickly, hence the spacing. Thank you.

Upvotes: 0

Views: 109

Answers (4)

David W.

Reputation: 107030

Whenever you have to keep track of something, think of hash. A hash has several very nice attributes:

  • Only one of that key can exist: Imagine if you stored all of your numbers in a hash keyed by that number. The list of keys contains all of your numbers, and there are no duplicates.
  • Fast Key lookup: Imagine you store your numbers in a hash, again keyed by the number. Did you see that number before? See if that key exists. Fast, and simple.

Here's a quick rework.

#! /usr/bin/env perl
use strict;
use feature qw(say);
use warnings;
use autodie;

Notice I have use warnings as well as use strict. I tell people that use strict can catch about 90% of their errors. Well, use warnings can catch another 9.99% of your errors. Warnings are for things like attempting to print out a variable that is undefined, or for poor syntax stuff that will probably get you in trouble.

use feature qw(say); allows you to use say instead of print. With say, the newline is included, so you don't have to use \n all the time. It doesn't sound like much, but it's nice. use autodie will do things like automatically kill your program if you can't open a file. It turns Perl into a somewhat exception-based language. This way, if you forget to test for something, your program will let you know.

use constant {
    FILE         => '/path/to/file',
    OUTPUT       => '/path/to/output/file',
};

Constants are what you should use when you need something that's constant.

open my $numfile_fh, "<", FILE;  #No need for die
open my $output_fh, ">", OUTPUT;
my %number_hash;
while ( my $number = <$numfile_fh> ) {
    chomp $number;   #Always chomp after you read
    if ( not exists $number_hash{$number} ) {
        $number_hash{$number} = 1;
        say $output_fh "$number";
    }
}
close $numfile_fh;
close $output_fh;

I am reading your file one number at a time, but instead of simply writing it to a file, I check my %number_hash to see if I've already seen that number. If I haven't, I store it in my %number_hash and print it out. The logic could be written this way:

while ( my $number = <$numfile_fh> ) {
    chomp $number;   #Always chomp after you read
    next if exists $number_hash{$number};

    $number_hash{$number} = 1;
    say $output_fh "$number";
}

Some people will say this is a better way to write the loop logic. In this style, you're eliminating the exceptions (numbers that are repeats) and then handling the default case (printing the number read in and saving it in the hash).

Note that none of this is actually changing the order of your list. You read in a number, and as long as it's not a duplicate, you print it in the same order as you read it in. If you want to reorder the numbers, so they're sorted, use two loops:

while ( my $number = <$numfile_fh> ) {
    chomp $number;   #Always chomp after you read
     $number_hash{$number} = 1;
}

for my $number ( sort { $a <=> $b } keys %number_hash ) {
    say $output_fh "$number";
}

Note that I don't bother to test whether or not the number is already in the hash. There's no need for that, because a hash stores each key only once anyway.

Upvotes: 3

Julian Fondren

Reputation: 5619

So you have a file of numbers and you want to remove duplicates from it while preserving order? This is a one-liner in Perl.

perl -ne 'print unless $seen{$_}++' file > newfile

Or:

# saves original in file.bak
perl -i.bak -ne 'print unless $seen{$_}++' file

If you have lines that contain other than a single number, or if you want to print out some stats, or if you want better argument handling, or if you've noticed that this doesn't de-dupe numbers that have differing whitespace, then go ahead and change this appropriately. For instance:

# whitespace/non-numbers tolerant
perl -i.bak -ne 'if (/^\s*(\d+)\s*$/) { print unless $seen{$1}++ } else { print }'

As a script, the key logic is exactly the same:

#! /usr/bin/env perl
use common::sense;
use autodie;

my $silent;
$silent = shift if (@ARGV > 0 and $ARGV[0] eq '-s');
die "usage: $0 [-s] src dest\n" unless @ARGV == 2;

open my $fi, '<', shift;
open my $fo, '>', shift;

my %seen;
while (<$fi>) {
  if (/^\s* (\d+) \s*$/x) {
    print {$fo} $_ unless $seen{$1}++;
    next;
  }
  print {$fo} $_;
}

unless ($silent) {
  say '-- de-dup stats --';
  say '-- $count $number --';
  for (sort { $a <=> $b } keys %seen) {
    say "$seen{$_} $_";
  }
}

EDIT: heh, I didn't even consider the case where the duplicates are all adjacent. Here there's no need for a hash:

perl -ne 'print unless $_ == $last; $last = $_' file > newfile

Upvotes: 4

choroba

Reputation: 241748

Not much to add to your coding style, just read the comments:

my $arg = "<abs path to>\\list.txt";

# Use lexical file handles and 3 argument form of open:
open my $FH, '<', $arg or die "\nError trying to open the file $arg : $!";
print "Opened File : $arg\n";

my @lines = <$FH>;
close $FH;

# Define each variable in the tightest scope possible.
my $count = 0;
my @list;

for my $line (@lines)
{
    $count += 1;
    my $check = $count;
    my $flag = 1;
    for my $next (@lines)
    {
        $check -= 1;
        if($check < 0)
        {
            if ($line == $next)
            {
                $flag = 0;
            }
        }
    }

    if ($flag == 1)
    {
        push @list, $line;
    }
}

my $newarg = "<abs path to>\\new_list.txt";
open my $FWH, '>>', $newarg or die "\nError trying to open the file $newarg for writing : $!";
my $size = @list;
print $FWH "\n\n*** Size = $size ***\n\n";
for my $line (@list)
{
    # Double quotes not needed if there is nothing to interpolate.
    print $FWH $line;
}
# You forgot to close the file. For output files, this is important.
close $FWH or die "\nCannot close $newarg: $!";

Nevertheless, this is how I would implement your algorithm:

#!/usr/bin/perl
use warnings;
use strict;

my $input_file  = 'PATH/TO/FILE.TXT';
my $output_file = "$input_file.out";

open my $IN,  '<', $input_file  or die "Cannot open $input_file: $!\n";
open my $OUT, '>', $output_file or die "Cannot open $output_file: $!\n";

my $previous = 'inf';
while (my $line = <$IN>) {
    print $OUT $line if $previous != $line;
    $previous = $line;
}

close $OUT;

Upvotes: 4

Vijay

Reputation: 67211

Why can't you simply use some other tool like awk:

awk '!_[$0]++' your_file
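
Here _ is an associative array keyed by the whole line ($0); the pattern is true only the first time a line is seen, and awk's default action is to print the line. On a made-up input:

```shell
printf '3\n1\n3\n2\n' | awk '!_[$0]++'
# prints 3, 1, 2
```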

You also have a utility in Perl for getting the unique elements of an array:

use List::MoreUtils qw/ uniq /;
my @unique = uniq @lines;

If you do not want to use the above utility, you can always do:

my %seen;
my @unique = grep { ! $seen{$_}++ } @lines;

Or you can simply use the function below to get the unique elements (note that it does not preserve the original order):

sub uniq {
    return keys %{{ map { $_ => 1 } @_ }};
}

Call the above as: my @unique = uniq(@myarray);

Upvotes: 2
