Reputation: 5073
I have created a Perl script for reading a file that contains some numbers, one below the other. I want to eliminate the duplicates & save the new list to a file. Here is my script:
use strict;
my $arg = "<abs path to>\\list.txt";
open (FH, "$arg") or die "\nError trying to open the file $arg : $!";
print "Opened File : $arg\n";
my $line = "";
my @lines = <FH>;
close FH;
my $temp;
my $count = 0;
my $check = 0;
my @list;
my $flag;
for $line (@lines)
{
$count += 1;
$check = $count;
$flag = 1;
for my $next (@lines)
{
$check -= 1;
if($check < 0)
{
if ($line == $next)
{
$flag = 0;
}
}
}
if($flag == 1)
{
push (@list, $line);
}
}
my $newarg = "<abs path to>\\new_list.txt";
open (FWH, ">>$newarg") or die "\nError trying to open the file $newarg for writing : $!";
my $size = @list;
print FWH "\n\n*** Size = $size ***\n\n";
for my $line (@list)
{
print FWH "$line";
}
I am a C++ guy trying to learn Perl, so can you please suggest some Perl facility that would reduce the size of this script? I want the script to stay readable and easy to understand quickly, hence the spacing. Thank you.
Upvotes: 0
Views: 109
Reputation: 107030
Whenever you have to keep track of something, think of a hash. A hash has several very nice attributes: lookups are fast, and keys are automatically unique.
Here's a quick rework.
#! /usr/bin/env perl
use strict;
use feature qw(say);
use warnings;
use autodie;
Notice I have use warnings as well as use strict. I tell people that use strict can catch about 90% of their errors. Well, use warnings can catch another 9.99% of your errors. Warnings are for things like attempting to print out a variable that is undefined, or for poor syntax that will probably get you in trouble.
use feature qw(say); allows you to use say instead of print. With say, the newline is included, so you don't have to append \n all the time. It doesn't sound like much, but it's nice. use autodie will do things like automatically kill your program if you can't open a file. It turns Perl into a somewhat exception-based language. This way, if you forget to test for something, your program will let you know.
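As a minimal sketch of what autodie changes (the filename here is made up for illustration): a failed open no longer returns false quietly, it throws an exception that you can trap with eval.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;   # open, close, etc. now die on failure

# Hypothetical path that should not exist
my $missing = '/no/such/file.txt';

# With autodie, this open dies instead of returning false,
# so the eval block catches the failure.
my $ok = eval {
    open my $fh, '<', $missing;
    1;
};
print "open failed as expected\n" unless $ok;
```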
use constant {
FILE => '/path/to/file',
OUTPUT => '/path/to/output/file',
};
Constants are what you should use when you need something that's constant.
open my $numfile_fh, "<", FILE; #No need for die
open my $output_fh, ">", OUTPUT;
my %number_hash;
while ( my $number = <$numfile_fh> ) {
chomp $number; #Always chomp after you read
if ( not exists $number_hash{$number} ) {
$number_hash{$number} = 1;
say $output_fh "$number";
}
}
close $numfile_fh;
close $output_fh;
I am reading your file one number at a time, but instead of simply writing it to a file, I check my %number_hash to see if I've already seen that number. If I haven't, I store it in my %number_hash and print it out. The logic could be written this way:
while ( my $number = <$numfile_fh> ) {
chomp $number; #Always chomp after you read
next if exists $number_hash{$number};
$number_hash{$number} = 1;
say $output_fh "$number";
}
Some people will say this is a better way to write the loop logic. In this style, you're eliminating the exceptions (numbers that are repeats) and then handling the default case (printing the number read in and saving it in the hash).
Note that none of this is actually changing the order of your list. You read in a number, and as long as it's not a duplicate, you print it in the same order as you read it in. If you want to reorder the numbers, so they're sorted, use two loops:
while ( my $number = <$numfile_fh> ) {
chomp $number; #Always chomp after you read
$number_hash{$number} = 1;
}
for my $number ( sort { $a <=> $b } keys %number_hash ) {
say $output_fh "$number";
}
Note that I don't bother to test whether or not the number is already in the hash. There's no need for that, because hash keys are unique: storing a duplicate key simply overwrites the existing entry.
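A tiny sketch of that key-uniqueness property: storing the same key twice leaves exactly one key in the hash.

```perl
use strict;
use warnings;

my %h;
$h{42} = 1;
$h{42} = 1;   # same key again: overwrites, does not duplicate

# keys in scalar context gives the number of distinct keys
my $n = keys %h;
print "$n\n";   # prints 1
```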
Upvotes: 3
Reputation: 5619
So you have a file of numbers and you want to remove duplicates from it while preserving order? This is a one-liner in Perl.
perl -ne 'print unless $seen{$_}++' file > newfile
Or:
# saves original in file.bak
perl -i.bak -ne 'print unless $seen{$_}++' file
If you have lines that contain other than a single number, or if you want to print out some stats, or if you want better argument handling, or if you've noticed that this doesn't de-dupe numbers that have differing whitespace, then go ahead and change this appropriately. For instance:
# whitespace/non-numbers tolerant
perl -i.bak -ne 'if (/^\s*(\d+)\s*$/) { print unless $seen{$1}++ } else { print }'
As a script, the key logic is exactly the same:
#! /usr/bin/env perl
use common::sense;
use autodie;
my $silent;
$silent = shift if (@ARGV > 0 and $ARGV[0] eq '-s');
die "usage: $0 [-s] src dest\n" unless @ARGV == 2;
open my $fi, '<', shift;
open my $fo, '>', shift;
my %seen;
while (<$fi>) {
if (/^\s* (\d+) \s*$/x) {
print {$fo} $_ unless $seen{$1}++;
next;
}
print {$fo} $_;
}
unless ($silent) {
say '-- de-dup stats --';
say '-- $count $number --'
}
for (sort { $a <=> $b } keys %seen) {
say "$seen{$_} $_"
}
EDIT: heh, I didn't even consider the case where the duplicates are all adjacent. Here there's no need for a hash:
perl -ne 'print unless $_ == $last; $last = $_' file > newfile
Upvotes: 4
Reputation: 241748
Not much to add to your coding style, just read the comments:
my $arg = "<abs path to>\\list.txt";
# Use lexical file handles and 3 argument form of open:
open my $FH, '<', $arg or die "\nError trying to open the file $arg : $!";
print "Opened File : $arg\n";
my @lines = <$FH>;
close $FH;
# Define each variable in the tightest scope possible.
my $count = 0;
my @list;
for my $line (@lines)
{
$count += 1;
my $check = $count;
my $flag = 1;
for my $next (@lines)
{
$check -= 1;
if($check < 0)
{
if ($line == $next)
{
$flag = 0;
}
}
}
if ($flag == 1)
{
push @list, $line;
}
}
my $newarg = "<abs path to>\\new_list.txt";
open my $FWH, '>>', $newarg or die "\nError trying to open the file $newarg for writing : $!";
my $size = @list;
print $FWH "\n\n*** Size = $size ***\n\n";
for my $line (@list)
{
# Double quotes not needed if there is nothing to interpolate.
print $FWH $line;
}
# You forgot to close the file. For output files, this is important.
close $FWH or die "\nCannot close $newarg: $!";
Nevertheless, this is how I would implement your algorithm:
#!/usr/bin/perl
use warnings;
use strict;
my $input_file = 'PATH/TO/FILE.TXT';
my $output_file = "$input_file.out";
open my $IN, '<', $input_file or die "Cannot open $input_file: $!\n";
open my $OUT, '>', $output_file or die "Cannot open $output_file: $!\n";
my $previous = 'inf';
while (my $line = <$IN>) {
print $OUT $line if $previous != $line;
$previous = $line;
}
close $OUT;
Upvotes: 4
Reputation: 67211
Why not simply use another tool like awk:
awk '!_[$0]++' your_file
Perl also has a utility for getting the unique elements of an array:
use List::MoreUtils qw/ uniq /;
my @unique = uniq @lines;
If you do not want to use the above utility, you can instead do:
my %seen;
my @unique = grep { ! $seen{$_}++ } @lines;
Or you can simply use this function to get the unique elements:
sub uniq {
return keys %{{ map { $_ => 1 } @_ }};
}
Call it as: my @unique = uniq(@myarray);
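A self-contained sketch of that helper in use. One caveat: keys returns hash keys in arbitrary order, so unlike the grep version this uniq does not preserve the input order; sort the result if you need a stable order.

```perl
use strict;
use warnings;

# Deduplicate by building a hash whose keys are the elements,
# then taking the keys back out (order is arbitrary).
sub uniq {
    return keys %{{ map { $_ => 1 } @_ }};
}

my @nums   = (3, 1, 3, 2, 1);
my @unique = uniq(@nums);

# Sort for a predictable result, since uniq's order is arbitrary
my @sorted = sort { $a <=> $b } @unique;
print "@sorted\n";   # prints "1 2 3"
```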
Upvotes: 2