Thomas Anowez

Reputation: 91

Remove multiple duplicate lines from a file

I have a Perl script, run from crontab, that generates a file full of duplicate entries, because on each run it rewrites information it has already written.

I would use sort -u on the file, but I would like to do it at the end of the Perl script itself.

My list

10/10/2017 00:01:39:000;Sagitter
10/11/2017 00:00:01:002;Lupus
10/12/2017 00:03:14:109;Leon
10/12/2017 00:09:00:459;Sagitter
10/13/2017 01:11:03:009;Lupus
12/13/2017 04:29:00:609;Ariet
10/11/2017 00:00:01:002;Lupus
10/12/2017 00:03:14:109;Leon
...

My code

#!/usr/bin/perl

# Libraries
use strict;
use warnings 'all';

%lines = ();

# Remove duplicate

open( TMP_GL_OUTPUT, '>', $OUTPUT_FILE ) or die $!;

while ( <TMP_GL_OUTPUT> ) {
    $lines{$_}++;
}

open( OUTFILE, '>', $TMPOUTPUT_FILE ) or die $!;
print OUTFILE keys %lines;
close( OUTFILE );

close( TMP_GL_OUTPUT );

Where am I going wrong? Doing this in the shell feels shorter than in Perl.

sort -u $TMPOUTPUT_FILE > $OUTPUT_FILE 

As suggested by user ikegami, I've done the following:

use File::Copy qw( move );
use IPC::Run  qw( run );

move $OUTPUT_FILE, $TMPOUTPUT_FILE;                             # Move file aside
run [ 'sort', '-u', '--', $TMPOUTPUT_FILE ], '>', $OUTPUT_FILE; # Remove duplicates
unlink $TMPOUTPUT_FILE;

Upvotes: 2

Views: 4446

Answers (3)

Valdi_Bo

Reputation: 30971

Your code looks almost OK.

My only suggestion is to chomp each line before you store it as a key in the hash.

The reason is that the last line, if not terminated with a \n, may look just the same as one of the previous lines; without chomp, however, the earlier line would still contain the terminating \n, whereas the last one would not.

The result is that these two lines would become different keys in the hash.

Compare my example program (working, presented below) with yours: there are no other significant differences, apart from reading from __DATA__ and writing to the console.

In my program, for demonstration purposes, I included two variants of the printout: one with the key values (repetition counts) and another printing just the keys. In your program, keep only the second printout.

use strict; use warnings; use feature qw(say);

my %lines;
while(<DATA>) {
    chomp;
    $lines{$_}++;
}
while(my($key, $val) = each %lines) {
    printf "%-32s / %d\n", $key, $val;
}
say '========';
foreach my $key (keys %lines) {
    say $key;
}
__DATA__
10/10/2017 00:01:39:000;Sagitter
10/11/2017 00:00:01:002;Lupus
10/12/2017 00:03:14:109;Leon
10/12/2017 00:09:00:459;Sagitter
10/13/2017 01:11:03:009;Lupus
12/13/2017 04:29:00:609;Ariet
10/11/2017 00:00:01:002;Lupus
10/12/2017 00:03:14:109;Leon

Edit

Your code assigns no values to $OUTPUT_FILE and $TMPOUTPUT_FILE; you didn't even declare these variables, but I assume that your actual code does.

Another detail is that %lines should be declared with my; otherwise, since you have use strict;, the compiler reports an error.

Edit 2

There is a quicker and shorter solution than yours.

Instead of storing lines in a hash and only printing them in a second step, you can do it in a single loop (sketched in full after the list below):

  • Read the line.
  • Check whether the hash already contains a key equal to the line just read.
  • If not, then:
    • add the line to the hash as a key, blocking the printout if the same line occurs again,
    • print the line.
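
Put into a full script, that loop might look like this (only a sketch; the $in_file and $out_file names below are illustrative, not taken from the question):

use strict;
use warnings;

my $in_file  = 'input.txt';    # illustrative names, replace with your own paths
my $out_file = 'output.txt';

open my $in,  '<', $in_file  or die "Cannot read '$in_file': $!";
open my $out, '>', $out_file or die "Cannot write '$out_file': $!";

my %seen;
while ( my $line = <$in> ) {
    chomp $line;                                   # so identical lines compare equal
    print {$out} "$line\n" unless $seen{$line}++;  # print only the first occurrence
}

close $in;
close $out;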

You can even write this program as a Perl one-liner:

perl -lne"print if !$lines{$_}++" input.txt

If you run the above command from the Windows cmd, it will print the output to the console. On Linux, use single quotes (apostrophes) instead of double quotes, so that the shell does not interpolate $_ and $lines.

You may of course redirect the output to any file, adding > output.txt to the above command.

The code is executed for each input line, which is chomped thanks to the -l option.

If any other details of Perl one-liners are unfamiliar to you, search the web.
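
Since the question's script rewrites the same file on each cron run, Perl's in-place edit switch -i could also be used (just a sketch; report.txt is a placeholder file name, Linux quoting shown):

perl -i.bak -lne 'print if !$seen{$_}++' report.txt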

Upvotes: 0

elcaro

Reputation: 2297

List::Util is a core module.

use List::Util 'uniq';

print for uniq <>
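
Applied to the question's two files, that could look roughly like this (a sketch; the paths are placeholders, and uniq requires List::Util 1.45 or newer):

use strict;
use warnings;
use List::Util 'uniq';

my $TMPOUTPUT_FILE = 'report.tmp';   # placeholder paths
my $OUTPUT_FILE    = 'report.txt';

open my $in,  '<', $TMPOUTPUT_FILE or die $!;
open my $out, '>', $OUTPUT_FILE    or die $!;
print {$out} uniq <$in>;             # keep the first occurrence of each line
close $in;
close $out;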

Upvotes: 1

ikegami

Reputation: 385655

I think you are asking why your Perl program is longer than your shell script.

First of all, your shell script does something completely different than your Perl program.

  • Your shell script executes a program, and stores its output in a file.
  • Your Perl program reads a file, manipulates the data it read, and stores the output in a file.

The Perl equivalent to

sort -u -- "$TMPOUTPUT_FILE" > "$OUTPUT_FILE"

is

use IPC::Run qw( run );

run [ 'sort', '-u', '--', $TMPOUTPUT_FILE ], '>', $OUTPUT_FILE;

(There are differences in error handling between these two.)

They're not that different in length.

This brings up the second difference. The shell specializes in executing programs, but Perl is a general-purpose language. It would be surprising if it wasn't longer in Perl!

(Now try comparing the size of your Perl program to the source of sort...)
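
Regarding the error-handling difference mentioned above: run returns true only when all the commands exit with status 0, so one way to check for that, as a sketch, could be:

use IPC::Run qw( run );

# run() returns false if sort exits with a non-zero status
run [ 'sort', '-u', '--', $TMPOUTPUT_FILE ], '>', $OUTPUT_FILE
    or die "sort -u failed";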

Upvotes: 6
