WetCheerios

Reputation: 155

Recursive search in Perl?

I'm incredibly new to Perl, and never have been a phenomenal programmer. I have some successful BVA routines for controlling microprocessor functions, but never anything embedded, or multi-faceted. Anyway, my question today is about a problem I cannot get past when trying to figure out how to remove duplicate lines of text from a text file I created.

The file could have several of the same lines of text in it, not sequentially placed, which is problematic as I'm practically comparing the file to itself, line by line. So, if the first and third lines are the same, I'll write the first line to a new file, not the third. But when I compare the third line, I'll write it again since the first line is "forgotten" by my current code. I'm sure there's a simple way to do this, but I have trouble making things simple in code. Here's the code:

my $searchString = pseudo variable "ideally an iterative search through the source file";
my $file2 = "/tmp/cutdown.txt";
my $file3 = "/tmp/output.txt";
my $count = "0";

open (FILE, $file2) || die "Can't open cutdown.txt \n";
open (FILE2, ">$file3") || die "Can't open output.txt \n";
    while (<FILE>) {
        print "$_";
        print "$searchString\n";
        if (($_ =~ /$searchString/) and ($count == "0")) {
            ++$count;
            print FILE2 $_;
        } else {
            print "This isn't working\n";
        }
    }
close (FILE);

close (FILE2);

Excuse the way filehandles and scalars do not match. It is a work in progress... :)

Upvotes: 2

Views: 93

Answers (3)

simbabque

Reputation: 54381

You need two things to do that:

  • a hash to keep track of all the lines you have seen
  • a loop reading the input file

This is a simple implementation, called with an input filename and an output filename.

use strict;
use warnings;

open my $fh_in, '<', $ARGV[0] or die "Could not open file '$ARGV[0]': $!";
open my $fh_out, '>', $ARGV[1] or die "Could not open file '$ARGV[1]': $!";

my %seen;

while (my $line = <$fh_in>) {

    # check if we have already seen this line
    if (not $seen{$line}) {
        print $fh_out $line;
    }

    # remember this line
    $seen{$line}++;
}

To test it, here is the same program reading from the built-in DATA handle instead.

use strict;
use warnings;

my %seen;

while (my $line = <DATA>) {

    # check if we have already seen this line
    if (not $seen{$line}) {
        print $line;
    }

    # remember this line
    $seen{$line}++;
}

__DATA__
foo
bar
asdf
foo
foo
asdfg
hello world

This will print

foo
bar
asdf
asdfg
hello world

Keep in mind that the memory consumption will grow with the file size. It should be fine as long as the text file is smaller than your RAM. A Perl hash's memory consumption grows slightly faster than linearly, but your data structure is very flat.
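If memory does become a concern, one common trick (an addition of mine, not part of the original answer) is to key the hash on a fixed-size digest of each line rather than the line itself, so each unique line costs a bounded amount of memory regardless of its length. MD5 collisions are theoretically possible but negligible for this purpose:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Same dedup loop as above, but keyed on a 16-byte MD5 digest of
# each line instead of the full line text.
my %seen;
while (my $line = <STDIN>) {
    print $line unless $seen{ md5($line) }++;
}
```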

Upvotes: 1

Dave Cross

Reputation: 69314

The secret of checking for uniqueness is to store the lines you have seen in a hash and only print lines that don't exist in the hash.

Updating your code slightly to use more modern practices (three-arg open(), lexical filehandles) we get this:

my $file2 = "/tmp/cutdown.txt";
my $file3 = "/tmp/output.txt";

open my $in_fh,  '<', $file2 or die "Can't open cutdown.txt: $!\n";
open my $out_fh, '>', $file3 or die "Can't open output.txt: $!\n";

my %seen;

while (<$in_fh>) {
  print $out_fh $_ unless $seen{$_}++;
}

But I would write this as a Unix filter. Read from STDIN and write to STDOUT. That way, your program is more flexible. The whole code becomes:

#!/usr/bin/perl

use strict;
use warnings;

my %seen;

while (<>) {
  print unless $seen{$_}++;
}

Assuming this is in a file called my_filter, you would call it as:

$ ./my_filter < /tmp/cutdown.txt > /tmp/output.txt

Update: But this doesn't use your $searchString variable. It's not clear to me what that's for.

Upvotes: 4

Miguel Prz

Reputation: 13792

If your file is not very large, you can store each line read from the input file as a key in a hash variable. And then, print the hash keys, ordered by first appearance. Something like this:

my %lines = ();
my $order = 1;

open my $fhi, "<", $file2 or die "Cannot open file: $!";
while( my $line = <$fhi> ) {
   # record only the first position at which each line appears
   $lines{$line} = $order++ unless exists $lines{$line};
}
close $fhi;

open my $fho, ">", $file3 or die "Cannot open file: $!";

# Sort the keys by their order of first appearance
my @ordered_lines = sort { $lines{$a} <=> $lines{$b} } keys %lines;
for my $key( @ordered_lines ) {
   print $fho $key;
}

close $fho;
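For comparison (my addition, assuming List::Util 1.45 or newer, which ships uniq), the order bookkeeping can be dropped entirely, because uniq already keeps only the first occurrence of each element in its original position. A minimal sketch using the same DATA-handle test style as the answer above:

```perl
use strict;
use warnings;
use List::Util 1.45 qw(uniq);

# uniq() drops repeated elements while preserving the order of
# first appearance, so no %lines / $order bookkeeping is needed.
my @unique = uniq <DATA>;
print @unique;

__DATA__
foo
bar
foo
bar
```

This prints foo and bar once each, in their original order.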

Upvotes: 1
