karate_kid

Reputation: 155

Pushing unique elements read from a file into an array using a regex (Perl)

Here is my file:

  heaven
  heavenly
  heavenns
  abc
  heavenns
  heavennly

With this code, I expect only heavenns and heavennly to be pushed into @myarr, and each should appear in the array only once. How can I do that?

my $regx = "heavenn\+";
my $tmp=$regx;

$tmp=~ s/[\\]//g;

$regx=$tmp;
print("\nNow regex:", $regx);

my $file  = "myfilename.txt";

my @myarr;
open my $fh, "<", $file;  
while ( my $line = <$fh> ) {
 if ($line =~ /$regx/){
    print $line;
push (@myarr,$line);
}
}

print ("\nMylist:", @myarr); #printing 2 times heavenns and heavennly

Upvotes: 1

Views: 1497

Answers (3)

Jonathan Leffler

Reputation: 755074

This is Perl, so There's More Than One Way To Do It (TMTOWTDI). Here's one of them:

#!/usr/bin/env perl
use strict;
use warnings;

my $regex = "heavenn+";
my $rx = qr/$regex/;
print "Regex: $regex\n";

my $file  = "myfilename.txt";
my %list;
my @myarr;
open my $fh, "<", $file or die "Failed to open $file: $!";

while ( my $line = <$fh> )
{
    if ($line =~ $rx)
    {
        print $line;
        $list{$line}++;
    }
}

push @myarr, sort keys %list;

print "Mylist: @myarr\n";

Sample output:

Regex: heavenn+
heavenns
heavenns
heavennly
Mylist: heavennly
 heavenns

The sort isn't necessary (but it presents the data in a sane order). You could add items to the array when the count in $list{$line} is 0. You could chomp the input lines to remove the newline. Etc.
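The two suggestions above (push when the count is still 0, and chomp the input) could be combined roughly as in the sketch below. It keeps the words in first-seen order, so no sort is needed. The in-memory filehandle and the sample data are assumptions made so that the snippet runs on its own; with a real file you would open $file as in the script above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Assumption: sample data held in a string so the sketch is self-contained.
my $data = join "\n", qw(heaven heavenly heavenns abc heavenns heavennly), "";
my $rx   = qr/heavenn+/;

my %list;
my @myarr;
open my $fh, "<", \$data or die "Failed to open in-memory file: $!";

while ( my $line = <$fh> )
{
    chomp $line;                  # drop the trailing newline
    next unless $line =~ $rx;     # skip lines that do not match
    # Post-increment returns the old count, which is 0 (false)
    # only the first time a given line is seen.
    push @myarr, $line unless $list{$line}++;
}
close $fh;

print "Mylist: @myarr\n";
```

With this data, @myarr should end up holding heavenns and heavennly once each, while %list still records how many times each was seen.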


What if I want to push only particular words? For example, if my file is 1. "heavenns hello" 2. "heavenns hi", "3.heavennly good". What should I do to print only 'heavenns' and 'heavennly'?

Then you have to arrange to capture the word only. That means refining the regex. Assuming you want heavenn at the start of the word and don't mind what alphabetic characters come after that, then:

#!/usr/bin/env perl
use strict;
use warnings;

my $regex = '\b(heavenn[A-Za-z]*)\b';  # Single quotes necessary!
my $rx = qr/$regex/;
print "Regex: $regex\n";

my $file  = "myfilename.txt";
my %list;
my @myarr;
open my $fh, "<", $file or die "Failed to open $file: $!";

while ( my $line = <$fh> )
{
    if ($line =~ $rx)
    {
        print $line;
        $list{$1}++;
    }
}

push @myarr, sort keys %list;

print "Mylist: @myarr\n";

Data file:

1. "heavenns hello"
2. "heavenns hi",
"3.heavennly good". What to d
heaven
heavenly
heavenns
abc
heavenns
heavennly

Output:

Regex: \b(heavenn[A-Za-z]*)\b
1. "heavenns hello"
2. "heavenns hi",
"3.heavennly good". What to d
heavenns
heavenns
heavennly
Mylist: heavennly heavenns

Note that the names in the list no longer include newlines.


After a chat

This version takes a regex from the command line. The script invocation is:

perl script.pl -p 'regex' [file ...]

It reads from standard input if no file is specified on the command line, which is far better than having a fixed input file name. It looks for every occurrence of the specified regex on each line, where the regex may be preceded by, followed by, or surrounded by 'word characters' as matched by \w.

#!/usr/bin/env perl
use strict;
use warnings;
use Getopt::Std;

my %opts;
getopts('p:', \%opts) or die "Usage: $0 [-p 'regex'] [file ...]\n";

my $regex_base = 'heavenn';
#$regex_base = $ARGV[0] if defined $ARGV[0];
$regex_base = $opts{p} if defined $opts{p};

my $regex = '\b(\w*' . ${regex_base} . '\w*)\b';
my $rx = qr/$regex/;
print "Regex: $regex (compiled form: $rx)\n";

my %list;
my @myarr;

while (my $line = <>)
{
    while ($line =~ m/$rx/g)
    {
        print $line;
        $list{$1}++;
        #$line =~ s///;
    }
}

push @myarr, sort keys %list;

print "Matched words: @myarr\n";

Given the input file:

1. "heavenns hello"
2. "heavenns hi",
"3.heavennly good". What to d
Good heavennsy! What a heavennnly output from an equally heavennnnly input!
An unheavenly host.  Good heavens! It heaves to like a yacht!
heaven
Is it heavens
heavenly
heavenns
abc
heavenns
heavennly

You can get outputs such as:

$ perl script.pl -p 'e\w*?ly' myfilename.txt
Regex: \b(\w*e\w*?ly\w*)\b (compiled form: (?^:\b(\w*e\w*?ly\w*)\b))
"3.heavennly good". What to d
Good heavennsy! What a heavennnly output from an equally heavennnnly input!
Good heavennsy! What a heavennnly output from an equally heavennnnly input!
Good heavennsy! What a heavennnly output from an equally heavennnnly input!
An unheavenly host.  Good heavens! It heaves to like a yacht!
heavenly
heavennly
Matched words: equally heavenly heavennly heavennnly heavennnnly unheavenly
$ perl script.pl myfilename.txt
Regex: \b(\w*heavenn\w*)\b (compiled form: (?^:\b(\w*heavenn\w*)\b))
1. "heavenns hello"
2. "heavenns hi",
"3.heavennly good". What to d
Good heavennsy! What a heavennnly output from an equally heavennnnly input!
Good heavennsy! What a heavennnly output from an equally heavennnnly input!
Good heavennsy! What a heavennnly output from an equally heavennnnly input!
heavenns
heavenns
heavennly
Matched words: heavennly heavennnly heavennnnly heavenns heavennsy
$
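To see how the \b(\w*...\w*)\b wrapper picks up several words from a single line, here is a minimal sketch on one hard-coded string taken from the input file. It uses a list-context match with /g instead of the while loop, so it is a variation on the script above rather than a copy of it; the variable @words is introduced only for illustration.

```perl
use strict;
use warnings;

my $regex_base = 'heavenn';
# Same wrapper as in the script: optional word characters on either side.
my $rx   = qr/\b(\w*${regex_base}\w*)\b/;
my $line = 'Good heavennsy! What a heavennnly output from an equally heavennnnly input!';

# In list context, a /g match returns every captured group in order.
my @words = $line =~ /$rx/g;

print "Matched words: @words\n";
```

The list-context form collects all captures at once, which is convenient when the line itself does not need to be printed for each match.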

Upvotes: 1

DVK

Reputation: 129559

If you want to push only the first occurrence of a word, you can add the following in your loop, after the regex test:

# Assumes "my %seen;" is declared outside the loop.
next if $seen{$line}++;

More approaches to uniqueness: How do I print unique elements in Perl array?
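As one example of another approach, and assuming a List::Util recent enough to export uniq (version 1.45 or later), the duplicates can be removed after all matching lines have been collected. The sample lines below are made up for illustration and stand in for the chomped contents of the file.

```perl
use strict;
use warnings;
use List::Util qw(uniq);   # assumption: List::Util 1.45 or later

# Stand-in for the chomped lines of the input file.
my @lines = qw(heaven heavenly heavenns abc heavenns heavennly);

# Collect every matching line first, duplicates included.
my @matched = grep { /heavenn/ } @lines;

# uniq keeps the first occurrence of each value and preserves order.
my @myarr = uniq @matched;

print "Mylist: @myarr\n";
```

This trades the %seen hash for a second pass over the matches, which may read more clearly when the whole list is already in memory.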

Upvotes: 0

ikegami

Reputation: 386706

For a given value in $_, !$seen{$_}++ is only true the first time it's executed.

my $regx = qr/heavenn/;

my @matches;
my %seen;
while (<>) {
   chomp;
   push(@matches, $_) if /$regx/ && !$seen{$_}++;
}

Upvotes: 1
