Amber O

Reputation: 67

Counting occurrences of a word in a string in Perl

I am trying to count the occurrences of "The/the" in a string. Below is the code I tried:

print ("Enter the String.\n");
$inputline = <STDIN>;
chop($inputline);
$regex="\[Tt\]he";
if($inputline ne "")
{

 @splitarr= split(/$regex/,$inputline);
}

$scalar=@splitarr;
print $scalar;

The string is :

Hello the how are you the wanna work on the project but i the u the The

The output that it gives is 7. However with the string :

Hello the how are you the wanna work on the project but i the u the

the output is 5. I suspect my regex. Can anyone help point out what's wrong?

Upvotes: 0

Views: 2006

Answers (4)

Liudvikas Bukys

Reputation: 5870

The following snippet uses a code side-effect to increment a counter, followed by an always-failing match to keep searching. It produces the correct answer for matches that overlap (e.g. "aaaa" contains "aa" 3 times, not 2). The split-based answers don't get that right.

my $i;
my $string;

$i = 0;
$string = "aaaa";
$string =~ /aa(?{$i++})(?!)/;
print "'$string' contains /aa/ x $i (should be 3)\n";

$i = 0;
$string = "Hello the how are you the wanna work on the project but i the u the The";
$string =~ /[tT]he(?{$i++})(?!)/;
print "'$string' contains /[tT]he/ x $i (should be 6)\n";

$i = 0;
$string = "Hello the how are you the wanna work on the project but i the u the";
$string =~ /[tT]he(?{$i++})(?!)/;
print "'$string' contains /[tT]he/ x $i (should be 5)\n";
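As a possible alternative for the overlapping case, the pattern can be wrapped in a zero-width lookahead and counted with a global match in list context. This is only a sketch: count_overlapping is a hypothetical helper name, and it assumes the pattern has no capture groups (captures would add extra elements to the returned list and inflate the count).

```perl
use strict;
use warnings;

# Hypothetical helper: counts matches of $pattern, including overlapping ones.
# The lookahead (?=...) consumes no characters, so the regex engine retries
# at every position in the string instead of skipping past each match.
# Assumes $pattern contains no capture groups.
sub count_overlapping {
    my ($string, $pattern) = @_;
    my $count = () = $string =~ /(?=$pattern)/g;
    return $count;
}

print count_overlapping('aaaa', qr/aa/), "\n";    # prints 3
```

For a pattern such as /[tT]he/, which cannot overlap with itself, this gives the same count as a plain global match.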

Upvotes: 1

choroba

Reputation: 241808

With split, you're counting the substrings between the the's. Use match instead:

#!/usr/bin/perl
use warnings;
use strict;

my $regex = qr/[Tt]he/;

for my $string ('Hello the how are you the wanna work on the project but i the u the The',
                'Hello the how are you the wanna work on the project but i the u the',
                'the theological cathedral'
               ) {
    my $count = () = $string =~ /$regex/g;
    print $count, "\n";

    my @between = split /$regex/, $string;
    print 0 + @between, "\n";

    print join '|', @between;
    print "\n";
}

Note that both methods return the same number for the two inputs you mentioned (and the first one returns 6, not 7).

Upvotes: 1

Borodin

Reputation: 126722

I get the correct number, 6, for the first string.

However, your method is wrong: counting the pieces you get by splitting on the regex pattern gives different values depending on whether the word appears at the end of the string, because split discards trailing empty fields. You should also put word boundaries \b into your regular expression to prevent it from matching part of a word such as theory.

Also, it is unnecessary to escape the square brackets, and you can use the /i modifier to do a case-insensitive match.

Try something like this instead:

use strict;
use warnings;

print 'Enter the String: ';
my $inputline = <>;
chomp $inputline;

my $regex = 'the';

if ( $inputline ne '' ) {
    my @matches = $inputline =~ /\b$regex\b/gi;
    print scalar @matches, " occurrences\n";
}
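To illustrate why counting split pieces is unreliable: by default Perl's split drops trailing empty fields, so a string that ends with the delimiter yields one piece fewer than one that does not. A negative LIMIT keeps those fields, in which case the number of pieces is always one more than the number of delimiters. The helper name count_pieces below is hypothetical, and the sketch requires Perl 5.10 or later for the // operator.

```perl
use strict;
use warnings;

# Hypothetical helper: number of pieces split produces for /[Tt]he/.
# A LIMIT of 0 behaves like an omitted LIMIT (trailing empty fields dropped);
# a negative LIMIT keeps trailing empty fields.
sub count_pieces {
    my ($string, $limit) = @_;
    my @pieces = split /[Tt]he/, $string, $limit // 0;
    return scalar @pieces;
}

print count_pieces('one the two'),         "\n";   # prints 2 (1 delimiter)
print count_pieces('one the two the'),     "\n";   # prints 2 (2 delimiters, trailing field dropped)
print count_pieces('one the two the', -1), "\n";   # prints 3 (2 delimiters, trailing field kept)
```

Matching with /g and counting the matches avoids this bookkeeping altogether.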

Upvotes: 3

Rakholiya Jenish

Reputation: 3223

What you need is the so-called 'countof' idiom, = () =, which counts the number of matches:

my $string = "Hello the how are you the wanna work on the project but i the u the The";
my $count = () = $string =~ /[Tt]he/g;
print $count;

If you want to match only the whole word the or The, add word boundaries:

my $string = "Hello the how are you the wanna work on the project but i the u the The";
my $count = () = $string =~ /\b[Tt]he\b/g;
print $count;
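For readers unfamiliar with = () =, here is a rough sketch of what it does. The match is evaluated in list context because it is assigned to the empty list, and a list assignment evaluated in scalar context returns the number of elements on its right-hand side. The helper name count_matches is hypothetical; it returns the count obtained through an intermediate array alongside the count from the idiom so the two can be compared.

```perl
use strict;
use warnings;

# Hypothetical helper: counts matches two equivalent ways.
sub count_matches {
    my ($string, $regex) = @_;

    # Long form: collect every match in list context, then count them.
    my @matches   = $string =~ /$regex/g;
    my $via_array = scalar @matches;

    # Short form: assigning to () forces list context on the match;
    # the outer scalar assignment then receives the element count.
    my $via_idiom = () = $string =~ /$regex/g;

    return ($via_array, $via_idiom);
}

my ($n1, $n2) = count_matches('the cat and the hat', qr/the/);
print "$n1 $n2\n";    # prints "2 2"
```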

Upvotes: 0
