7akeoverforce

Reputation: 51

How to match multiple items in Perl

my $text = '<span>by <small class="author" itemprop="author">J.K. Rowling</small><span>by <small class="author" itemprop="author">J.K. Rowling</small><span>by <small class="author" itemprop="author">J.K. Rowling</small>';


if ($text =~ m/<span>by <small class="author" itemprop="author">(.+?)<\/small>/ig){
    $author = $1;
    $authorcount{$author} +=1;
}

$authorcounttxt = "authorcount.txt";
open (OUTPUT3, ">$authorcounttxt");
foreach $author (sort { $authorcount{$b} <=> $authorcount{$a} } keys %authorcount){
    print OUTPUT3 ("$author\t\t$authorcount{$author}\n");
}
close (OUTPUT3);

The desired output is:

J.K. Rowling 3

However I am only getting:

J.K. Rowling 1

Upvotes: 1

Views: 100

Answers (3)

Polar Bear

Reputation: 6798

As previous posters have already indicated, the issue is hidden in if ( $text =~ /.../gi ): the condition evaluates to true and the block is executed only once.

You want to process every match rather than just the first. A for loop achieves this by evaluating the match in list context, and a while loop by repeating the scalar-context match until it fails.

The following code snippet demonstrates one of many possible approaches.

use strict;
use warnings;
use feature 'say';

my(%authors, $fname, $text, $re);

$fname = 'authorcount.txt';
$text  = '<span>by <small class="author" itemprop="author">J.K. Rowling</small><span>by <small class="author" itemprop="author">J.K. Rowling</small><span>by <small class="author" itemprop="author">J.K. Rowling</small>';
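# The /i flag goes on qr// itself; flags on the outer match do not apply to an interpolated qr// pattern.
$re    = qr/<span>by <small class="author" itemprop="author">(.*?)<\/small>/i;

# In list context a /g match returns every capture, so count each one via $_ ($1 would only hold the last match).
$authors{$_}++ for $text =~ /$re/g;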

open my $fh, ">", $fname
    or die "Can't open $fname: $!";
    
say $fh "$_ $authors{$_}" for sort keys %authors;

close $fh;

NOTE: this code works for your example $text = '...'. If you intend to process complex HTML files, Mojo::DOM is the right tool for the job.
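For comparison, here is a minimal sketch of the same count done with Mojo::DOM. It assumes the Mojolicious distribution is installed from CPAN; the helper name count_authors and the sample HTML (with the span tags closed) are made up for illustration, not taken from the question.

```perl
use strict;
use warnings;
use Mojo::DOM;   # part of Mojolicious (CPAN), not a core module

# Hypothetical helper: count the text of every <small class="author"> element.
sub count_authors {
    my ($html) = @_;
    my %count;
    # find() takes a CSS selector; map('text') calls ->text on each element found
    $count{$_}++ for Mojo::DOM->new($html)->find('small.author')->map('text')->each;
    return \%count;
}

# Illustrative input: three author blocks, assumed well-formed
my $html  = '<span>by <small class="author" itemprop="author">J.K. Rowling</small></span>' x 3;
my $count = count_authors($html);
print "$_ $count->{$_}\n" for sort keys %$count;
```

The selector replaces the hand-written regex, so changes in attribute order or whitespace inside the tags should not affect the count.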

Upvotes: 1

hartenfels

Reputation: 106

Replace your if with a while to iterate through all of the matches of your regex match instead of only the first one:

while ($text =~ m/<span>by <small class="author" itemprop="author">(.+?)<\/small>/ig){
  $author = $1;
  $authorcount{$author} += 1;
}

Also obligatory note: parsing HTML with regexen is fraught with peril. Consider using a module that can properly parse HTML, Mojo::DOM for example.
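To see why the while loop above visits every match, it may help to look at pos(): in scalar context a /g match resumes where the previous one ended. The sketch below uses a hypothetical helper, match_end_offsets, and a toy string rather than the HTML from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: return the offset at which each scalar-context /g match ended.
sub match_end_offsets {
    my ($text, $re) = @_;
    my @offsets;
    while ($text =~ /$re/g) {
        push @offsets, pos($text);   # pos() is where the next attempt will start
    }
    return @offsets;
}

my @offsets = match_end_offsets('ab ab ab', qr/ab/);
print "@offsets\n";   # prints "2 5 8"
```

An if runs the match once, so only the first of these positions is ever reached.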

Upvotes: 1

Steffen Ullrich

Reputation: 123380

if ($text =~ m/.../ig){
     $author = $1;
     $authorcount{$author} +=1;
}

This is an if statement, which means that the inner block will be entered at most once, i.e. only for the first match. You likely meant to use a while loop, which enters the inner block once for each match:

while ($text =~ m/.../ig){
     $author = $1;
     $authorcount{$author} +=1;
}
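The difference can be checked side by side with a small sketch. The pattern and input here are simplified stand-ins for the ones in the question, and count_with_if and count_with_while are hypothetical names used only for this comparison.

```perl
use strict;
use warnings;

# Simplified pattern, assumed for illustration only
my $re = qr/<small class="author">(.+?)<\/small>/;

# With "if", the block runs at most once.
sub count_with_if {
    my ($text) = @_;
    my %count;
    if ($text =~ /$re/g) { $count{$1} += 1 }
    return \%count;
}

# With "while", the block runs once per match.
sub count_with_while {
    my ($text) = @_;
    my %count;
    while ($text =~ /$re/g) { $count{$1} += 1 }
    return \%count;
}

my $text = '<small class="author">A</small>' x 3;
printf "if: %d, while: %d\n",
    count_with_if($text)->{A}, count_with_while($text)->{A};
# prints "if: 1, while: 3"
```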

Upvotes: 1
