Eli
Eli

Reputation: 39009

What's the fastest way to count the number of words in a string in Perl?

I have a few functions that I'm running over a million times on various texts, which means small improvements in these functions translate to big gains overall. Currently, I've noticed that all of my functions which involve word counts take drastically longer to run than everything else, so I'm thinking I want to try doing word count in a different way.

Basically, what my function does is grab a number of objects that have text associated with them, verify that that text doesn't match certain patterns, and then count the number of words in that text. A basic version of the function is:

my $num_words = 0;
for (my $i=$begin_pos; $i<=$end_pos; $i++) {
   my $text = $self->_getTextFromNode($i);
   #If it looks like a node full of bogus text, or just a number, remove it.
   if ($text =~ /^\s*\<.*\>\s*$/ && $begin_pos == $end_pos) { return 0; }
   if ($text =~ /^\s*(?:Page\s*\d+)|http/i && $begin_pos == $end_pos) { return 0; }
   if ($text =~ /^\s*\d+\s*$/ && $begin_pos == $end_pos) { return 0; }
   my @text_words = split(/\s+/, $text);
   $num_words += scalar(@text_words);
   if ($num_words > 30) { return 30; }
}
return $num_words;
}

I'm doing plenty of text comparisons similar to what I'm doing here elsewhere in my code, so I'm guessing my problem must be with my word counting. Is there a faster way to do it than splitting on \s+? If so, what is it and why is it faster (so I can understand what I'm doing wrong and can apply that knowledge to similar problems later on).

Upvotes: 7

Views: 13973

Answers (5)

hexcoder
hexcoder

Reputation: 1220

If words are seperated only by single spaces, counting spaces is fast.

sub count1
{
    my $str = shift;
    return 1 + ($str =~ tr{ }{ });
}

updated benchmark:

my $text = 'asdf asdf asdf asdf asdf';

sub count_array {
   my @text_words = split(/\s+/, $text);
   scalar(@text_words);
}

sub count_list {
   my $x =()= $text =~ /\S+/g;       #/
}

sub count_while {
   my $num; 
   $num++ while $text =~ /\S+/g;     #/
   $num
}

sub count_tr {
    1 + ($text =~ tr{ }{ });
}

say count_array; # 5
say count_list;  # 5
say count_while; # 5
say count_tr; # 5

use Benchmark 'cmpthese';

cmpthese -2 => {
    array => \&count_array,
    list  => \&count_list,
    while => \&count_while,
    tr    => \&count_tr,
}

#            Rate  list while array    tr
# list   220911/s    --  -24%  -44%  -94%
# while  291225/s   32%    --  -26%  -92%
# array  391769/s   77%   35%    --  -89%
# tr    3720197/s 1584% 1177%  850%    --

Upvotes: 5

Eugene Yarmash
Eugene Yarmash

Reputation: 150178

Since you restrict the number of words to 30, you can return from the function earlier:

while ($text =~ /\S+/g) {
    ++$num_words == 30 && return $num_words;
}    
return $num_words;

Or using split:

$num_words = () = split /\s+/, $text, 30;

Upvotes: 2

DCharness
DCharness

Reputation: 276

For correctness, from aleroot's answer, you probably want split " ", not the original split /\s+/ to avoid a fencepost error: A "split" on "/\s+/" is like a "split(' ')" except that any leading whitespace produces a null first field.* That difference will give you one extra word (the null first field, that is) per line.

For speed, since you're capping the number of words at 30, you probably want to use the LIMIT argument*: split " ", $str, 30.

On the other hand, other answers wisely point you away from split altogether, since you don't need the list of words, just their count.

Upvotes: 2

Eric Strom
Eric Strom

Reputation: 40152

Using a while loop with a regex is the fastest way that I have found to count words:

my $text = 'asdf asdf asdf asdf asdf';

sub count_array {
   my @text_words = split(/\s+/, $text);
   scalar(@text_words);
}

sub count_list {
    my $x =()= $text =~ /\S+/g;       #/
}

sub count_while {
    my $num; 
    $num++ while $text =~ /\S+/g;     #/
    $num
}

say count_array; # 5
say count_list;  # 5
say count_while; # 5

use Benchmark 'cmpthese';

cmpthese -2 => {
    array => \&count_array,
    list  => \&count_list,
    while => \&count_while,
}

#          Rate  list array while
# list  303674/s    --  -22%  -55%
# array 389212/s   28%    --  -42%
# while 675295/s  122%   74%    --

The while loop is faster because memory does not need to be allocated for each of the found words. Also the regex is in boolean context which means it does not need to extract the actual match from the string.

Upvotes: 15

Ted Hopp
Ted Hopp

Reputation: 234857

Since you only need the count of words instead of the array of words, it would be good to avoid using split. Something this might work:

$num_words += $text =~ s/((^|\s)\S)/$1/g;

It replaces the work of building a word array with the work of substituting each word with itself. You'd need to benchmark it to see whether it is faster.

EDIT: This might be faster:

++$num_words while $text =~ /\S+/g;

Upvotes: 1

Related Questions