Reputation: 39009
I have a few functions that I'm running over a million times on various texts, which means small improvements in these functions translate to big gains overall. Currently, I've noticed that all of my functions which involve word counts take drastically longer to run than everything else, so I'm thinking I want to try doing word count in a different way.
Basically, what my function does is grab a number of objects that have text associated with them, verify that that text doesn't match certain patterns, and then count the number of words in that text. A basic version of the function is:
my $num_words = 0;
for (my $i=$begin_pos; $i<=$end_pos; $i++) {
my $text = $self->_getTextFromNode($i);
#If it looks like a node full of bogus text, or just a number, remove it.
if ($text =~ /^\s*\<.*\>\s*$/ && $begin_pos == $end_pos) { return 0; }
if ($text =~ /^\s*(?:Page\s*\d+)|http/i && $begin_pos == $end_pos) { return 0; }
if ($text =~ /^\s*\d+\s*$/ && $begin_pos == $end_pos) { return 0; }
my @text_words = split(/\s+/, $text);
$num_words += scalar(@text_words);
if ($num_words > 30) { return 30; }
}
return $num_words;
}
I'm doing plenty of text comparisons similar to what I'm doing here elsewhere in my code, so I'm guessing my problem must be with my word counting. Is there a faster way to do it than splitting on \s+
? If so, what is it and why is it faster (so I can understand what I'm doing wrong and can apply that knowledge to similar problems later on).
Upvotes: 7
Views: 13973
Reputation: 1220
If words are seperated only by single spaces, counting spaces is fast.
sub count1
{
my $str = shift;
return 1 + ($str =~ tr{ }{ });
}
updated benchmark:
my $text = 'asdf asdf asdf asdf asdf';
sub count_array {
my @text_words = split(/\s+/, $text);
scalar(@text_words);
}
sub count_list {
my $x =()= $text =~ /\S+/g; #/
}
sub count_while {
my $num;
$num++ while $text =~ /\S+/g; #/
$num
}
sub count_tr {
1 + ($text =~ tr{ }{ });
}
say count_array; # 5
say count_list; # 5
say count_while; # 5
say count_tr; # 5
use Benchmark 'cmpthese';
cmpthese -2 => {
array => \&count_array,
list => \&count_list,
while => \&count_while,
tr => \&count_tr,
}
# Rate list while array tr
# list 220911/s -- -24% -44% -94%
# while 291225/s 32% -- -26% -92%
# array 391769/s 77% 35% -- -89%
# tr 3720197/s 1584% 1177% 850% --
Upvotes: 5
Reputation: 150178
Since you restrict the number of words to 30, you can return from the function earlier:
while ($text =~ /\S+/g) {
++$num_words == 30 && return $num_words;
}
return $num_words;
Or using split
:
$num_words = () = split /\s+/, $text, 30;
Upvotes: 2
Reputation: 276
For correctness, from aleroot's answer, you probably want split " "
, not the original split /\s+/
to avoid a fencepost error:
A "split" on "/\s+/" is like a "split(' ')" except that any leading whitespace produces a null first field.* That difference will give you one extra word (the null first field, that is) per line.
For speed, since you're capping the number of words at 30, you probably want to use the LIMIT
argument*: split " ", $str, 30
.
On the other hand, other answers wisely point you away from split
altogether, since you don't need the list of words, just their count.
Upvotes: 2
Reputation: 40152
Using a while loop with a regex is the fastest way that I have found to count words:
my $text = 'asdf asdf asdf asdf asdf';
sub count_array {
my @text_words = split(/\s+/, $text);
scalar(@text_words);
}
sub count_list {
my $x =()= $text =~ /\S+/g; #/
}
sub count_while {
my $num;
$num++ while $text =~ /\S+/g; #/
$num
}
say count_array; # 5
say count_list; # 5
say count_while; # 5
use Benchmark 'cmpthese';
cmpthese -2 => {
array => \&count_array,
list => \&count_list,
while => \&count_while,
}
# Rate list array while
# list 303674/s -- -22% -55%
# array 389212/s 28% -- -42%
# while 675295/s 122% 74% --
The while loop is faster because memory does not need to be allocated for each of the found words. Also the regex is in boolean context which means it does not need to extract the actual match from the string.
Upvotes: 15
Reputation: 234857
Since you only need the count of words instead of the array of words, it would be good to avoid using split
. Something this might work:
$num_words += $text =~ s/((^|\s)\S)/$1/g;
It replaces the work of building a word array with the work of substituting each word with itself. You'd need to benchmark it to see whether it is faster.
EDIT: This might be faster:
++$num_words while $text =~ /\S+/g;
Upvotes: 1