Reputation: 467
I would like to append a string in perl within a loop in a fast way, without having to copy the string for each iteration. I'm looking for something like StringBuilder from Java or C#.
I currently know the following alternatives in mind, in order to do 'a += b'.
I am not interested in copying all string to the other. I need to copy one character per time, or append small strings foreach iteration. I am trying to solve the following problem: compress the input string 'aaabbccc' to '3a2b3c'. So the idea is to iterate over the input string, check how many repeated characters we have, and then append to the output in the compressed way. What is the most efficient to perform this in perl ?
Here is a link to the problem I was trying to solve. I's slightly different though.
Upvotes: 3
Views: 882
Reputation: 40758
For comparsion, I tried to test different versions for solving your actual problem of compressing the string. Here is my test script test.pl
:
use strict;
use warnings;
use Benchmark qw(cmpthese);
use Inline C => './compress_c.c';
my $str_len = 10000;
my @chars = qw(a b c d);
my $str;
$str .= [@chars]->[rand 4] for 1 .. $str_len;
cmpthese(
-1,
{
compress_array => sub { compress_array( $str ) },
compress_regex => sub { compress_regex( $str ) },
compress_str => sub { compress_str( $str ) },
compress_c => sub { compress_c( $str ) },
}
);
# Suggested by @melpomene in the comments
sub compress_regex {
return $_[0] =~ s/([a-z])\1+/($+[0] - $-[0]) . $1/egr;
}
sub compress_array {
my $result = '';
my @chrs = split //, $_[0];
my $prev = $chrs[0];
my $count = 1;
my @result;
for my $i ( 1..$#chrs ) {
my $char = $chrs[$i];
if ( $prev eq $char ) {
$count++;
next if $i < $#chrs;
}
if ( $count > 1) {
push @result, $count, $prev;
}
else {
push @result, $prev;
}
if ( ( $i == $#chrs ) and ( $prev ne $char ) ) {
push @result, $char;
last;
}
$count = 1;
$prev = $char;
}
return join '', @result;
}
sub compress_str {
my $result = '';
my $prev = substr $_[0], 0, 1;
my $count = 1;
my $lastind = (length $_[0]) - 1;
for my $i (1 .. $lastind) {
my $char = substr $_[0], $i, 1;
if ( $prev eq $char ) {
$count++;
next if $i < $lastind;
}
if ( $count > 1) {
$result .= $count;
}
$result .= $prev;
if ( ( $i == $lastind ) and ( $prev ne $char ) ) {
$result .= $char;
last;
}
$count = 1;
$prev = $char;
}
return $result;
}
where compress_c.c
is:
SV *compress_c(SV* str_sv) {
STRLEN len;
char* str = SvPVbyte(str_sv, len);
SV* result = newSV(len);
char *buf = SvPVX(result);
char prev = str[0];
int count = 1;
int j = 0;
int i;
for (i = 1; i < len; i++ )
{
char cur = str[i];
if ( prev == cur ) {
count++;
if ( i < (len - 1) )
continue;
}
if ( count > 1) {
buf[j++] = count + '0'; // assume count is less than 10
}
buf[j++] = prev;
if ( (i == (len - 1)) && (prev != cur) ) buf[j++] = cur;
count = 1;
prev = cur;
}
buf[j] = '\0';
SvPOK_on(result);
SvCUR_set(result, j);
return result;
}
The result of running perl test.pl
:
Rate compress_array compress_str compress_regex compress_c
compress_array 311/s -- -42% -45% -99%
compress_str 533/s 71% -- -6% -98%
compress_regex 570/s 83% 7% -- -98%
compress_c 30632/s 9746% 5644% 5273% --
Which shows that regex version is slightly faster than the string version. However, the C version is the fastest, and it is about 50 times as fast as the regex version.
Note: I tested this on my Ubuntu 16.10 laptop (Intel Core i7-7500U CPU @ 2.70GHz)
Upvotes: 4
Reputation: 467
I've performed the following benchmark in several ways to perform that:
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
my $dna;
$dna .= [qw(G A T C)]->[rand 4] for 1 .. 10000;
sub frequency_concat {
my $result = '';
for my $idx (0 .. length($dna) - 1) {
$result .= substr($dna, $idx, 1);
}
return $result;
}
sub frequency_join {
my $result = '';
for my $idx (0 .. length($dna) - 1) {
$result = join '', $result, substr($dna,$idx,1);
}
return $result;
}
sub frequency_list_push {
my @result = ();
for my $idx (0 .. length($dna) - 1) {
push @result, substr($dna,$idx,1);
}
return join '', @result;
}
sub frequency_list_prealloc {
my @result = (' ' x length($dna));
for my $idx (0 .. length($dna) - 1) {
$result[$idx] = substr($dna,$idx,1);
}
return join '', @result;
}
cmpthese(-1, # Run each for at least 1 second(s) {
concat => \&frequency_concat,
join => \&frequency_join,
list_push => \&frequency_list_push,
list_list_prealloc => \&frequency_list_prealloc
}
);
The results below have shown that the concat (a . b) is the fastest operation. I don't understand why, since this will need to make several copies of the string.
Rate join list_push list_list_prealloc concat
join 213/s -- -38% -41% -74%
list_push 342/s 60% -- -5% -58%
list_list_prealloc 359/s 68% 5% -- -56%
concat 822/s 285% 140% 129% --
Upvotes: 2