packetie

Reputation: 5059

Perl performance: why does the clever trick perform worse?

I found a clever trick to preallocate memory for a string. However, the following code snippet performs worse with the trick than without it (that is, with the statement vec($str, 0x100000, 8)=0; commented out).

use Time::HiRes qw( gettimeofday );
my $big = "a" x 100;
my $str = "";
vec($str, 0x100000, 8)=0;
$ts = getTS();
for ($i=0; $i < 1000000; $i ++) {
    $str = "";
    for ($j=0; $j<100; $j++) {
        $str .= $big;
    }
}

printf "took %f secs\n", getTS() - $ts;

sub getTS {
    my ($seconds, $microseconds) = gettimeofday;
    return $seconds + (0.0+ $microseconds)/1000000.0;
}

With the clever trick, it took 9.1 secs. Without the clever trick, it took 7.8 secs.

The clever trick should have been faster because it avoids making so many realloc() calls. Any idea why it isn't?

Upvotes: 0

Views: 138

Answers (3)

ysth

Reputation: 98378

Calling vec() is an extra expensive operation; you have to be saving a whole lot of realloc data-moving to make it worth it. I'm not sure why you have nested loops in your code; any reallocs necessary will only be done in the first run of the inner loop, not later runs of it. My benchmark of your code, adjusted to have vec only allocate the buffer you actually need, shows the vec version as marginally slower:

use strict;
use warnings;
use Benchmark 'cmpthese';

cmpthese( 10, {
    'with_vec' => sub {
        my $big = "a" x 100;
        my $str;
        undef $str; # start with no string buffer for benchmarking purposes
        vec($str, 9999, 8)=0;
        for (my $i=0; $i < 1000000; $i ++) {
            $str = "";
            for (my $j=0; $j<100; $j++) {
                $str .= $big;
            }
        }
    },
    'without_vec' => sub {
        my $big = "a" x 100;
        my $str;
        undef $str; # start with no string buffer for benchmarking purposes
        for (my $i=0; $i < 1000000; $i ++) {
            $str = "";
            for (my $j=0; $j<100; $j++) {
                $str .= $big;
            }
        }
    },
});

Producing:

            s/iter without_vec    with_vec
without_vec   8.43          --         -3%
with_vec      8.15          3%          --

(though occasionally with_vec was faster)

(undef $str forces the code to use a fresh string buffer each time; without that, $str's buffer size expands to its maximum the first time Benchmark runs the code and remains the same thereafter.)
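The buffer retention described above can be observed directly. This is a minimal sketch, assuming the core B module's LEN method reports the allocated size of a scalar's string buffer; buffer_len and buffer_sizes are hypothetical helper names, and the exact sizes printed will vary between Perl builds:

```perl
use strict;
use warnings;
use B qw( svref_2object );

# Allocated size (not the string length) of a scalar's string buffer.
sub buffer_len { my ($ref) = @_; return svref_2object($ref)->LEN }

sub buffer_sizes {
    my $str = "";
    $str .= "a" x 100 for 1 .. 100;     # grow the string to 10,000 characters
    my $grown = buffer_len(\$str);

    $str = "";                          # empties the string; buffer is expected to stay
    my $after_reset = buffer_len(\$str);

    undef $str;                         # expected to release the buffer
    my $after_undef = buffer_len(\$str);

    return ($grown, $after_reset, $after_undef);
}

printf "grown=%d after_reset=%d after_undef=%d\n", buffer_sizes();
```

If after_reset matches grown while after_undef drops to zero, that is consistent with the point that only the first pass of a repeated loop pays for any reallocation.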

Here's an adjusted example where preallocating does make a difference:

cmpthese( -10, {
    'with_vec' => sub {
        my $big = "a" x 1;
        my $str;
        undef $str;
        vec($str, 9999999, 8)=0;
        $str = "";
        for (my $j=0; $j<10000000; $j++) {
            $str .= $big;
        }
    },
    'without_vec' => sub {
        my $big = "a" x 1;
        my $str;
        undef $str;
        $str = "";
        for (my $j=0; $j<10000000; $j++) {
            $str .= $big;
        }
    },
});

Producing:

              Rate    with_vec without_vec
with_vec    1.29/s          --         -3%
without_vec 1.33/s          3%          --

(though results were erratic; a third of the time without_vec was faster).

Upvotes: 1

ikegami

Reputation: 385496

Your test makes no sense. Your vec only has an effect when $i=0 (the first pass of the loop has the same effect as vec for the later passes of the loop), so vec's pre-allocation only makes a difference for 1/1,000,000 of the time your program is executing! That means the 1.3s difference has nothing to do with whether $str's string buffer is pre-allocated or not.

Did you just run each test once? That's not an appropriate way of doing a benchmark! If you run a proper test, you'll see that pre-allocating doesn't help —the gain is so minor it gets lost— but it doesn't hurt either; it simply has no effect.

                Rate  deoptimized     baseline preallocated
deoptimized  78084/s           --          -1%          -1%
baseline     78668/s           1%           --          -0%
preallocated 78928/s           1%           0%           --

Test:

use strict;
use warnings;

use Benchmark qw( cmpthese );

my $big = "a" x 100;

my $preallocated;
vec($preallocated, 0x100000, 8)=0;

cmpthese(-3, {
   deoptimized => sub {
      undef(my $str);
      $str .= $big for 1..100;
   },
   baseline => sub {
      my $str;
      $str .= $big for 1..100;
   },
   preallocated => sub {
      $preallocated = "";
      $preallocated .= $big for 1..100;
   },
});

I'm not saying pre-allocating never helps. There could be scenarios where it does —larger numbers?— just not here.

One of the reasons it has little effect is that Perl allocates exponentially more memory, which is to say the number of allocations increases only logarithmically as the loop sizes grow. The following shows only 21 reallocs for the 100 loop passes:

use strict;
use warnings;
use feature qw( say );

use B qw( svref_2object );

sub SvLEN(\$) { svref_2object($_[0])->LEN }

my $big = "a" x 100;

my $str = "";
my $incs = 0;
for (1..100) {
   my $len1 = SvLEN($str);
   $str .= $big;
   my $len2 = SvLEN($str);
   my $len_inc = $len2 - $len1;
   #say $len1, " ", $len_inc;
   ++$incs if $len_inc;
}

say $incs;  # 21
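To see why proportional growth keeps the realloc count low, here is a rough simulation. It is only a sketch: the growth factor of 1.25 and the count_reallocs helper are assumptions made for illustration, not Perl's actual allocation policy, so the counts will not necessarily match the measurement above.

```perl
use strict;
use warnings;

# Count buffer reallocations for $appends appends of $chunk bytes each,
# under a hypothetical policy: when out of room, grow the capacity by
# $factor (or to the needed length, whichever is larger).
sub count_reallocs {
    my ($appends, $chunk, $factor) = @_;
    my ($len, $cap, $reallocs) = (0, 0, 0);
    for (1 .. $appends) {
        $len += $chunk;
        next if $len <= $cap;            # still fits: no realloc needed
        my $grown = int($cap * $factor);
        $cap = $grown > $len ? $grown : $len;
        ++$reallocs;
    }
    return $reallocs;
}

printf "%8d appends: %d reallocs\n", $_, count_reallocs($_, 100, 1.25)
    for 100, 1_000, 10_000;
```

Under this assumed policy, each tenfold increase in the number of appends adds roughly a constant number of reallocs instead of multiplying them by ten.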

Upvotes: 2

Borodin

Reputation: 126722

I suggest that you avoid clever tricks. Perl's handling of string memory has improved vastly in ten years: it now pre-expands every string proportionally to its original size, and retains any memory allocated in case the program repeats the same behaviour.

You can squeeze another ten percent of performance out of the algorithm by using lexical variables and avoiding the C-style for loop.

Also, Time::HiRes already provides tv_interval for calculating the difference between two calls to gettimeofday.

use strict;
use warnings 'all';

use Time::HiRes qw/ gettimeofday tv_interval /;

my $big = 'a' x 100;

my $start = [ gettimeofday ];

for my $i (1 .. 1_000_000 ) {

    my $str;

    for my $j ( 1 .. 100 ) {
        $str .= $big;
    }
}

my $end = [ gettimeofday ];

printf "took %.3f secs\n", tv_interval( $start, $end );

output

took 8.324 secs


Incidentally, the same program running on my Pixel C tablet running Android 7.1.2 on an ARM processor returned 21.683s. I think that's pretty good going.

Upvotes: 5
