Jim Black

Reputation: 1482

Fastest method for checking if a LF is at the end of a large scalar in Perl?

I've come up with the following to check the final character of a $scalar for a linefeed:

if( $buffer !~ /\n$/ ) {
if( substr( $buffer, -1, 1 ) !~ /\n/ ) {
if( substr( $buffer, -1, 1 ) ne '\n' ) {

Is there a faster method I could use? The size of the $buffer scalar can get large, and I've noticed that the larger it gets, the longer these conditionals take to run. I do have another scalar containing the length of $buffer, if that would help.

Thanks

The full code:

#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw();
use Time::HiRes qw( gettimeofday tv_interval );

use constant BUFSIZE => 2 ** 21; # 2MB worked best for me, YMMV.

die "ERROR: Missing filename" if( !$ARGV[0] );

my $top = [gettimeofday];
sysopen( my $fh, $ARGV[0], Fcntl::O_RDONLY | Fcntl::O_BINARY ) or
  die "ERROR: Unable to open $ARGV[0], because $!\n";
open my $output, ">", "/dev/null";  # for 'dummy' processing

my $size = -s $ARGV[0];
my $osiz = $size;
my( $buffer, $offset, $lnCtr ) = ( "", "", 0 );
while( $size ) {
    my $read = sysread( $fh, $buffer, BUFSIZE, length($offset) );
    $size -= $read;
    my @lines = split /\n/, $buffer;
    if( substr( $buffer, -1, 1 ) ne "\n" ) {
        $offset = pop( @lines );
    } else {
        $offset = "";
    }
    for my $line ( @lines ) {
        processLine( \$line );
        $lnCtr++;
    }
    $buffer = $offset if( $offset );
}
close $fh;
print "Processed $lnCtr lines ($osiz bytes) in file: $ARGV[0] in ".
      tv_interval( $top ).
      " secs.\n";
print "Using a buffered read of ".BUFSIZE." bytes.  -  JLB\n";

sub processLine {
    if( ref($_[0]) ) {
        print $output ${$_[0]}."\n";
    } else {
        print $output $_[0]."\n";
    }
    return 0;
}

I think I've reached that 'point-of-diminishing returns' in my attempts of making this run any faster. It seems to now be able to read in data as fast as my RAID5 SSDs are able to fetch it. As you can see, there is a reason I didn't use chomp(), the input can contain hundreds of thousands of linefeeds, which I need to keep to be able to break the lines for processing.

./fastread.pl newdata.log
Processed 516670 lines (106642635 bytes) in file: newdata.log in 0.674738 secs.
Using a buffered read of 2097152 bytes.  -  JLB

Upvotes: 0

Views: 193

Answers (5)

ikegami

Reputation: 386386

Perl has two string storage formats.

One of the formats uses the same number of bytes (1) to store each possible character the string can contain. Because of that, and because Perl keeps track of how many bytes are used by a string, the performance of substr($x, -1) on a string in this format does not depend on the length of the string.

The problem with the aforementioned format is that it can only store a very limited range of characters. It could be used to store the strings "Eric" and "Éric", but not "Ελλάδα". When necessary (and even when not necessary), Perl will automatically switch a string's storage format to the other format.

The second format can store any Unicode code point as a character. In fact, it can store any 32-bit or 64-bit value (depending on perl's build settings). The downside is that a variable number of bytes is used to store each character. So even though Perl knows the number of bytes used by the entire string, it doesn't know where any character but the first one starts.* To find the last character, it must scan the entire string.
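
The two formats are easy to observe from Perl itself: utf8::upgrade forces a string into the second (variable-width) format without changing its contents. A minimal sketch:

```perl
use strict;
use warnings;

my $ascii    = "Eric";
my $upgraded = "Eric";
utf8::upgrade($upgraded);   # force the variable-width internal format

# Both strings contain the same four characters...
print length($ascii), " ", length($upgraded), "\n";           # 4 4

# ...but they use different internal storage formats.
print utf8::is_utf8($ascii)    ? "upgraded\n" : "bytes\n";    # bytes
print utf8::is_utf8($upgraded) ? "upgraded\n" : "bytes\n";    # upgraded
```

Note that utf8::is_utf8 reports the internal format, not whether the text "is Unicode" — which is exactly why substr timing can differ between two strings with identical contents.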

That said, because of properties of the storage format, it would actually be quite easy to find the last char of a string in constant time.

use Inline C => <<'__END_OF_C__';

   // O(1) version of substr($x,-1)
   SV* last_char(SV* sv) {
      STRLEN len;
      const char* s = SvPV(sv, len);

      if (!len)
         return newSVpvn("", 0);

      {
         const U32 utf8 = SvUTF8(sv);
         const char* p = s+len-1;         
         if (utf8) {
            while (p != s && (*p & 0xC0) == 0x80)
               --p;
         }

         return newSVpvn_utf8(p, s+len-p, utf8);
      }
   }

__END_OF_C__

* — It does keep a small cache of a couple of character-position-to-byte-position mappings.


You've shown code which can be cleaned up so you don't even need to check the last char for a newline.

sub processLine {
   print { $_[0] } $_[1];
}


open(my $fh, '<:raw', $ARGV[0])
   or die("Can't open $ARGV[0]: $!\n");

my $buffer = '';
my $lnCtr = 0;
while (1) {
   my $rv = sysread($fh, $buffer, BUFSIZE, length($buffer));
   die $! if !defined($rv);
   last if !$rv;

   while ($buffer =~ s/(.*\n)//) {
      processLine($output, $1);
      ++$lnCtr;
   }
}

if (length($buffer)) {
   processLine($output, $buffer);
   ++$lnCtr;
}

Notes:

  • No need for sysopen. open is simpler.
  • If you pass $buffer to sysread, it doesn't make sense to use length($offset).
  • As you can see, $offset and the copying thereof is completely unnecessary.
  • Passing a var to a sub does not copy it, so no need to pass a reference.
  • If processLine doesn't need the newline, use s/(.*)\n// instead.
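
The aliasing behaviour behind those notes can be demonstrated directly: the elements of @_ are aliases to the caller's arguments, so nothing is copied. A minimal sketch:

```perl
use strict;
use warnings;

sub modify { $_[0] .= " world" }   # @_ aliases the caller's variable

my $str = "hello";
modify($str);
print "$str\n";   # hello world -- the sub changed the caller's variable
```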

Upvotes: 3

Toto

Reputation: 91488

Here is a Benchmark:

#!/usr/bin/perl 
use strict;
use warnings;
use Benchmark qw(:all);

my $buffer = 'abc'x10_000_000;
$buffer .= "\n";
my $count = -2;
cmpthese($count, {
    'regex' => sub {
        if ($buffer !~ /\n$/) { }
    },
    'substr + regex' => sub {
        if (substr($buffer, -1, 1) !~ /\n$/) { }
    },
    'substr + ne' => sub {
        if (substr($buffer, -1, 1) ne "\n") { }
    },
    'chomp' => sub {
        if (chomp $buffer) { }
    },
});

Output:

                     Rate substr + regex  substr + ne         regex        chomp
substr + regex  6302468/s             --         -11%          -44%         -70%
substr + ne     7072032/s            12%           --          -37%         -66%
regex          11294695/s            79%          60%            --         -46%
chomp          20910531/s           232%         196%           85%           --

chomp is certainly the fastest way, though note that it removes the newline from $buffer as a side effect.

Upvotes: 1

Andy Lester

Reputation: 93735

Why are you concerned about speed? Is this piece of code in a part of your program that is measurably slow, perhaps profiled with Devel::NYTProf? If not, then I suggest you go with what is the clearest to read and the most idiomatic, which is probably

if( $buffer !~ /\n$/ )

Your final version:

if( substr( $buffer, -1, 1 ) ne '\n' )

would also be a fine choice except for your single-quoting the linefeed, thus giving you a two-character string consisting of a backslash and a lowercase n. Perhaps you're coming from C, where single characters are single-quoted and strings are double-quoted? You want

if( substr( $buffer, -1, 1 ) ne "\n" )
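
The quoting difference is easy to verify:

```perl
print length('\n'), "\n";   # 2 -- a backslash and a lowercase n
print length("\n"), "\n";   # 1 -- a single linefeed character
```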

This version

if( substr( $buffer, -1, 1 ) !~ /\n/ )

is doing a regex match where none is needed: it checks a one-character string against a single-character regex. The next person to read the code will think that's strange and wonder why you'd do that. Also, back to that speed thing, matching a string against a regex is slower than a simple equality comparison against a single character.

Upvotes: 1

David W.

Reputation: 107080

You can try chomp. Chomp will return the number of EOL characters removed from the end of a line:

if ( chomp $buffer ) {
    print "You had an LF on the end of \$buffer";
}

Of course, chomp removes the NL characters it counted.
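
If the newline has to be preserved for later processing, it can be put back after the test — a small sketch:

```perl
my $buffer = "some data\n";   # hypothetical input

if ( chomp $buffer ) {        # returns the number of characters removed
    $buffer .= "\n";          # restore the linefeed we still need
    print "buffer ended with LF\n";
}
```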

Upvotes: 0

NovaDenizen

Reputation: 5325

I suspect Perl is treating the string as UTF-8 and has to iterate over the whole thing for some reason.

You could temporarily switch to byte semantics to see if the char on the end is a newline.

See docs for Perl's bytes pragma and perlunicode.
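
For example, switching to byte semantics just for the final-character test might look like this (a sketch; the utf8::upgrade here only simulates a string already stored in the multi-byte format):

```perl
use strict;
use warnings;

my $str = "data\n";
utf8::upgrade($str);   # simulate a string stored in the multi-byte format

my $last;
{
    use bytes;                     # byte semantics within this block only
    $last = substr($str, -1, 1);   # grabs the final byte, not character
}
print $last eq "\n" ? "ends with LF\n" : "no LF\n";
```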

Upvotes: 0
