mak

Reputation: 323

Perl's odd behavior of index() when called with empty substr with vs without Encode::decode()

I have written a small sample program to illustrate how the return value of Perl's index() function for an empty substr ("") changes depending on whether the string being searched was passed through Encode::decode().

use strict;
use Encode;

my $mainString = (@ARGV >= 2) ? $ARGV[1] : "abc";
my $subString  = (@ARGV >= 3) ? $ARGV[2] : "";
if (@ARGV >= 1) {
    $mainString = Encode::decode("utf8", $mainString);
}

my $position = index($mainString, $subString, 0);
my $loopCount = 0;
my $stopLoop  = 7; # It goes for ever so set a stopping value
while ($position >= 0) {
    if ($loopCount >= $stopLoop) {
        last;
    }
    $loopCount++;

    print "[$loopCount]: $position \"$mainString\" [".length($mainString)."] ($subString)\n";
    $position = index($mainString, $subString, $position + 1);
}

Before getting into the difference with and without Encode::decode(): what should index() return for an empty substr ("")? Perl's documentation does not say. Here is the result without calling Encode::decode(), for the ASCII string "abc" (@ARGV = 0):

>perl StringIndex.pl
[1]: 0 "abc" [3] ()
[2]: 1 "abc" [3] ()
[3]: 2 "abc" [3] ()
[4]: 3 "abc" [3] ()
[5]: 3 "abc" [3] ()
[6]: 3 "abc" [3] ()
[7]: 3 "abc" [3] ()

However, when decoding is involved, the return value changes. With Encode::decode() applied to the same ASCII string "abc" ($ARGV[0] = 1), index() behaves as if the string being searched were not bounded by its length:

>perl StringIndex.pl 1
[1]: 0 "abc" [3] ()
[2]: 1 "abc" [3] ()
[3]: 2 "abc" [3] ()
[4]: 3 "abc" [3] ()
[5]: 4 "abc" [3] ()
[6]: 5 "abc" [3] ()
[7]: 6 "abc" [3] ()

As a Side Note:

  1. substr is set to an empty string ("") in the example above, but in my real program it is a variable whose value depends on a condition.
  2. I understand the simplest solution is to check whether substr is empty and, if so, skip the while loop.
  3. I am using "This is perl 5, version 28, subversion 1 (v5.28.1) built for MSWin32-x64-multi-thread"
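The guard mentioned in note 2 could look something like the sketch below. The helper name find_all is made up for this illustration and is not part of the original program; it simply refuses to loop when the needle is empty.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original program): returns every
# position at which $needle occurs in $haystack, including overlaps.
sub find_all {
    my ($haystack, $needle) = @_;
    my @positions;

    # An empty needle "matches" at every offset, so bail out before looping.
    return @positions if !defined($needle) || $needle eq "";

    my $pos = index($haystack, $needle);
    while ($pos >= 0) {
        push @positions, $pos;
        $pos = index($haystack, $needle, $pos + 1);
    }
    return @positions;
}

my @hits = find_all("abcabc", "bc");
print join(",", @hits), "\n";      # prints 1,4

my @none = find_all("abc", "");
print scalar(@none), "\n";         # prints 0
```

With the guard in place the loop never sees an empty substr, so the stop counter from the sample program is no longer needed.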

Upvotes: 3

Views: 181

Answers (1)

ikegami

Reputation: 386561

This would be considered a bug, which I've reported here.


Minimal code to reproduce:

use strict;
use warnings;
no warnings qw( void );
use feature qw( say );

my $s = "abc";
my $len = length($s);
utf8::upgrade($s);
length($s) if $ARGV[0];
say index($s, "", $len+1);

$ perl a.pl 0
3

$ perl a.pl 1
4

Perl has two string storage formats: the "upgraded" format and the "downgraded" format.

Encode::decode always returns an upgraded string, and utf8::upgrade tells Perl to switch the storage format used by a scalar.

Each character of a downgraded string can store a number between 0 and 255. Each character of the string is stored as a byte of the appropriate value. This, of course, is fine if you have bytes or ASCII text. But this is insufficient for arbitrary text.

Each character of an upgraded string can store a number between 0 and 2^32-1 or between 0 and 2^64-1, depending on how your Perl was compiled. This is more than enough to store any Unicode Code Point (even those outside the BMP). Each character is encoded using "utf8", a nonstandard extension of UTF-8.
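As a rough illustration of the two formats, the sketch below uses utf8::is_utf8 to report which one a scalar currently uses. The helper storage_format is invented for this example. Note that the string's value does not change when the format does.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Helper for this example only: names the storage format of a scalar.
sub storage_format {
    return utf8::is_utf8($_[0]) ? "upgraded" : "downgraded";
}

my $s = "abc";
print storage_format($s), "\n";    # downgraded

utf8::upgrade($s);                 # same characters, different storage
print storage_format($s), "\n";    # upgraded
print(($s eq "abc" ? "equal" : "different"), "\n");   # equal

utf8::downgrade($s);               # works because every character is <= 255
print storage_format($s), "\n";    # downgraded

# Per the answer, a decoded string comes back in the upgraded format.
my $decoded = decode("UTF-8", "abc");
print storage_format($decoded), "\n";
```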

utf8 (like UTF-8) is a variable-length encoding. This presents two problems:

  • Determining the length of an upgraded string requires iterating over the entire string.
  • Determining the position of characters in an upgraded string requires iterating over the string.
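To make the variable-length point concrete, here is a small sketch with an arbitrarily chosen three-character string. It uses utf8::encode to obtain the UTF-8 bytes; for these code points they are the same as in Perl's internal "utf8". Character offsets and byte offsets diverge as soon as a character needs more than one byte.

```perl
use strict;
use warnings;

# "a", e-acute (U+00E9), euro sign (U+20AC): 3 characters.
my $chars = "a\x{E9}\x{20AC}";

# Copy the string and convert the copy, in place, to its UTF-8 bytes.
my $bytes = $chars;
utf8::encode($bytes);

print length($chars), "\n";    # prints 3
print length($bytes), "\n";    # prints 6 (1 + 2 + 3 bytes)

# The euro sign is character 2, but its encoding starts at byte 3.
print index($chars, "\x{20AC}"), "\n";        # prints 2
print index($bytes, "\xE2\x82\xAC"), "\n";    # prints 3
```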

Let's consider the following snippet:

index($str, $substr, $pos)

With a downgraded string, index can jump directly to the position indicated by $pos. It's a question of simple pointer arithmetic.

But because each character of an upgraded string can require a different amount of storage, index can't use pointer arithmetic to find the character at position $pos. Without optimizations, each call to index would have to start at offset 0 and move through the string until it finds the character indicated by $pos.

That would be unfortunate. Imagine if index was being used in a loop to find all matches. So Perl optimizes this! When the length of an upgraded string becomes known, Perl attaches it to the scalar.

$ perl -MDevel::Peek -e'
   $s = "abc";
   utf8::upgrade($s);
   Dump($s);
   length($s);
   Dump($s);
'
SV = PV(0x56483dda7e80) at 0x56483ddd5ba0
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x56483ddda3f0 "abc"\0 [UTF8 "abc"]
  CUR = 3
  LEN = 10
SV = PVMG(0x56483de0ecf0) at 0x56483ddd5ba0
  REFCNT = 1
  FLAGS = (SMG,POK,pPOK,UTF8)
  IV = 0
  NV = 0
  PV = 0x56483ddda3f0 "abc"\0 [UTF8 "abc"]
  CUR = 3
  LEN = 10
  MAGIC = 0x56483ddd4050
    MG_VIRTUAL = &PL_vtbl_utf8
    MG_TYPE = PERL_MAGIC_utf8(w)
    MG_LEN = 3                           <-- Attached length

Similarly, the offset of characters is sometimes attached to the scalar as well!

$ perl -MDevel::Peek -e'
   $s = "abc";
   utf8::upgrade($s);
   Dump($s);
   index($s, "", 2);
   Dump($s);
'
SV = PV(0x558d5c970e80) at 0x558d5c99ebc0
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x558d5c9ae3a0 "abc"\0 [UTF8 "abc"]
  CUR = 3
  LEN = 10
SV = PVMG(0x558d5c9d7d10) at 0x558d5c99ebc0
  REFCNT = 1
  FLAGS = (SMG,POK,pPOK,UTF8)
  IV = 0
  NV = 0
  PV = 0x558d5c9ae3a0 "abc"\0 [UTF8 "abc"]
  CUR = 3
  LEN = 10
  MAGIC = 0x558d5c9af690
    MG_VIRTUAL = &PL_vtbl_utf8
    MG_TYPE = PERL_MAGIC_utf8(w)
    MG_LEN = -1
    MG_PTR = 0x558d5c99cb80
       0: 2 -> 2                     <-- Attached character offset
       1: 0 -> 0                     <-- Attached character offset

The difference in behaviour is due to different code paths being exercised based on the string format and what information is cached.
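Until the bug is fixed, one possible workaround is to clamp the start position yourself, so that index is never asked to start past the end of the string. This is only a sketch: index_clamped is a made-up name, and the approach assumes the inconsistency only shows up for positions beyond length($str), as in the examples above.

```perl
use strict;
use warnings;

# Hypothetical wrapper (an assumption, not from the bug report): clamp the
# start position to the range 0 .. length($str) before calling index().
sub index_clamped {
    my ($str, $substr, $pos) = @_;
    my $len = length($str);
    $pos = 0    if !defined($pos) || $pos < 0;
    $pos = $len if $pos > $len;
    return index($str, $substr, $pos);
}

my $s = "abc";
utf8::upgrade($s);
my $len = length($s);    # caches the length on the upgraded scalar

print index_clamped($s, "", $len + 1), "\n";    # prints 3
print index_clamped("abc", "", 10), "\n";       # prints 3
print index_clamped("abc", "c", 10), "\n";      # prints -1
```

A loop over an empty substr would still never terminate with this wrapper, since the result stays at the string length, so the emptiness check is still needed.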

Upvotes: 3
