Reputation: 323
I have created a small sample below to illustrate how the return value of Perl's index() function changes for an empty substr ("") depending on whether the string being searched has been passed through Encode::decode().
use strict;
use Encode;

# $ARGV[0]: any value enables Encode::decode(); $ARGV[1]: string; $ARGV[2]: substr
my $mainString = (@ARGV >= 2) ? $ARGV[1] : "abc";
my $subString  = (@ARGV >= 3) ? $ARGV[2] : "";
if (@ARGV >= 1) {
    $mainString = Encode::decode("utf8", $mainString);
}

my $position  = index($mainString, $subString, 0);
my $loopCount = 0;
my $stopLoop  = 7; # It goes on forever, so set a stopping value
while ($position >= 0) {
    if ($loopCount >= $stopLoop) {
        last;
    }
    $loopCount++;
    print "[$loopCount]: $position \"$mainString\" [".length($mainString)."] ($subString)\n";
    $position = index($mainString, $subString, $position + 1);
}
Before getting into the behaviour with vs. without Encode::decode(): what should the return value of index() be for an empty substr ("")? Perl's documentation does not mention it. Even so, here is the execution result without calling Encode::decode() for the ASCII characters "abc" (@ARGV == 0):
>perl StringIndex.pl
[1]: 0 "abc" [3] ()
[2]: 1 "abc" [3] ()
[3]: 2 "abc" [3] ()
[4]: 3 "abc" [3] ()
[5]: 3 "abc" [3] ()
[6]: 3 "abc" [3] ()
[7]: 3 "abc" [3] ()
However, when decoding is involved, the return value changes. It behaves as if the string being searched were not bounded by its length. Here is the result when Encode::decode() is called for the ASCII characters "abc" ($ARGV[0] = 1):
>perl StringIndex.pl 1
[1]: 0 "abc" [3] ()
[2]: 1 "abc" [3] ()
[3]: 2 "abc" [3] ()
[4]: 3 "abc" [3] ()
[5]: 4 "abc" [3] ()
[6]: 5 "abc" [3] ()
[7]: 6 "abc" [3] ()
As a side note: substr is set to the empty string ("") in the example above, but in my real program it is a variable whose value changes depending on conditions. One option is to check whether substr is empty and not enter the while loop.
Upvotes: 3
Views: 181
Reputation: 386561
This would be considered a bug, which I've reported here.
Minimal code to reproduce:
use strict;
use warnings;
no warnings qw( void );
use feature qw( say );
my $s = "abc";
my $len = length($s);
utf8::upgrade($s);
length($s) if $ARGV[0];
say index($s, "", $len+1);
$ perl a.pl 0
3
$ perl a.pl 1
4
Perl has two string storage formats: the "upgraded" format and the "downgraded" format. Encode::decode always returns an upgraded string, and utf8::upgrade tells Perl to switch the storage format used by a scalar.
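To make the two formats concrete, here is a minimal sketch that uses utf8::is_utf8 to report which storage format a scalar currently uses. The variable names are only for illustration. It shows that upgrading changes the storage, not the text.

```perl
use strict;
use warnings;

my $downgraded = "abc";
my $upgraded   = "abc";
utf8::upgrade($upgraded);   # switch the storage format; the text is unchanged

# utf8::is_utf8 returns true when the scalar uses the upgraded format.
printf "first:  %s\n", utf8::is_utf8($downgraded) ? "upgraded" : "downgraded";
printf "second: %s\n", utf8::is_utf8($upgraded)   ? "upgraded" : "downgraded";

# Both scalars still hold the same three characters.
print "same text\n"   if $downgraded eq $upgraded;
print "same length\n" if length($downgraded) == length($upgraded);
```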
Each character of a downgraded string can store a number between 0 and 255. Each character of the string is stored as a byte of the appropriate value. This, of course, is fine if you have bytes or ASCII text. But this is insufficient for arbitrary text.
Each character of an upgraded string can store a number between 0 and 2^32-1 or between 0 and 2^64-1, depending on how your Perl was compiled. This is more than enough to store any Unicode Code Point (even those outside the BMP). Each character is encoded using "utf8", a nonstandard extension of UTF-8.
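As a rough illustration of the storage difference, the sketch below compares the character count with the number of bytes used internally, before and after upgrading. It assumes the core bytes module is acceptable for inspection purposes; bytes::length is used only to peek at the storage and is not something normal code should rely on.

```perl
use strict;
use warnings;
use bytes ();   # load bytes::length without enabling the pragma

my $s = "\xE9";                          # one character, code point 233

my $chars_before = length($s);           # characters
my $bytes_before = bytes::length($s);    # bytes used by the downgraded format

utf8::upgrade($s);

my $chars_after = length($s);            # still one character
my $bytes_after = bytes::length($s);     # bytes used by the upgraded format

print "before: $chars_before char, $bytes_before byte(s)\n";
print "after:  $chars_after char, $bytes_after byte(s)\n";
```

The character count stays the same, while the number of bytes needed for a character above 127 grows once the string is upgraded.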
utf8 (like UTF-8) is a variable-length encoding. This presents two problems:
Let's consider the following snippet:
index($str, $substr, $pos)
With a downgraded string, index can jump directly to the position indicated by $pos. It's a question of simple pointer arithmetic.
But because each character of an upgraded string can require a different amount of storage, index can't use pointer arithmetic to find the character at position $pos. Without optimizations, each call to index would have to start at offset 0 and move through the string until it finds the character indicated by $pos.
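The cost of that scan can be sketched in plain Perl. The helper below, byte_offset_of_char, is a hypothetical name used only for illustration; it is not how Perl implements this internally. It walks a string of UTF-8 encoded bytes from offset 0 and uses each lead byte to decide how many bytes to skip, which is the linear work a downgraded string avoids.

```perl
use strict;
use warnings;

# Hypothetical helper: convert a character position into a byte offset
# by scanning UTF-8 encoded bytes from the start of the string.
sub byte_offset_of_char {
    my ($bytes, $char_pos) = @_;
    my $offset = 0;
    my $end    = length $bytes;
    while ($char_pos-- > 0 && $offset < $end) {
        my $lead = ord substr($bytes, $offset, 1);
        # The lead byte tells us how many bytes this character occupies.
        $offset += $lead < 0x80 ? 1
                 : $lead < 0xE0 ? 2
                 : $lead < 0xF0 ? 3
                 :                4;
    }
    return $offset;
}

my $encoded = "a\x{E9}\x{20AC}b";   # characters of 1, 2, 3 and 1 bytes
utf8::encode($encoded);             # now a string of UTF-8 bytes

print "char $_ starts at byte ", byte_offset_of_char($encoded, $_), "\n"
    for 0 .. 3;
```

Every lookup repeats the scan from the beginning, so calling it in a loop over increasing positions does quadratic work unless offsets are cached.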
That would be unfortunate. Imagine if index were being used in a loop to find all matches. So Perl optimizes this! When the length of an upgraded string becomes known, Perl attaches it to the scalar.
$ perl -MDevel::Peek -e'
$s = "abc";
utf8::upgrade($s);
Dump($s);
length($s);
Dump($s);
'
SV = PV(0x56483dda7e80) at 0x56483ddd5ba0
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x56483ddda3f0 "abc"\0 [UTF8 "abc"]
CUR = 3
LEN = 10
SV = PVMG(0x56483de0ecf0) at 0x56483ddd5ba0
REFCNT = 1
FLAGS = (SMG,POK,pPOK,UTF8)
IV = 0
NV = 0
PV = 0x56483ddda3f0 "abc"\0 [UTF8 "abc"]
CUR = 3
LEN = 10
MAGIC = 0x56483ddd4050
MG_VIRTUAL = &PL_vtbl_utf8
MG_TYPE = PERL_MAGIC_utf8(w)
MG_LEN = 3 <-- Attached length
Similarly, the offset of characters is sometimes attached to the scalar as well!
$ perl -MDevel::Peek -e'
$s = "abc";
utf8::upgrade($s);
Dump($s);
index($s, "", 2);
Dump($s);
'
SV = PV(0x558d5c970e80) at 0x558d5c99ebc0
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x558d5c9ae3a0 "abc"\0 [UTF8 "abc"]
CUR = 3
LEN = 10
SV = PVMG(0x558d5c9d7d10) at 0x558d5c99ebc0
REFCNT = 1
FLAGS = (SMG,POK,pPOK,UTF8)
IV = 0
NV = 0
PV = 0x558d5c9ae3a0 "abc"\0 [UTF8 "abc"]
CUR = 3
LEN = 10
MAGIC = 0x558d5c9af690
MG_VIRTUAL = &PL_vtbl_utf8
MG_TYPE = PERL_MAGIC_utf8(w)
MG_LEN = -1
MG_PTR = 0x558d5c99cb80
0: 2 -> 2 <-- Attached character offset
1: 0 -> 0 <-- Attached character offset
The difference in behaviour is due to different code paths being exercised, depending on the string format and on what information is cached.
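Until this is fixed, one way to sidestep the inconsistency is to never call index with a start position beyond the end of the string. The following is a minimal sketch, not a definitive fix; find_all is a hypothetical helper name, and it relies on the outputs shown earlier, where start positions up to the string's length return the same results for both formats.

```perl
use strict;
use warnings;

# Hypothetical helper: return every position at which $substr occurs in
# $str, without ever passing index() a start position past the end.
sub find_all {
    my ($str, $substr) = @_;
    my $len = length $str;
    my @positions;
    my $pos = 0;
    while ($pos <= $len) {
        my $found = index($str, $substr, $pos);
        last if $found < 0;
        push @positions, $found;
        $pos = $found + 1;   # continue the search after this match
    }
    return @positions;
}

my $plain    = "abc";
my $upgraded = "abc";
utf8::upgrade($upgraded);

print join(",", find_all($plain, "")), "\n";
print join(",", find_all($upgraded, "")), "\n";
```

Because the loop condition bounds $pos by the string's length, an empty substr terminates after the position equal to the length instead of looping forever.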
Upvotes: 3