Reputation: 23
I'm using RegExp to match a series of bytes other than 0x1b from the sequence [0xcb, 0x98, 0x1b]:
var r = /[^\x1b]+/g;
// Buffer.from() replaces the deprecated new Buffer() constructor
r.exec(Buffer.from([0xcb, 0x98, 0x1b]));
console.log(r.lastIndex); // 1
I expected that the pattern would match 0xcb, 0x98 and r.lastIndex == 2, but it matches 0xcb only and r.lastIndex == 1.
Is that a bug or something?
Upvotes: 2
Views: 4660
Reputation: 106726
regexp.exec() implicitly converts its argument to a string, and Buffer's default encoding for toString() is UTF-8. With that encoding a multi-byte sequence decodes to a single character, so you can no longer see the individual bytes. Instead, you will need to explicitly use the 'latin1' encoding (e.g. Buffer.from([0xcb, 0x98, 0x1b]).toString('latin1')), which is a single-byte encoding (making the result the equivalent of '\xcb\x98\x1b').
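As a minimal sketch of that suggestion (the variable names here are only illustrative), decoding with 'latin1' keeps one character per byte, so the regex indices line up with byte offsets:

```javascript
// Decode as latin1 so each byte maps to exactly one character.
const buf = Buffer.from([0xcb, 0x98, 0x1b]);
const text = buf.toString('latin1'); // equivalent to '\xcb\x98\x1b'

const r = /[^\x1b]+/g;
const match = r.exec(text);

console.log(match[0].length); // 2
console.log(r.lastIndex);     // 2

// Convert the matched text back to bytes using the same encoding.
const matchedBytes = Buffer.from(match[0], 'latin1');
console.log(matchedBytes);    // <Buffer cb 98>
```

Re-encoding the match with 'latin1' gives back the original bytes, so the string is only a temporary view of the buffer.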
Upvotes: 4
Reputation: 1
You can use Array.prototype.filter() to return an array containing only the values not equal to 0x1b:
Buffer.from([0xcb, 0x98, 0x1b].filter(byte => byte !== 0x1b))
Upvotes: 0
Reputation: 60577
RegExp.prototype.exec works on strings. This means that the Buffer is being implicitly cast to a string by the toString method.
In doing so, the bytes are read as a UTF-8 string, as UTF-8 is the default encoding. From Node's Buffer documentation:
buf.toString([encoding[, start[, end]]])
encoding <string> The character encoding to decode to. Default: 'utf8'
...
0xcb and 0x98 are read as a single UTF-8 character (˘), thus the third byte ends up being at the 1 index, not the 2 index.
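A short check of the decoding described above, assuming only the standard Buffer and String methods (variable names are illustrative):

```javascript
const bytes = Buffer.from([0xcb, 0x98, 0x1b]);
const decoded = bytes.toString(); // default encoding is 'utf8'

// Three bytes decode to two characters: U+02D8 (˘) followed by 0x1b.
console.log(decoded.length);                      // 2
console.log(decoded.codePointAt(0).toString(16)); // '2d8'
console.log(decoded.indexOf('\x1b'));             // 1
```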
One option might be to explicitly call the toString method with a different encoding, but I'm thinking regex is probably not the best option here.
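One possible regex-free alternative is to work on the bytes directly with buf.indexOf() and buf.subarray(). The helper name takeUntilByte is hypothetical, and this sketch only returns the bytes before the first stop byte, which is a simpler behaviour than the original global regex:

```javascript
// Hypothetical helper: returns the bytes that precede the first `stop` byte,
// or the whole buffer when `stop` does not occur.
function takeUntilByte(buf, stop) {
  const end = buf.indexOf(stop);
  return end === -1 ? buf : buf.subarray(0, end);
}

const input = Buffer.from([0xcb, 0x98, 0x1b]);
const head = takeUntilByte(input, 0x1b);

console.log(head);        // <Buffer cb 98>
console.log(head.length); // 2
```

Because no string conversion happens, the offsets are always byte offsets regardless of encoding. Note that subarray() returns a view over the same memory rather than a copy.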
Upvotes: 2