Reputation: 15996
I'm running into some inconsistency between the way Ruby (v2.5.3) handles encoded string literals and what the YAML parser returns. Here's an example:
"\x80" # Returns "\x80"
"\x80".bytesize # Returns 1
"\x80".bytes # Returns [128]
"\x80".encoding # Returns UTF-8
YAML.load('{value: "\x80"}')["value"] # Returns "\u0080"
YAML.load('{value: "\x80"}')["value"].bytesize # Returns 2
YAML.load('{value: "\x80"}')["value"].bytes # Returns [194, 128]
YAML.load('{value: "\x80"}')["value"].encoding # Returns UTF-8
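Put together, the comparison can be reproduced as a short script (a sketch; only the standard library's yaml is assumed):

```ruby
require 'yaml'

raw    = "\x80"                                 # literal byte 0x80, labelled UTF-8
parsed = YAML.load('{value: "\x80"}')["value"]  # Psych decodes \x80 as code point U+0080

raw.bytes     # => [128]      -- a single raw byte
parsed.bytes  # => [194, 128] -- U+0080 encoded as two UTF-8 bytes
```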
My understanding of UTF-8 is that any single-byte value above 0x7F should be encoded into two bytes. So my question is: is "\x80" valid UTF-8?
Upvotes: 1
Views: 1310
Reputation: 22325
It is not valid UTF-8:
"\x80".valid_encoding?
# false
Ruby is claiming it is UTF-8 because all String literals are UTF-8 by default, even if that makes them invalid.
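This is easy to check with the standard String API (a sketch; `Encoding::BINARY` is just Ruby's alias for ASCII-8BIT):

```ruby
s = "\x80"
s.encoding         # => #<Encoding:UTF-8>, the default source encoding
s.valid_encoding?  # => false: 0x80 is a continuation byte, never a lead byte

# Reinterpreted as raw binary, any byte sequence is valid:
s.dup.force_encoding(Encoding::BINARY).valid_encoding?  # => true
```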
I don't think you can force the YAML parser to return invalid UTF-8. But to get Ruby to convert that single byte into its two-byte UTF-8 form, you can do this:
"\x80".b.ord.chr('utf-8')
# "\u0080"
.b is only available in Ruby 2.0+; on older versions you need force_encoding('ASCII-8BIT') instead.
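Note that the `.ord.chr` trick only converts the first byte. For a string containing several such bytes, one alternative (my own suggestion, not part of the answer above) is to reinterpret the bytes as ISO-8859-1, whose 256 code points map one-to-one onto U+0000..U+00FF, and then transcode; `latin1_to_utf8` here is a hypothetical helper name:

```ruby
# Treat each raw byte as an ISO-8859-1 code point, then transcode to
# UTF-8; byte 0x80 becomes U+0080 (bytes 0xC2 0x80), and so on.
def latin1_to_utf8(str)
  str.b.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

latin1_to_utf8("\x80\x81")  # => "\u0080\u0081"
```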
Upvotes: 3