aardvarkk

Reputation: 15996

Invalid UTF-8 Ruby strings

I'm running into an inconsistency between the way Ruby (v2.5.3) handles encoded strings and the way the YAML parser does. Here's an example:

"\x80"          # Returns "\x80"
"\x80".bytesize # Returns 1
"\x80".bytes    # Returns [128]
"\x80".encoding # Returns UTF-8

require 'yaml'

YAML.load('{value: "\x80"}')["value"]          # Returns "\u0080"
YAML.load('{value: "\x80"}')["value"].bytesize # Returns 2
YAML.load('{value: "\x80"}')["value"].bytes    # Returns [194, 128]
YAML.load('{value: "\x80"}')["value"].encoding # Returns UTF-8

My understanding of UTF-8 is that any code point above 0x7F must be encoded using at least two bytes. So my questions are the following:

  1. Is the one-byte string "\x80" valid UTF-8?
  2. If so, why does YAML convert it into a two-byte sequence?
  3. If not, why does Ruby report the encoding as UTF-8 when the string contains an invalid byte sequence?
  4. Is there a way to make the YAML parser and the Ruby string behave the same way?

Upvotes: 1

Views: 1310

Answers (1)

Max

Reputation: 22325

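No, "\x80" is not valid UTF-8. In UTF-8 the byte 0x80 can only appear as a continuation byte inside a multi-byte sequence, never on its own: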

"\x80".valid_encoding?
# false

Ruby reports UTF-8 because all String literals are tagged as UTF-8 by default, even if the bytes they contain are not valid in that encoding.
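To make the difference between the encoding label and the actual bytes concrete, here is a minimal sketch. The helper name describe_string is made up for illustration and is not part of Ruby; it only collects what String#encoding, String#bytes and String#valid_encoding? report.

```ruby
# Hypothetical helper (not a Ruby built-in): summarise how Ruby sees a string.
def describe_string(str)
  {
    encoding: str.encoding,        # the label attached to the string
    bytes:    str.bytes,           # the raw bytes actually stored
    valid:    str.valid_encoding?  # whether the bytes are legal under that label
  }
end

# A lone 0x80 byte: labelled UTF-8, but not a legal UTF-8 sequence.
describe_string("\x80")

# The code point U+0080: encoded as the two bytes 0xC2 0x80, which is valid.
describe_string("\u0080")

# The same lone byte relabelled as binary: any byte is valid there.
describe_string("\x80".b)
```

The first and third calls hold the same single byte; only the label differs, which is why one reports an invalid encoding and the other does not.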

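As for YAML: the Ruby literal in your example is single-quoted, so Ruby leaves \x80 alone and the parser receives the four characters \x80. Inside a double-quoted YAML scalar that is an escape for the character U+0080, which is encoded in UTF-8 as the two bytes 0xC2 0x80. I don't think you can force the YAML parser to return invalid UTF-8, but you can get Ruby to convert the byte to that character like this: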

"\x80".b.ord.chr('utf-8')
# "\u0080"

String#b is only available in Ruby 2.0 and later. On older versions you need to use force_encoding instead.
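As a rough sketch of both routes, the snippet below shows a force_encoding fallback for the single-byte case and a generalisation to longer strings, then checks that the result matches what the YAML parser returns for the question's input. The method names first_byte_as_char, bytes_to_code_points and yaml_value are hypothetical, chosen only for this example.

```ruby
require "yaml"

# Hypothetical fallback for Rubies without String#b: relabel a copy as
# binary, read the first byte, and build the UTF-8 character for that number.
def first_byte_as_char(str)
  str.dup.force_encoding(Encoding::BINARY).ord.chr(Encoding::UTF_8)
end

# Hypothetical generalisation: treat every raw byte as a code point number
# and encode each one as a UTF-8 character.
def bytes_to_code_points(str)
  str.each_byte.map { |byte| byte.chr(Encoding::UTF_8) }.join
end

# Load the question's document. The single quotes keep \x80 as literal
# text, so it is the YAML parser that expands the escape.
def yaml_value(source)
  YAML.load(source)["value"]
end

from_ruby = bytes_to_code_points("\x80")
from_yaml = yaml_value('{value: "\x80"}')
puts from_ruby == from_yaml
```

Note that this changes the Ruby side to match YAML, not the other way round; whether that is the right direction depends on whether the data is meant to be text or raw bytes.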

Upvotes: 3
