horseyguy

Reputation: 29895

Why is a UTF-8 string not equal to the equivalent ASCII-8BIT string in Ruby 2.0?

I am using Ruby 2.0.

I have the following string: "\xFF\xFE"

I read it from a file with File.binread(), so the encoding of the resulting string is ASCII-8BIT. In my code, I then check whether this string was indeed read by comparing it to the literal string "\xFF\xFE" (which has encoding UTF-8, as Ruby string literals do by default).

However, the comparison returns false even though both strings contain the same bytes. The only difference is that one has encoding ASCII-8BIT and the other UTF-8.

I have two questions: (1) why does it return false? and (2) what is the best way to go about achieving what I want? I just want to check whether the string I read matches "\xFF\xFE".

Upvotes: 3

Views: 1200

Answers (1)

Stefan

Reputation: 114178

(1) why does it return false?

For two strings to compare equal, they must either have the same encoding or contain only characters that can be encoded in US-ASCII.

Comparison works as expected if the strings contain only byte values 0 to 127 (0b0xxxxxxx):

a = 'E'.encode('ISO8859-1')  #=> "E"
b = 'E'.encode('ISO8859-15') #=> "E"

a.bytes #=> [69]
b.bytes #=> [69]
a == b  #=> true

It fails if they contain any byte values 128 to 255 (0b1xxxxxxx):

a = 'É'.encode('ISO8859-1')  #=> "\xC9"
b = 'É'.encode('ISO8859-15') #=> "\xC9"

a.bytes #=> [201]
b.bytes #=> [201]
a == b  #=> false
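
The two cases above can be probed directly with String#ascii_only? and Encoding.compatible?. This is only a sketch: as far as I can tell, Encoding.compatible? follows roughly the same rule as == for non-empty strings, but treat that as an assumption rather than a documented guarantee. The variable names are illustrative.

```ruby
# "\u00C9" is 'É'; written as an escape so the example does not depend on the source encoding.
a = "\u00C9".encode('ISO-8859-1')
b = "\u00C9".encode('ISO-8859-15')

a.ascii_only?              #=> false (byte 201 is outside 0..127)
Encoding.compatible?(a, b) #=> nil, matching a == b being false

# Pure ASCII content: the encodings differ, yet the strings are compatible.
c = 'E'.encode('ISO-8859-1')
d = 'E'.encode('ISO-8859-15')

c.ascii_only?              #=> true
Encoding.compatible?(c, d) #=> #<Encoding:ISO-8859-1>, matching c == d being true
```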

Your string can't be represented in US-ASCII, because both its bytes are outside its range:

"\xFF\xFE".bytes #=> [255, 254]

Attempting to convert it doesn't produce any meaningful result:

"\xFF\xFE".encode('US-ASCII', 'ASCII-8BIT', :undef => :replace)
#=> "??"

Comparing it to a string in a different encoding will therefore always return false, regardless of what that other string contains.
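
To make this concrete for the strings in the question, here is a small sketch. It assumes a UTF-8 source file (the default), and String#b stands in for the result of File.binread; the variable names are illustrative.

```ruby
utf8   = "\xFF\xFE"     # literal; tagged UTF-8 in a UTF-8 source file
binary = "\xFF\xFE".b   # copy tagged ASCII-8BIT, like the string File.binread returns

utf8.bytes == binary.bytes  #=> true  (same bytes)
utf8 == binary              #=> false (different encodings, non-ASCII bytes)
utf8.ascii_only?            #=> false

# Pure ASCII content still compares equal across the two encodings.
"abc" == "abc".b            #=> true
```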

(2) what is the best way to go about achieving what i want?

You could compare your string to a string with the same encoding. binread returns a string in ASCII-8BIT encoding, so you could use String#b to create a compatible one:

IO.binread('your_file', 2) == "\xFF\xFE".b

or you could compare its bytes:

IO.binread('your_file', 2).bytes == [0xFF, 0xFE]
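
If the same check is needed in several places, both approaches can be wrapped in a small helper. This is a minimal sketch: same_bytes? is a hypothetical name, not part of Ruby, and data stands in for the string returned by IO.binread.

```ruby
# Hypothetical helper: true when both strings hold the same bytes,
# whatever their encoding tags say.
def same_bytes?(left, right)
  left.b == right.b   # String#b returns an ASCII-8BIT copy; the originals are untouched
end

data = "\xFF\xFE".b             # stands in for IO.binread('your_file', 2)
same_bytes?(data, "\xFF\xFE")   #=> true
data == "\xFF\xFE"              #=> false
```

Converting both sides with String#b keeps the comparison independent of how either string was created.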

Upvotes: 5
