garythegoat

Reputation: 1587

String Encoding in Ruby

I recently started working with encoding in Ruby, and am confused by some behavior.

I'm using Ruby 2.2.3p173 and am seeing the following:

__ENCODING__             #=> #<Encoding:UTF-8>  Default encoding in 2.2.3

"my_string".encoding     #=> #<Encoding:UTF-8>
Object.to_s.encoding     #=> #<Encoding:US-ASCII>
Object.new.to_s.encoding #=> #<Encoding:ASCII-8BIT>

What's the cause of this discrepancy in encodings?

Upvotes: 7

Views: 1675

Answers (2)

Max

Reputation: 22385

Nice find!

The short answer is that it's completely arbitrary: it depends on how Ruby internally builds the strings that are being returned.

There are a whole host of internal C functions that construct empty strings or literal strings with US-ASCII encoding: rb_usascii_str_new and similar. They're frequently used to construct strings by appending smaller fragments of strings. Almost every to_s method does this:

[].to_s.encoding
#<Encoding:US-ASCII>
{}.to_s.encoding
#<Encoding:US-ASCII>
$/.to_s.encoding
#<Encoding:US-ASCII>
1.to_s.encoding
#<Encoding:US-ASCII>
true.to_s.encoding
#<Encoding:US-ASCII>
Object.to_s.encoding
#<Encoding:US-ASCII>

So why not Object.new.to_s? The key here is that Object#to_s is the fallback to_s method for every class, so in order to make it generic yet informative, it was written to output the value of the object's internal pointer. The easiest way to do that is with sprintf and the %p specifier. But whoever wrote Ruby's sprintf wrapper, rb_sprintf, got lazy and just set the encoding to NULL, which falls back to ASCII-8BIT. So generally, anything that returns a formatted string will have this encoding:

Object.new.to_s.encoding
#<Encoding:ASCII-8BIT>
nil.sort rescue $!.to_s.encoding
#<Encoding:ASCII-8BIT>
[].each.to_s.encoding
#<Encoding:ASCII-8BIT>

As for strings defined by a script, those get the default encoding UTF-8 as you would expect.
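
To see which encoding tag a given to_s result carries on your own Ruby, a small survey like the one below may help. It is only a sketch: encoding_report is a hypothetical helper, not part of Ruby, and the encodings printed can differ between Ruby versions, so no expected output is shown.

```ruby
# Hypothetical helper (not part of Ruby): report a string's encoding tag
# and whether it contains only 7-bit ASCII characters.
def encoding_report(str)
  { encoding: str.encoding.name, ascii_only: str.ascii_only? }
end

samples = {
  "literal"         => "my_string",
  "Object.to_s"     => Object.to_s,
  "Object.new.to_s" => Object.new.to_s,
  "1.to_s"          => 1.to_s
}

samples.each do |label, str|
  report = encoding_report(str)
  puts "#{label}: #{report[:encoding]} (ascii_only: #{report[:ascii_only]})"
end

# All of these strings are ASCII-only, so their differing encoding tags
# are still compatible and concatenation does not raise.
combined = "id: " + Object.new.to_s
puts combined.encoding
```

In practice the discrepancy is mostly cosmetic as long as the content stays ASCII-only, since Ruby treats ASCII-only strings in ASCII-compatible encodings as compatible with each other.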

Upvotes: 6

Aguardientico

Reputation: 7779

Object is defined in C. Compare the following:

String(123456).encoding #=> #<Encoding:US-ASCII>
"123456".encoding       #=> #<Encoding:UTF-8>

I did not dig deep into the Ruby source code, but it looks like the encoding is hardcoded for to_s (via rb_usascii_str_new2).
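
If the hardcoded encoding gets in the way, one possible workaround is to normalize the result yourself. The sketch below assumes you want UTF-8 everywhere; utf8_to_s is a hypothetical helper name, not something Ruby provides.

```ruby
# Hypothetical helper (not part of Ruby): return a UTF-8 tagged string
# for whatever obj.to_s gives back.
def utf8_to_s(obj)
  str = obj.to_s
  return str if str.encoding == Encoding::UTF_8

  if str.ascii_only?
    # The bytes are already valid UTF-8, so only the encoding tag changes.
    # dup avoids mutating the original (which may be frozen).
    str.dup.force_encoding(Encoding::UTF_8)
  else
    # Transcode, replacing bytes that cannot be mapped to UTF-8.
    str.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
  end
end

puts utf8_to_s(123456).encoding     # prints UTF-8
puts utf8_to_s(Object.new).encoding # prints UTF-8
```

For ASCII-only strings, force_encoding is enough because it relabels the bytes without converting them; encode is only needed when the string actually contains non-ASCII bytes.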

Upvotes: 2
