Reputation: 1587
I recently started working with encoding in Ruby, and am confused by some behavior.
I'm using Ruby 2.2.3p173 and am seeing the following:
__ENCODING__ #=> #<Encoding:UTF-8> Default encoding in 2.2.3
"my_string".encoding #=> #<Encoding:UTF-8>
Object.to_s.encoding #=> #<Encoding:US-ASCII>
Object.new.to_s.encoding #=> #<Encoding:ASCII-8BIT>
What's the cause of this discrepancy in encodings?
Upvotes: 7
Views: 1675
Reputation: 22385
Nice find!
The short answer is that it's arbitrary: it depends on how Ruby internally builds the strings that are being returned.
There are a whole host of internal C functions that construct empty strings or literal strings with US-ASCII encoding: rb_usascii_str_new and similar. They're frequently used to construct strings by appending smaller fragments of strings. Almost every to_s method does this:
[].to_s.encoding
#<Encoding:US-ASCII>
{}.to_s.encoding
#<Encoding:US-ASCII>
$/.to_s.encoding
#<Encoding:US-ASCII>
1.to_s.encoding
#<Encoding:US-ASCII>
true.to_s.encoding
#<Encoding:US-ASCII>
Object.to_s.encoding
#<Encoding:US-ASCII>
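To re-check this pattern on another Ruby version, the calls above can be tabulated in one go. This is only a minimal sketch: `encoding_names` is a hypothetical helper name, not part of Ruby, and the results for individual entries may differ between Ruby releases.

```ruby
# Hypothetical helper: map each label to the name of its string's encoding,
# so results can be compared without reading inspect output.
def encoding_names(samples)
  samples.map { |label, str| [label, str.encoding.name] }.to_h
end

samples = {
  "[].to_s"         => [].to_s,
  "{}.to_s"         => {}.to_s,
  "1.to_s"          => 1.to_s,
  "true.to_s"       => true.to_s,
  "Object.to_s"     => Object.to_s,
  "Object.new.to_s" => Object.new.to_s,
  "literal"         => "my_string"
}

# Print one line per sample: the expression and the encoding it produced.
report = encoding_names(samples)
report.each { |label, name| puts "#{label.ljust(18)} #{name}" }
```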
So why not Object.new.to_s? The key here is that Object#to_s is the fallback to_s method for every class, so in order to make it generic-yet-informative they coded it to output the value of the object's internal pointer. The easiest way to do that is with sprintf and the %p specifier. BUT whoever coded Ruby's sprintf wrapper rb_sprintf got lazy and just set the encoding to NULL, which falls back to ASCII-8BIT. So generally anything that returns a formatted string will have this encoding:
Object.new.to_s.encoding
#<Encoding:ASCII-8BIT>
nil.sort rescue $!.to_s.encoding
#<Encoding:ASCII-8BIT>
[].each.to_s.encoding
#<Encoding:ASCII-8BIT>
As for string literals in a script, those get the source encoding, which defaults to UTF-8 (since Ruby 2.0), as you would expect.
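In practice the differing tags rarely matter, because these to_s results contain only 7-bit ASCII bytes, and Ruby treats ASCII-only strings as compatible with UTF-8 strings. The sketch below illustrates that, assuming you want a UTF-8 tagged copy; `to_utf8_if_ascii` is a hypothetical helper, not something Ruby provides.

```ruby
# Hypothetical helper: return a UTF-8 tagged copy of a string whose bytes
# are all 7-bit ASCII. Raises ArgumentError for any other content.
def to_utf8_if_ascii(str)
  raise ArgumentError, "non-ASCII content" unless str.ascii_only?
  str.dup.force_encoding(Encoding::UTF_8)
end

formatted = Object.new.to_s   # e.g. "#<Object:0x...>"
accented  = "caf\u00e9"       # the \u escape makes this literal UTF-8

puts formatted.encoding       # depends on the Ruby version
puts formatted.ascii_only?    # true: class name plus a hex address

# Concatenating with a non-ASCII UTF-8 string works and yields UTF-8.
combined = accented + " " + formatted
puts combined.encoding

# Re-tagging is safe here because the bytes are plain ASCII.
puts to_utf8_if_ascii(formatted).encoding
```

The helper checks `ascii_only?` first because `force_encoding` only changes the tag and never converts bytes, so it would silently mislabel binary data.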
Upvotes: 6
Reputation: 7779
Object is defined in C. If you try the following:
String(123456).encoding #=> #<Encoding:US-ASCII>
"123456".encoding #=> #<Encoding:UTF-8>
I did not dig far into the Ruby source code, but it looks like the encoding is hardcoded (rb_usascii_str_new2) for to_s.
Upvotes: 2