Reputation: 198436
Can anyone please explain this result for me?
#!/usr/bin/env ruby
# encoding: utf-8
require 'rexml/document'
doc = REXML::Document.new(DATA)
puts "doc: #{doc.encoding}"
REXML::XPath.each(doc, '//item') do |item|
  puts " #{item}: #{item.to_s.encoding}"
end
__END__
<doc>
<item>Test</item>
<item>Über</item>
<item>8</item>
</doc>
Output:
doc: UTF-8
<item>Test</item>: US-ASCII
<item>Über</item>: UTF-8
<item>8</item>: US-ASCII
It seems as if REXML doesn't care what the document encoding is, and starts autodetecting the encoding for each item... Am I doomed to encode('UTF-8') each string I pull out of REXML, even though UTF-8 is the original encoding? What is happening here?
Upvotes: 0
Views: 786
Reputation: 34041
You're calling Node.to_s() on your Element. To get the actual text, add Element.get_text() to your chain (and call Text.to_s() on that):
puts " #{item}: #{item.get_text.to_s.encoding}"
Output:
doc: UTF-8
<item>Test</item>: UTF-8
<item>Über</item>: UTF-8
<item>8</item>: UTF-8
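As for what is happening: a plausible reading, which I have not checked against the REXML source, is that Node.to_s() serializes the element by appending to a fresh buffer string that starts out as US-ASCII. Ruby keeps the receiver's encoding when you append ASCII-only text, and switches to the argument's encoding only when the appended text contains non-ASCII characters, which would match the per-item results in the question. Here is a minimal, self-contained sketch of both the fix and that append rule. The variable names and the US-ASCII starting buffer are my own assumptions for illustration, and the document is read from a heredoc instead of DATA:

```ruby
require 'rexml/document'

xml = <<~XML
  <doc>
    <item>Test</item>
    <item>Über</item>
    <item>8</item>
  </doc>
XML

doc = REXML::Document.new(xml)

# The fix from above: read the text node, not the serialized element.
texts = REXML::XPath.match(doc, '//item').map { |item| item.get_text.to_s }
texts.each { |text| puts "#{text}: #{text.encoding}" }

# Assumed cause, shown with plain strings only (no REXML involved):
# appending to an empty US-ASCII buffer.
ascii_only  = String.new(encoding: Encoding::US_ASCII) << 'Test'
with_umlaut = String.new(encoding: Encoding::US_ASCII) << 'Über'

puts ascii_only.encoding   # stays US-ASCII, nothing non-ASCII was appended
puts with_umlaut.encoding  # becomes UTF-8, taken from the appended string
```

Either way, the US-ASCII strings are compatible with UTF-8, so concatenating or comparing them with UTF-8 strings works without an explicit encode('UTF-8').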
Upvotes: 1