Amadan

Reputation: 198436

REXML and encoding

Can anyone please explain this result for me?

#!/usr/bin/env ruby
# encoding: utf-8

require 'rexml/document'

doc = REXML::Document.new(DATA)
puts "doc: #{doc.encoding}"
REXML::XPath.each(doc, '//item') do |item|
  puts "  #{item}: #{item.to_s.encoding}"
end

__END__
<doc>
  <item>Test</item>
  <item>Über</item>
  <item>8</item>
</doc>

Output:

doc: UTF-8
  <item>Test</item>: US-ASCII
  <item>Über</item>: UTF-8
  <item>8</item>: US-ASCII

It seems as if REXML ignores the document encoding and autodetects an encoding for each item instead. Am I doomed to call encode('UTF-8') on each string I pull out of REXML, even though the original document is UTF-8? What is happening here?

Upvotes: 0

Views: 786

Answers (1)

Darshan Rivka Whittle

Reputation: 34041

You're calling to_s on the Element itself (Element inherits it from Node), which serializes the whole element, tags included. To get the actual text, call get_text on the Element and then to_s on the Text node it returns:

puts "  #{item}: #{item.get_text.to_s.encoding}"

Output:

doc: UTF-8
  <item>Test</item>: UTF-8
  <item>Über</item>: UTF-8
  <item>8</item>: UTF-8
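
If you pull text out in several places, the same idea can be wrapped in a small helper. This is only a sketch: `element_texts` is a hypothetical name, not part of REXML, and it assumes the document is passed in as a UTF-8 string rather than read from DATA. It relies only on REXML::XPath.match and Element#get_text.

```ruby
require 'rexml/document'

# Hypothetical helper (not part of REXML): return the text of every
# element matched by the given XPath expression.
def element_texts(doc, xpath)
  REXML::XPath.match(doc, xpath).map do |element|
    # Element#get_text returns a REXML::Text node, or nil when the
    # element has no text child; nil.to_s is simply "".
    element.get_text.to_s
  end
end

xml = <<~XML
  <doc>
    <item>Test</item>
    <item>Über</item>
    <item>8</item>
  </doc>
XML

# Pass a copy so the parser is free to adjust the string's encoding label.
doc = REXML::Document.new(xml.dup)

texts = element_texts(doc, '//item')
texts.each { |text| puts "  #{text}: #{text.encoding}" }
```

Note that Text#to_s gives you the escaped form (an ampersand comes back as &amp;). If you want the unescaped value, Element#text is probably the better call.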

Upvotes: 1
