Uncle Aaroh

Reputation: 831

How to parse XML with nokogiri without losing HTML entities?

As the AFTER section of the output below shows, Ruby is stripping all the HTML entities. How can I parse XML with Nokogiri without losing them?

--- BEFORE ---

<blog:entryFull>
&lt;p&gt;&lt;iframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>

--- AFTER --- 

<blog:entryFull>
piframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946amp;show_artwork=true" width="100%" height="166" frameborder="no" scrolling="no"/iframe/p</blog:entryFull>
  </blog:example>

Here is the code:

f = File.open(item)

contents = ""
f.each {|line|
  contents << line
}

puts "--- BEFORE ---"
puts contents
puts "--- AFTER ---"

doc = Nokogiri::XML::DocumentFragment.parse(contents) 
puts doc
f.close 

Upvotes: 4

Views: 1061

Answers (3)

riverpuro

Reputation: 26

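Your test file might contain an invalid entity somewhere, such as an unescaped &. As the example below shows, a single bare & earlier in the input is enough to make Nokogiri drop the valid entities that follow it.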

nokogiri.rb:

require 'nokogiri'

puts "--- INVALID ---"
invalid_xml = <<-XML
<blog:entryFull>invalid M&Ms</blog:entryFull><!-- invalid M and M's -->
<blog:entryFull>
&lt;p&gt;&lt;iframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>
XML

doc = Nokogiri::XML::DocumentFragment.parse(invalid_xml)
puts doc

puts "--- VALID ---"
valid_xml = <<-XML
<blog:entryFull>valid M&amp;Ms</blog:entryFull><!-- valid M and M's -->
<blog:entryFull>
&lt;p&gt;&lt;iframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>
XML

doc = Nokogiri::XML::DocumentFragment.parse(valid_xml)
puts doc

result:

$ ruby nokogiri.rb
--- INVALID ---
<blog:entryFull>invalid M</blog:entryFull><!-- invalid M and M's -->
<blog:entryFull>
piframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946amp;show_artwork=true" width="100%" height="166" frameborder="no" scrolling="no"/iframe/p</blog:entryFull>
--- VALID ---
<blog:entryFull>valid M&amp;Ms</blog:entryFull><!-- valid M and M's -->
<blog:entryFull>
&lt;p&gt;&lt;iframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>

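So, you can either:

  1. Fix the input XML so that every literal & is escaped as &amp;
  2. Use the STRICT parse option so that invalid input raises an error instead of being silently repaired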
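For option 1, if you cannot fix the file at its source, one possible approach is to escape stray ampersands before handing the string to Nokogiri. This is only a rough sketch: the helper name `escape_bare_ampersands` and the regex are my own, not part of Nokogiri, and the pattern does not know about CDATA sections or comments. It also lets any well-formed named reference through (for example &nbsp;), even though XML only predefines five of them.

```ruby
# Matches an "&" that does not start a well-formed entity reference
# (named like &amp;, decimal like &#38;, or hexadecimal like &#x26;).
BARE_AMPERSAND = /&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)/

# Hypothetical helper: escapes only the stray ampersands and leaves
# existing entity references untouched.
def escape_bare_ampersands(xml)
  xml.gsub(BARE_AMPERSAND, '&amp;')
end

puts escape_bare_ampersands('<blog:entryFull>invalid M&Ms</blog:entryFull>')
puts escape_bare_ampersands('<blog:entryFull>&lt;p&gt;valid M&amp;Ms&lt;/p&gt;</blog:entryFull>')
```

The first line comes out with M&amp;Ms and the second is unchanged, so the cleaned string should then be safe to pass to Nokogiri::XML::DocumentFragment.parse as before.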

strict parsing example:

invalid_xml = <<-XML
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <blog:entryFull>invalid M&Ms</blog:entryFull>
  <blog:entryFull>
  &lt;p&gt;&lt;iframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>
</root>
XML

begin
  doc = Nokogiri::XML(invalid_xml) do |configure|
    configure.strict # strict parsing
  end
  puts doc
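rescue Nokogiri::XML::SyntaxError => e
  # strict mode raises instead of silently repairing the document
  puts "INVALID XML: #{e.message}"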
end

Upvotes: 1

Uncle Aaroh

Reputation: 831

The workaround I used was to extract the XML tag with a regex, decode the HTML entities inside it, and then parse the result with Nokogiri's HTML parser.
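Roughly, the first two steps of that workaround might look like the sketch below. This is a reconstruction under assumptions, not the exact code I ran: the helper name `entry_full_bodies` and the sample input are made up for illustration, and CGI.unescapeHTML from the standard library stands in for whichever entity decoder you prefer.

```ruby
require 'cgi'

# Hypothetical helper: pulls the body of every <blog:entryFull> element out
# with a regex and decodes one level of HTML entities in it.
def entry_full_bodies(xml)
  xml.scan(%r{<blog:entryFull>(.*?)</blog:entryFull>}m).flatten.map do |body|
    CGI.unescapeHTML(body.strip)
  end
end

# Made-up sample input in the same shape as the feed in the question.
sample = <<~XML
  <blog:entryFull>
  &lt;p&gt;&lt;iframe src="http://example.com/?a=1&amp;amp;b=2"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>
XML

entry_full_bodies(sample).each { |html| puts html }
```

Each decoded string is plain HTML markup, so it can then be given to Nokogiri::HTML::DocumentFragment.parse. Note that only one level of escaping is removed, so &amp;amp; becomes &amp; rather than a bare &.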

Upvotes: 0

Joel Brewer

Reputation: 1652

Qambar, I am unable to recreate your issue. However, I am able to produce your desired output given these files/input:

test.xml

<blog:entryFull> &lt;p&gt;&lt;iframe src="https://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true%22" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>

nokogiri.rb

require 'nokogiri'

f = File.open("./test.xml")

contents = ""
f.each {|line|
  contents << line
}

puts "--- BEFORE ---"
puts contents
puts "--- AFTER ---"

doc = Nokogiri::XML::DocumentFragment.parse(contents) 
puts doc.inner_html
f.close

Console

Development/Code » ruby nokogiri.rb
--- BEFORE ---
<blog:entryFull> &lt;p&gt;&lt;iframe src="https://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true%22" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>
--- AFTER ---
<blog:entryFull> &lt;p&gt;&lt;iframe src="https://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true%22" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>

Upvotes: 0
