darren

Reputation: 23

Nokogiri XML Builder creating unexpected output when scraping HTML

I am fairly new to Ruby and the world of programming, so please bear with me.

My goal is to scrape a table and then save the data to an XML file. The simple script that I've written successfully accomplishes both things. The problem I am having is the way the XML is being saved. It doesn't match the XML that I am used to seeing.

I've rummaged through quite a few examples, tutorials and forums but have yet to arrive at a solution.

I am open to any suggestions on a better way to get the data from the table as well, especially since the first three columns are all I really need. HELP!!!

Here is my script:

require 'nokogiri'
require 'open-uri'

url = "http://www.covers.com/pageLoader/pageLoader.aspx?page=/data/nba/team/pastresults/2010-2011/team404085.html"
doc = Nokogiri::HTML(open(url))

builder = Nokogiri::XML::Builder.new do |xml|
  xml.root {
    xml.items {
       doc.css('.data').each do |o|
        xml.item_content = o
       end
    }
  }
end

File.open('ATL.xml','w'){|f| f.write builder.to_xml}

puts "Scrape Completed."  

Whether it's saved to an .xml file or printed on the screen in Ruby, the XML looks like this:

<?xml version="1.0"?>
<root>
  <items>
    <item_content=>&lt;table cellpadding="2" cellspacing="1" class="data"&gt;
&lt;tr class="datahead"&gt;
&lt;td width="11%"&gt;Date&lt;/td&gt;&#xD;
    &lt;td width="21%"&gt;Vs&lt;/td&gt;&#xD;
    &lt;td width="18%"&gt;Score&lt;/td&gt;&#xD;
    &lt;td width="27%"&gt;Type&lt;/td&gt;&#xD;
    &lt;td width="13%"&gt;ATL Line&lt;/td&gt;&#xD;
    &lt;td width="10%"&gt;O/U&lt;/td&gt;&#xD;
  &lt;/tr&gt;
&lt;tr class="datarow"&gt;
&lt;td&gt;&#xD;
        01/18/11&lt;/td&gt;&#xD;
      &lt;td&gt;&#xD;
        @ &lt;a href="/pageLoader/pageLoader.aspx?page=/data/nba/team/
team404171.html"&gt;Miami&lt;/a&gt;&#xD;
        &lt;/td&gt;&#xD;
      &lt;td&gt;&#xD;
        W &lt;a href="/pageLoader/pageLoader.aspx?page=/data/nba/
results/2010-2011/boxscore795345.html"&gt;&#xD;
        93-89&lt;/a&gt; (OT)&lt;/td&gt;&#xD;
      &lt;td&gt;&#xD;
        Regular Season&lt;/td&gt;&#xD;
      &lt;td&gt;&#xD;
        W 5.5&lt;/td&gt;&#xD;
      &lt;td&gt;&#xD;
        U 194&lt;/td&gt;&#xD;
    &lt;/tr&gt;

The above output is just a snippet, as there are multiple rows (44 in total).
What is the best way to go about doing this?

Upvotes: 2

Views: 2693

Answers (1)

Phrogz

Reputation: 303178

It's not clear what you want as your output: do you want the HTML from the original included in the XML, or just the text content of the HTML? In the future, it helps to include an example of the output you wanted along with an example of the problem. Let's address both cases. First, we can reproduce your problem more simply like so:

require 'nokogiri'
doc = Nokogiri::XML <<ENDXML
  <root>
    <p class="foo">42</p>
    <p class="bar">99</p>
    <p class="foo">17</p>
  </root>
ENDXML

builder = Nokogiri::XML::Builder.new do |xml|
  xml.items {
    doc.css('.foo').each{ |o| xml.item_content = o }
  }
end    
puts builder.to_xml
#=> <?xml version="1.0"?>
#=> <items>
#=>   <item_content=>&lt;p class="foo"&gt;42&lt;/p&gt;</item_content=>
#=>   <item_content=>&lt;p class="foo"&gt;17&lt;/p&gt;</item_content=>
#=> </items>

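The equals sign appears because Ruby parses `xml.item_content = o` as a call to a method named `item_content=`, and the builder uses that name for the element; the node passed as content is then serialized and entity-escaped. If you wanted only the text content of your HTML nodes in the XML, and presuming you didn't want the equals sign to be part of the tag name, then: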

builder = Nokogiri::XML::Builder.new do |xml|
  xml.items {
    doc.css('.foo').each{ |o| xml.item_content( o.text ) }
  }
end
puts builder.to_xml
#=> <?xml version="1.0"?>
#=> <items>
#=>   <item_content>42</item_content>
#=>   <item_content>17</item_content>
#=> </items>

If, on the other hand, you did want the raw HTML in your XML, but didn't want all the entities, then make it a CDATA block:

builder = Nokogiri::XML::Builder.new do |xml|
  xml.items {
    doc.css('.foo').each{ |o| xml.item_content{ xml.cdata o } }
  }
end
puts builder.to_xml
#=> <?xml version="1.0"?>
#=> <items>
#=>   <item_content><![CDATA[<p class="foo">42</p>]]></item_content>
#=>   <item_content><![CDATA[<p class="foo">17</p>]]></item_content>
#=> </items>

An XML CDATA block allows you to use characters normally reserved for XML markup without needing to express them as character entities.

Upvotes: 4
