Arvid Janson

Reputation: 1034

Reading large XML file with Nokogiri

I'm having trouble reading a (somewhat) large XML file with Nokogiri, but can't figure out where things are going wrong. The file content looks like the following (just a single node included for readability):

<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:g="http://base.google.com/ns/1.0"><title>End | Globally Sourced Menswear</title><link type="self" link="http://www.endclothing.com/eu/"/><updated>2016-04-02T01:25:30+00:00</updated><entry><g:id>391</g:id><g:mpn>WRY3924TY</g:mpn><g:color>N/A</g:color><g:title>Comme des Garcons x Artek Standard Eau De Toilette</g:title><g:link>http://www.endclothing.com/eu/comme-des-garcons-x-artek-standard-eau-de-toilette-wry3924ty.html</g:link><g:price>89.00 EUR</g:price><g:availability>in stock</g:availability><g:brand>CDG Parfum</g:brand><g:custom_label_0>Perfume &amp; Fragrance</g:custom_label_0><g:condition>new</g:condition><g:description><![CDATA[<p>Founded by 4 young idealists in 1935, Finnish design company Artek produce modern furniture to promote the modern culture of habitation. Here they collaborate with <a href="/brands/comme-des-garcons-parfum">Comme des Garçons Parfum</a> to produce a fragrance dubbed 'Standard', ironic for a scent that is anything but.</p>
<span style="font-style:italic;">Notes Include:</span>
<ul>
<li>Thyme</li>
<li>Black Pepper</li>
<li>Patchouli</li>
<li>Cedar Wood</li>
<li>Citrus</li>
</ul>
<p>Due to recent changes in regulations, we are unable to ship aftershaves and perfumes to certain destinations outside of the EU. For full details, please email <a href="mailto:[email protected]?subject=Aftershave and Perfume Shipment">[email protected]</a> or call +44 191 231 3983.</p>]]></g:description><g:image_link>http://media.endclothing.com/media/catalog/product/1/8/18-03-2016_commedesgarcons_xartekstandardeaudetoilette_100ml_sh_1.jpg</g:image_link><g:additional_image_link>http://media.endclothing.com/media/catalog/product/1/8/18-03-2016_commedesgarcons_xartekstandardeaudetoilette_100ml_sh_2.jpg</g:additional_image_link><g:shipping><g:country>FR</g:country><g:service>DPD Priority Service</g:service><g:price>9.00 EUR</g:price></g:shipping><g:shipping><g:country>DE</g:country><g:service>DPD Priority Service</g:service><g:price>9.00 EUR</g:price></g:shipping><g:shipping><g:country>DK</g:country><g:service>DPD Priority Service</g:service><g:price>9.00 EUR</g:price></g:shipping><g:shipping><g:country>NL</g:country><g:service>DPD Priority Service</g:service><g:price>9.00 EUR</g:price></g:shipping><g:shipping><g:country>IT</g:country><g:service>DPD Priority Service</g:service><g:price>9.00 EUR</g:price></g:shipping><g:shipping><g:country>SE</g:country><g:service>DPD Priority Service</g:service><g:price>9.00 EUR</g:price></g:shipping><g:shipping><g:country>BE</g:country><g:service>DPD Priority Service</g:service><g:price>9.00 EUR</g:price></g:shipping><g:shipping><g:country>AT</g:country><g:service>DPD Priority Service</g:service><g:price>15.00 EUR</g:price></g:shipping><g:shipping><g:country>IE</g:country><g:service>Parcel Force Priority Service</g:service><g:price>15.00 EUR</g:price></g:shipping><g:shipping><g:country>ES</g:country><g:service>DPD Priority Service</g:service><g:price>15.00 EUR</g:price></g:shipping><g:shipping><g:country>LV</g:country><g:service>DPD Priority Service</g:service><g:price>19.00 
EUR</g:price></g:shipping><g:shipping><g:country>HR</g:country><g:service>DPD Priority Service</g:service><g:price>35.00 EUR</g:price></g:shipping><g:shipping><g:country>CY</g:country><g:service>FEDEX Priority Service</g:service><g:price>45.00 EUR</g:price></g:shipping><g:shipping><g:country>HU</g:country><g:service>DPD Priority Service</g:service><g:price>15.00 EUR</g:price></g:shipping><g:shipping><g:country>PT</g:country><g:service>DPD Priority Service</g:service><g:price>19.00 EUR</g:price></g:shipping><g:shipping><g:country>EE</g:country><g:service>DPD Priority Service</g:service><g:price>25.00 EUR</g:price></g:shipping><g:shipping><g:country>LU</g:country><g:service>DPD Priority Service</g:service><g:price>9.00 EUR</g:price></g:shipping><g:shipping><g:country>SK</g:country><g:service>DPD Priority Service</g:service><g:price>15.00 EUR</g:price></g:shipping><g:shipping><g:country>BG</g:country><g:service>DPD Priority Service</g:service><g:price>25.00 EUR</g:price></g:shipping><g:shipping><g:country>GR</g:country><g:service>FEDEX Priority Service</g:service><g:price>25.00 EUR</g:price></g:shipping><g:shipping><g:country>PL</g:country><g:service>DPD Priority Service</g:service><g:price>15.00 EUR</g:price></g:shipping><g:shipping><g:country>LT</g:country><g:service>DPD Priority Service</g:service><g:price>19.00 EUR</g:price></g:shipping><g:shipping><g:country>SI</g:country><g:service>DPD Priority Service</g:service><g:price>15.00 EUR</g:price></g:shipping><g:shipping><g:country>FI</g:country><g:service>Parcel Force Priority Service</g:service><g:price>19.00 EUR</g:price></g:shipping><g:shipping><g:country>CZ</g:country><g:service>DPD Priority Service</g:service><g:price>15.00 EUR</g:price></g:shipping><g:shipping><g:country>LI</g:country><g:service>FEDEX Priority Service</g:service><g:price>35.00 EUR</g:price></g:shipping><g:shipping><g:country>MC</g:country><g:service>DPD Priority Service</g:service><g:price>15.00 
EUR</g:price></g:shipping><g:shipping><g:country>CH</g:country><g:service>Parcel Force Priority Service</g:service><g:price>15.00 EUR</g:price></g:shipping></entry></feed>

I've tried the following code to read the stream, and while the separate parts seem to work just fine (data outputs a string which appears to be valid XML to me), Nokogiri appears to be unable to read the string, and just crashes or returns nothing for my xpath queries.

require 'open-uri'
require 'zlib'
require 'nokogiri'

url = "http://www.endclothing.com/media/end_feeds/eu.xml.gz"
stream = URI.open(url, 'Accept-Encoding' => 'gzip')
data = Zlib::GzipReader.new(stream).read
page = Nokogiri::XML(data)

page.xpath("//entry")
=> []

Upvotes: 0

Views: 220

Answers (1)

har07

Reputation: 89315

The XML declares a default namespace on the root element:

xmlns="http://www.w3.org/2005/Atom"

In XML, descendant elements without a prefix implicitly inherit the default namespace from their ancestors. So the entry element you are trying to select is in the root element's default namespace.

In XPath, on the other hand, an element name without a prefix is always treated as being in no namespace. To reference an element in the XML's default namespace, map a prefix to the default namespace URI and use that prefix in the XPath expression, for example:

page.xpath("//d:entry", 'd' => 'http://www.w3.org/2005/Atom')

Upvotes: 1
