Reputation: 109
I'm trying to grab some information from a website. I have a script that I've written and adapted for a few different websites, but this one in particular is causing me grief!
The script reads through the categories and builds an array of pages to open, then opens each page. It's then supposed to grab info from each product on each category page. Building the array still works fine; it just seems the markup on this website is so different that the script reacts differently.
I need to read from this markup:
<li>
<a class="product-link" href="http://www.DOMAIN/producturl_1">
<img class='product_image' src="image/path_1.jpg" title="" alt="PRODUCT NAME" /></a>
<a class="product-title" href="http://www.DOMAIN/producturl_1">PRODUCT NAME 1</a>
<span>PRICE</span>
</li><!----><li>
<a class="product-link" href="http://www.DOMAIN/producturl_2">
<img class='product_image' src="image/path_2.jpg" title="" alt="PRODUCT NAME 2" /></a>
<a class="product-title" href="http://www.DOMAIN/producturl">PRODUCT NAME 2</a>
<span>PRICE</span>
</li><!----><li>
<a class="product-link" href="http://www.DOMAIN/producturl_3">
<img class='product_image' src="image/path_3.jpg" title="" alt="PRODUCT NAME 3" /></a>
<a class="product-title" href="http://www.DOMAIN/producturl_3">PRODUCT NAME 3</a>
<span>PRICE</span>
</li><!----><li>
<a class="product-link" href="http://www.DOMAIN/producturl">
<img class='product_image' src="image/path.jpg" title="" alt="PRODUCT NAME" /></a>
<a class="product-title" href="http://www.DOMAIN/producturl">PRODUCT NAME</a>
<span>PRICE</span>
</li>
My script:
Each product is within a <li> tag:
page.css('li').each do |product|
  # ...
end
I can pick up the product name with
product.css('.product-title').text.strip
Usually, to grab the product URL, I'd select the tag the URL sits in and use something like this to read the href, with gsub removing any newline:
product.css('.product-title')[:href].gsub(/\n/,"")
In this case, I'm getting
./script.rb:52:in []: no implicit conversion of Symbol into Integer (TypeError)
from ./script.rb:52:in block in <main>
from /usr/local/lib/ruby/gems/2.0.0/gems/nokogiri-1.6.3.1/lib/nokogiri/xml/node_set.rb:237:in block in each
from /usr/local/lib/ruby/gems/2.0.0/gems/nokogiri-1.6.3.1/lib/nokogiri/xml/node_set.rb:236:in upto
from /usr/local/lib/ruby/gems/2.0.0/gems/nokogiri-1.6.3.1/lib/nokogiri/xml/node_set.rb:236:in each
from ./script.rb:39:in <main>
How can I get it to read the href? I can't work out why it's throwing this error when the same approach usually works on other websites.
Upvotes: 0
Views: 182
Reputation: 114268
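product.css('.product-title') returns a NodeSet, similar to an array. Its [] method expects an integer position rather than an attribute name, which is why passing :href raises the TypeError.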
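The same kind of failure can be reproduced without Nokogiri. The sketch below is only an analogy: a plain Array stands in for the NodeSet and a Hash stands in for a single element, and the variable names are made up for illustration.

```ruby
# Plain-Ruby analogy: an Array plays the role of the NodeSet,
# and a Hash plays the role of one matched element.
links = [{ 'href' => 'http://www.DOMAIN/producturl' }]

# Indexing the collection with a Symbol fails, because the collection's []
# expects an integer position, not an attribute name.
error_message =
  begin
    links[:href]
  rescue TypeError => e
    e.message
  end
puts error_message

# Picking one element first, then asking that element for the attribute, works.
first_href = links.first['href']
puts first_href
```

The fix for the real script follows the same pattern: select one element from the collection, then read its attribute.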
Either use first or [0] to get the first element:
product.css('.product-title').first['href'] #=> "http://www.DOMAIN/producturl"
product.css('.product-title')[0]['href'] #=> "http://www.DOMAIN/producturl"
or the at_css shortcut:
product.at_css('.product-title')['href'] #=> "http://www.DOMAIN/producturl"
A more complete example:
require 'nokogiri'
page = Nokogiri::HTML(<<-HTML)
<li>
<a class="product-link" href="http://www.DOMAIN/producturl">
<img class='product_image' src="image/path.jpg" title="" alt="PRODUCT NAME" />
</a>
<a class="product-title" href="http://www.DOMAIN/producturl">PRODUCT NAME</a>
<span>PRICE</span>
</li>
HTML
page.css('li').each do |product|
  puts product.at_css('.product-title')['href']
end
Output:
http://www.DOMAIN/producturl
Upvotes: 1