James

Reputation: 109

Grabbing URL with nokogiri

I'm trying to grab some information from a website. I have a script that I've written and adapted for a few different websites, but this one in particular is causing me grief!

The script reads through the categories and builds an array of pages to open, then opens each page. It's then supposed to grab info from each product on each category page. Building the array still works fine; the markup on this website just seems to be different enough that the rest of the script reacts differently.

I need to read from this markup:

<li>
        <a class="product-link" href="http://www.DOMAIN/producturl_1">
            <img class='product_image' src="image/path_1.jpg" title=""  alt="PRODUCT NAME"  /></a>

        <a class="product-title" href="http://www.DOMAIN/producturl_1">PRODUCT NAME 1</a>

        <span>PRICE</span>
    </li><!----><li>
    <a class="product-link" href="http://www.DOMAIN/producturl_2">
        <img class='product_image' src="image/path_2.jpg" title=""  alt="PRODUCT NAME 2"  /></a>

    <a class="product-title" href="http://www.DOMAIN/producturl">PRODUCT NAME 2</a>

    <span>PRICE</span>
</li><!----><li>
    <a class="product-link" href="http://www.DOMAIN/producturl_3">
        <img class='product_image' src="image/path_3.jpg" title=""  alt="PRODUCT NAME 3"  /></a>

    <a class="product-title" href="http://www.DOMAIN/producturl_3">PRODUCT NAME 3</a>

    <span>PRICE</span>
</li><!----><li>
    <a class="product-link" href="http://www.DOMAIN/producturl">
        <img class='product_image' src="image/path.jpg" title=""  alt="PRODUCT NAME"  /></a>

    <a class="product-title" href="http://www.DOMAIN/producturl">PRODUCT NAME</a>

    <span>PRICE</span>
</li>

My script:

Each product is within a <li> tag:

page.css('li').each do |product|
  # ...
end

I can pick up the product name with:

product.css('.product-title').text.strip

Then, to grab the product URL, I'd usually select the tag that contains the URL and use something like this to read the href, with gsub to get rid of the newline:

product.css('.product-title')[:href].gsub(/\n/,"")

In this case, I'm getting:

./script.rb:52:in []: no implicit conversion of Symbol into Integer (TypeError)
    from ./script.rb:52:in block in <main>
    from /usr/local/lib/ruby/gems/2.0.0/gems/nokogiri-1.6.3.1/lib/nokogiri/xml/node_set.rb:237:in block in each
    from /usr/local/lib/ruby/gems/2.0.0/gems/nokogiri-1.6.3.1/lib/nokogiri/xml/node_set.rb:236:in upto
    from /usr/local/lib/ruby/gems/2.0.0/gems/nokogiri-1.6.3.1/lib/nokogiri/xml/node_set.rb:236:in each
    from ./script.rb:39:in <main>

How can I get it to read the href? I can't work out why it's throwing this error when it usually works on other websites.

Upvotes: 0

Views: 182

Answers (1)

Stefan

Reputation: 114268

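product.css('.product-title') returns a NodeSet, which behaves like an array of nodes. Its [] method therefore expects an integer index, not an attribute name such as :href, which is why the TypeError is raised.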

Either use first or [0] to get the first element:

product.css('.product-title').first['href'] #=> "http://www.DOMAIN/producturl"
product.css('.product-title')[0]['href']    #=> "http://www.DOMAIN/producturl"

or the at_css shortcut:

product.at_css('.product-title')['href']    #=> "http://www.DOMAIN/producturl"

A more complete example:

require 'nokogiri'

page = Nokogiri::HTML(<<-HTML)
<li>
  <a class="product-link" href="http://www.DOMAIN/producturl">
    <img class='product_image' src="image/path.jpg" title=""  alt="PRODUCT NAME"  />
  </a>
  <a class="product-title" href="http://www.DOMAIN/producturl">PRODUCT NAME</a>
  <span>PRICE</span>
</li>
HTML

page.css('li').each do |product|
  puts product.at_css('.product-title')['href']
end

Output:

http://www.DOMAIN/producturl

Upvotes: 1
