Grabbing URL with nokogiri

Question

I'm trying to grab some information from a website. I have a script that I've written and adapted for a few different websites, but this one in particular is causing me grief!

The script reads through the categories and builds an array of pages to open, then opens each page and is supposed to grab information about each product on each category page. Building the array still works fine; the markup on this website is just so different that the rest of the script behaves differently.

I need to read from this markup


        
            

        PRODUCT NAME 1

        PRICE
    

    
        

    PRODUCT NAME 2

    PRICE

    
        

    PRODUCT NAME 3

    PRICE

    
        

    PRODUCT NAME

    PRICE

My script:

Each product is within an <li> tag

page.css('li').each do |product|
  # ...
end

I can pick up the product name with

product.css('.product-title').text.strip

Usually, to grab the product URL, I'd select the element that contains the link, read its href, and use gsub to strip the newline, like this:

product.css('.product-title')[:href].gsub(/\n/,"")

In this case, I'm getting

./script.rb:52:in []: no implicit conversion of Symbol into Integer (TypeError)
    from ./script.rb:52:in block in <main>
    from /usr/local/lib/ruby/gems/2.0.0/gems/nokogiri-1.6.3.1/lib/nokogiri/xml/node_set.rb:237:in block in each
    from /usr/local/lib/ruby/gems/2.0.0/gems/nokogiri-1.6.3.1/lib/nokogiri/xml/node_set.rb:236:in upto
    from /usr/local/lib/ruby/gems/2.0.0/gems/nokogiri-1.6.3.1/lib/nokogiri/xml/node_set.rb:236:in each
    from ./script.rb:39:in <main>

How can I get it to read the href? I can't work out why it's throwing this error, when it usually works with different websites.

Stefan · Accepted Answer

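product.css('.product-title') returns a NodeSet, similar to an array. Its [] method expects an integer index, so [:href] raises the TypeError you're seeing; the href attribute has to be read from an individual node inside the set.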

Either use first or [0] to get the first element:

product.css('.product-title').first['href'] #=> "http://www.DOMAIN/producturl"
product.css('.product-title')[0]['href']    #=> "http://www.DOMAIN/producturl"

or the at_css shortcut:

product.at_css('.product-title')['href']    #=> "http://www.DOMAIN/producturl"

A more complete example:

require 'nokogiri'

page = Nokogiri::HTML(<<-HTML)
  <ul>
    <li>
      <a class="product-title" href="http://www.DOMAIN/producturl">PRODUCT NAME</a>
      PRICE
    </li>
  </ul>
HTML

page.css('li').each do |product|
  puts product.at_css('.product-title')['href']
end

Output:

http://www.DOMAIN/producturl
