Reputation: 2085
I want to get some data from the HMs website, using this scraper:
require 'nokogiri'
require 'open-uri'
require 'rmagick'
require 'mechanize'
product = "http://www2.hm.com/es_es/productpage.0250933004.html"
web = Nokogiri::HTML(open(product))
puts web.at_css('.product-item-headline').text
Nokogiri returns NIL for each selector and raises undefined method for nilClass
. I don't know if this particular website has something that can avoid scraping.
In the URL DOM, I can see there is a .product-item-headline
class, and I can fetch the info in the JavaScript console, but I can't with Nokogiri.
I tried targeting the whole body text, and this is the only thing I get printed.
var callcoremetrix = function(){cmSetClientID(getCoremetricsClientId(), true, "msp.hm.com", "hm.com");};
Maybe some JavaScript is ruining my scrape?
Upvotes: 1
Views: 1176
Reputation: 35483
One idea is to use IRB and go step by step:
irb
> require 'open-uri'
> html = open(product).read
Does the HTML contain the class name text?
> html =~ /product-item-headline/
=> 56099
Yes it does, and here's the line:
<h1 class="product-item-headline">
So try Nokogiri:
> require 'nokogiri'
web = Nokogiri::HTML(html)
=> success
Read the HTML text and try increasingly-broad queries related to your issue that take you nearer the top of the HTML, and see if they find results:
web.css("h1") # on line 2217 of the HTML
=> []
web.css(".product-detail-meta") # on line 2215
=> []
web.css(".wrapper") # on line 86
=> []
web.css("body") # on line 84
=> [#<Nokogiri::XML::Element …
This shows you there's a problem in the HTML. The parsing is disrupted between lines 84 and 86.
Let's guess that line 85 may be the issue: it is a <header>
tag, and we happen to know that doesn't contain your target, so we can delete it. Save the HTML to a file, then use any text editor to delete the tag and all its contents, then re-parse.
Does it work now?
web.css("h1") # on line 359 of the HTML
=> []
Nope. So we repeat this process, cutting down the HTML.
I also like to cut down the HTML by removing pieces that I know don't contain my target, such as the <head>
area, <footer>
areas, <script>
areas etc.
You may like to use an auto-indenting editor, because it can quickly show you that something is unbalanced with the HTML.
Eventually we find that the HTML has many incorrect tags, such as unclosed section tags.
You can solve this a variety of ways:
The pure way is to fix the unclosed section tags, any way to you want.
The hack way is to narrow the HTML to the area you know you need, which is in the h1 tag.
Here's the hack way:
area = html.match(/<h1 class="product-item-headline\b.*?<\/h1>/m)[0]
web = Nokogiri::HTML(area)
puts web.at_css(".product-item-headline").text.strip
=> "Funda de cojín de jacquard"
Heads up that the hack way isn't truly HTML-savvy, and you can see that it will fail if the HTML page author changes to use a different tag, or uses another class name before the class name you want, etc.
The best long-term solution is to contact the author of the HTML page and show him how to validate the HTML. A good site for this is http://validator.w3.org/ -- when you validate your URL, the site shows 100 errors and 6 warnings, and explains each one and how to solve it.
Upvotes: 3