OrelFligelman

Reputation: 9

Web Scraping with Nokogiri and Mechanize

I am parsing prada.com and would like to scrape the data in the elements with class "nextItem" to get each item's name and price. Here is my code:

require 'rubygems'
require 'mechanize'
require 'nokogiri'
require 'open-uri'
agent = Mechanize.new
page = agent.get('http://www.prada.com/en/US/e-store/department/woman/handbags.html?cmp=from_home')
fp = File.new('prada_prices','w')
html_doc = Nokogiri::HTML(page)
page = html_doc.xpath("//ol[@class='nextItem']")
page.each do {|i| fp.write(i.text + "\n")}
end

I get an error and no output. What I think I am doing is instantiating a Mechanize object and calling it agent. Then creating a page variable and assigning it the URL provided. Then creating a variable that is a Nokogiri object with the Mechanize URL passed in. Then searching the URL for all class references that are titled nextItem. Then printing all the data contained there.

Can someone show me where I might have gone wrong?

Upvotes: 0

Views: 2850

Answers (2)

hahcho

Reputation: 1409

Here are the wrong parts:

  • Check the block syntax again: use {} or do/end, but not both at the same time.
  • Mechanize#get returns a Mechanize::Page, which acts like a Nokogiri document; at the very least it has search, xpath, and css. Use them instead of trying to coerce the page into a Nokogiri::HTML object.
  • There is no need to require 'open-uri' or 'nokogiri' when you are not using them directly.
  • Finally, consider brushing up on Ruby's basics before continuing with web scraping.
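
The block-syntax point can be tried out without any scraping. Below is a minimal sketch using only Ruby's standard library; the method names, the sample strings, and the StringIO objects standing in for the output file are assumptions made for illustration, not part of the original code.

```ruby
require 'stringio'

# Brace form: conventional for one-line blocks.
def write_with_braces(lines, io)
  lines.each { |line| io.write(line + "\n") }
end

# do/end form: conventional for multi-line blocks.
def write_with_do_end(lines, io)
  lines.each do |line|
    io.write(line + "\n")
  end
end

# Hypothetical sample data standing in for the text of scraped nodes.
sample = ['first item', 'second item']

braces_out = StringIO.new
write_with_braces(sample, braces_out)

do_end_out = StringIO.new
write_with_do_end(sample, do_end_out)

# Both forms write identical output.
puts braces_out.string == do_end_out.string
```

Either form works on its own. Combining them, as in `each do {|i| ... } end`, is a syntax error because the parser does not treat the braces after `do` as a block.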

Here is the code with fixes:

require 'rubygems'
require 'mechanize'

agent = Mechanize.new
page = agent.get('http://www.prada.com/en/US/e-store/department/woman/handbags.html?cmp=from_home')

# File.open with a block closes the file automatically, even if an error occurs.
File.open('prada_prices', 'w') do |fp|
  # Mechanize::Page#search accepts XPath (or CSS) expressions directly.
  page.search("//ol[@class='nextItem']").each do |i|
    fp.write(i.text + "\n")
  end
end

Upvotes: 0

davegson

Reputation: 8331

Since Prada's website dynamically loads its content via JavaScript, it will be hard to scrape its content. See "Scraping dynamic content in a website" for more information.

Generally speaking, with Mechanize, after you get a page:

page = agent.get(page_url)

you can easily search items with CSS selectors and scrape for data:

next_items = page.search(".fooClass")

next_items.each do |item|
  price = item.search(".fooPrice").text
end

Then simply handle the strings or generate hashes as you desire.
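
As a rough illustration of that last step, here is a minimal sketch in plain Ruby that cleans up raw strings and builds one hash per item. The helper names (parse_price, build_item) and the sample strings are hypothetical, and the price parsing assumes US-style formatting with a comma as the thousands separator.

```ruby
# Turn a raw price string such as "$ 1,950" into a Float,
# or return nil if the string contains no number.
def parse_price(raw)
  match = raw.delete(',').match(/\d+(?:\.\d+)?/)
  match ? match[0].to_f : nil
end

# Build one hash per item from the raw name and price strings,
# collapsing stray whitespace in the name.
def build_item(raw_name, raw_price)
  { name: raw_name.strip.gsub(/\s+/, ' '), price: parse_price(raw_price) }
end

# Hypothetical strings, shaped like what item.search(...).text might return.
raw_items = [
  ["  Sample Tote \n", "$ 1,950"],
  ["Sample\n  Clutch", "$ 895.50"]
]

items = raw_items.map { |name, price| build_item(name, price) }
items.each { |item| puts "#{item[:name]}: #{item[:price]}" }
```

Returning nil for an unparseable price, rather than 0.0, keeps missing data distinguishable from a real price when the hashes are processed later.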

Upvotes: 2
