meowmixplzdeliver

Reputation: 209

Iteration in Mechanize to crawl a page

I want to use Mechanize to crawl some web pages and save information from them.

The page is Lookbook's North America page (http://lookbook.nu/north-america).

I want to iterate through the ul with id="looks" and, inside that iteration, follow the link to every user listed in the looks. Each user link looks something like this:

<a href="/luciamouet" data-page-track="user name click" data-track="user name click | byline" target="_blank" title="Lucia Mouet">Lucia M.</a>

I want to visit each user's page and store some information from it.

This is what I have so far, but I'm stumped on how to iterate over the links and follow each one:

require 'rubygems'
require 'mechanize'
require 'nokogiri'
require 'open-uri'

agent = Mechanize.new

page = agent.get('http://lookbook.nu/north-america')

looks = page.parser.css('#looks p')

looks.each do |x|
  puts x
end

Upvotes: 0

Views: 547

Answers (2)

pguardiario

Reputation: 54984

Rather than concatenating base + path as suggested by @radubogdan, resolve each href against page.uri:

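page.search('#looks h1 a').each do |a|
  # Resolve the (possibly relative) href against the current page's URL
  url = page.uri.merge a[:href]
  # Fetch the user's page and print its title
  page2 = agent.get url
  puts page2.title
end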
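
To see what page.uri.merge does without making any requests, here is a minimal sketch using only Ruby's standard uri library. It assumes page.uri is a URI object for the listing page; resolve_hrefs is a hypothetical helper name, and the example.com href is made up to show that absolute links pass through unchanged.

```ruby
require 'uri'

# Resolve each href against the page's URI, the same way page.uri.merge does.
# resolve_hrefs is a hypothetical helper name, not part of Mechanize.
def resolve_hrefs(page_uri, hrefs)
  hrefs.map { |href| page_uri.merge(href).to_s }
end

# Stand-in for page.uri on the listing page
page_uri = URI.parse('http://lookbook.nu/north-america')

# The first two hrefs appear in this thread; the absolute one is hypothetical
hrefs = ['/luciamouet', '/user/1069907-Veronica-P', 'http://example.com/other']

puts resolve_hrefs(page_uri, hrefs)
```

Because merge handles both relative and absolute hrefs, the loop does not need to know the site's base URL or check which kind of link it found.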

Upvotes: 1

radubogdan

Reputation: 2834

You have everything you need to construct the detail page URL. Grab the relative URL (I will call it path), prepend the base URL, and make a new request.

require 'mechanize'

agent = Mechanize.new
agent.pluggable_parser.default = Mechanize::Page

base = 'http://lookbook.nu'
page = agent.get(base + '/north-america')

detail_pages = page.search("//div[contains(@class, 'look_meta_container')]/p/a[1]/@href").map(&:text)
# ["/user/1069907-Veronica-P", "/elliott_alexzander", "/neno", "/skirtsofurban", "/tovogueorbust", "/dthutt", "/ryapie", "/lovebetweentheracks", "/lonleyboy", "/bobbyraffin", "/tsangtastic", "/user/737385-Katia-H"]

detail_pages.each do |path|
  # Request each user's page by joining the base URL and the relative path
  user_page = agent.get(base + path)

  name = user_page.search("//div[@id='userheader']//h1/a").text
  fans = user_page.search("//span[contains(text(), 'Fans')]/../span[1]").text

  puts "#{name} have #{fans} fans"
end

=>

Veronica  P have 26,044 fans
Elliott Alexzander have 3,409 fans
Neno Neno have 15,304 fans
Laura P have 975 fans
Alexandra G. have 620 fans
Dayeanne  Hutton have 336 fans
Mariah Alysz have 288 fans
Lina Dinh have 11,675 fans
Talal Amine have 882 fans
Bobby Raffin have 72,469 fans
Jenny Tsang have 8,909 fans
Katia H. have 282 fans
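
Since the question also asks about storing the information, here is a minimal sketch of turning the scraped strings into records. It assumes name and fans arrive as plain strings shaped like the output above; parse_count and build_record are hypothetical helper names, and JSON is just one possible storage format.

```ruby
require 'json'

# Convert a scraped count such as "26,044" to an Integer.
# parse_count is a hypothetical helper name.
def parse_count(text)
  text.delete(',').to_i
end

# Build one record per user; squeeze collapses the doubled spaces
# that appear in some scraped names.
def build_record(name, fans)
  { 'name' => name.strip.squeeze(' '), 'fans' => parse_count(fans) }
end

# Two sample rows copied from the output above
records = [
  build_record('Veronica  P', '26,044'),
  build_record('Bobby Raffin', '72,469')
]

puts JSON.pretty_generate(records)
```

Inside the loop you would call build_record(name, fans) for each user and write the collected array to a file once the crawl finishes.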

Note: I set agent.pluggable_parser.default so that the response is parsed as a Mechanize::Page. You usually don't need that, but this site doesn't set the Content-Type header correctly.

Upvotes: 1
