Reputation: 889
As you may have seen through a series of earlier questions, I have built a little Mechanize task that visits a page, finds the links to cafes, and saves the details of each cafe to a CSV.
task :estimateone => :environment do
  require 'mechanize'
  require 'csv'

  mechanize = Mechanize.new
  mechanize.history_added = Proc.new { sleep 30.0 }
  mechanize.ignore_bad_chunking = true
  mechanize.follow_meta_refresh = true

  page = mechanize.get('http://www.siteexamplea.com/city/list/50-city-cafes-you-should-have-eaten-breakfast-at')

  results = []
  results << ['name', 'streetAddress', 'addressLocality', 'postalCode', 'addressRegion', 'addressCountry', 'telephone', 'url']

  page.css('ol li a').each do |link|
    mechanize.click(link)
    name            = mechanize.page.css('article h1[itemprop="name"]').text.strip
    streetAddress   = mechanize.page.css('address span span[itemprop="streetAddress"]').text.strip
    addressLocality = mechanize.page.css('address span span[itemprop="addressLocality"]').text.strip
    postalCode      = mechanize.page.css('address span span[itemprop="postalCode"]').text.strip
    addressRegion   = mechanize.page.css('address span span[itemprop="addressRegion"]').text.strip
    addressCountry  = mechanize.page.css('address span meta[itemprop="addressCountry"]').text.strip
    telephone       = mechanize.page.css('address span[itemprop="telephone"]').text.strip
    url             = mechanize.page.css('article p a[itemprop="url"]').text.strip
    results << [name, streetAddress, addressLocality, postalCode, addressRegion, addressCountry, telephone, url]
  end

  CSV.open("filename.csv", "w+") do |csv_file|
    results.each do |row|
      csv_file << row
    end
  end
end
When I get to the tenth link, I hit a 503 error:
Mechanize::ResponseCodeError: 503 => Net::HTTPServiceUnavailable for https://www.city.com/city/directory/morning-after -- unhandled response
I have tried a couple of things to stop this from happening, or to rescue from this state, but I can't work it out. Any tips?
Upvotes: 1
Views: 481
Reputation: 1701
You'd want to rescue the failed request. Since the 503 is raised when you fetch each cafe page, the begin/rescue belongs inside the link loop, like this:
task :estimateone => :environment do
  require 'mechanize'
  require 'csv'
  # ...
  page = mechanize.get('http://www.theurbanlist.com/brisbane/a-list/50-brisbane-cafes-you-should-have-eaten-breakfast-at')
  page.css('ol li a').each do |link|
    begin
      mechanize.click(link)
      # ... scrape the fields as before ...
    rescue Mechanize::ResponseCodeError
      # do something with the result: log it, write it, mark the link as
      # failed, wait a bit, then continue with the next link
      next
    end
  end
end
My guess is that you're hitting API rate limits. Rescuing won't solve that by itself, since the limit is enforced on the server's side, not yours; but it gives you room to work, because you can now flag the links that failed and continue the job from there.
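Since a 503 from rate limiting is often transient, another option is to retry a failed request a few times with a growing delay before giving up. A minimal sketch, assuming that's the cause; the `with_retries` helper and the delay values are my own, not part of Mechanize:

  # Hypothetical helper: run a block, retrying on error with an increasing delay.
  # Re-raises the last error once the attempts are exhausted.
  def with_retries(attempts: 3, base_delay: 30)
    attempts.times do |i|
      begin
        return yield
      rescue StandardError
        raise if i == attempts - 1     # out of retries: let the error propagate
        sleep(base_delay * (i + 1))    # back off: 30s, 60s, 90s...
      end
    end
  end

In your loop that would look like `with_retries { mechanize.click(link) }`, with the `rescue Mechanize::ResponseCodeError` still catching links that fail every attempt (`Mechanize::ResponseCodeError` descends from `StandardError`, so the helper retries it).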
Upvotes: 1