Reputation: 17
I am using the Ruby Mechanize web crawler to pull data from popular real estate websites, using home addresses as keywords to scrape the public data on Zillow, Redfin, etc. I'm basically trying to get past any HTTP and network errors, but the following rescue logic doesn't seem to do the job.
def scrape_single(key_word)
  #setup agent
  agent = Mechanize.new{ |agent|
    agent.user_agent_alias = 'Mac Safari'
  }
  agent.ignore_bad_chunking = true
  agent.verify_mode = OpenSSL::SSL::VERIFY_NONE
  agent.request_headers = { "Accept-Encoding" => ""}
  agent.follow_meta_refresh = true
  agent.keep_alive = false
  #page setup
  begin
    agent.get(@@search_engine) do |page|
      @@search_result = page.form('f') do |search|
        search.q = key_word
      end.submit
    end
  rescue Timeout::Error
    puts "Timeout"
    retry
  rescue Net::HTTPGatewayTimeOut => e
    if e.response_code == '504' || '502'
      e.skip
      sleep 5
    end
  rescue Net::HTTPBadGateway => e
    if e.response_code == '504' || '502'
      e.skip
      sleep 5
    end
  rescue Net::HTTPNotFound => e
    if e.response_code == '404'
      e.skip
      sleep 5
    end
  rescue Net::HTTPFatalError => e
    if e.response_code == '503'
      e.skip
    end
  rescue Mechanize::ResponseCodeError => e
    if e.response_code == '404'
      e.skip
      sleep 5
    elsif e.response_code == '502'
      e.skip
      sleep 5
    else
      retry
    end
  rescue Errno::ETIMEDOUT
    retry
  end
  return @@search_result # returns Mechanize::Page
end
The following is an example of the error message I get for a keyword with an address in MA:
/home/ec2-user/.gem/ruby/2.1/gems/mechanize-2.7.5/lib/mechanize/http/agent.rb:323:in `fetch': 404 => Net::HTTPNotFound for https://www.redfin.com/MA/WASHINGTON/306-WERDEN-RD-Unknown/home/134059623 -- unhandled response (Mechanize::ResponseCodeError)
The actual message you see when you input the above URL is:
Cannot GET /MA/WASHINGTON/306-WERDEN-RD-Unknown/home/134059623
My goal is simply to ignore and skip sporadic errors and move on to the next keyword. I couldn't find a working solution online, and any feedback would be greatly appreciated.
Upvotes: 1
Views: 943
Reputation: 111
If I understand correctly, the error raised is Mechanize::ResponseCodeError, and it clearly carries a 404 response code. But your script doesn't seem to catch the 404 response code from Mechanize::ResponseCodeError:
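all_response_code = ['403', '404', '502']
rescue Mechanize::ResponseCodeError => e
  # e.response_code is a string such as '404'
  if all_response_code.include?(e.response_code)
    # Mechanize::ResponseCodeError has no skip method as far as I know,
    # so just pause and fall through to move on to the next keyword
    sleep 5
  else
    retry
  end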
Maybe if you add a condition for the 404 response code, it will do the trick.
EDIT: I changed the code a little so that it has fewer lines.
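To make the "skip and move on" idea concrete, here is a minimal sketch of a wrapper you could put around each keyword. The helper name `fetch_or_skip`, the `SKIP_CODES` list, and the attempt limit are my own assumptions, not part of Mechanize. It only relies on the error object exposing `response_code` (which Mechanize::ResponseCodeError does), and it caps retries so a permanent failure can't loop forever the way a bare `retry` can.

```ruby
require 'timeout'

# Hypothetical list of status codes to skip; adjust to taste.
SKIP_CODES = %w[403 404 502 503 504].freeze

# Runs the block and returns its value, or nil when the keyword
# should be skipped. Timeouts are retried up to max_attempts times.
def fetch_or_skip(max_attempts: 3, pause: 5)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue Timeout::Error, Errno::ETIMEDOUT, Errno::ECONNRESET => e
    if attempts < max_attempts
      sleep pause
      retry
    end
    warn "giving up after #{attempts} attempts: #{e.class}"
    nil
  rescue StandardError => e
    # Anything that is not an HTTP status error on the skip list is re-raised.
    skippable = e.respond_to?(:response_code) &&
                SKIP_CODES.include?(e.response_code.to_s)
    raise unless skippable
    warn "skipping: HTTP #{e.response_code}"
    nil
  end
end

# Possible usage with the question's method (not run here):
#   pages = keywords.map { |kw| fetch_or_skip { scrape_single(kw) } }.compact
```

Returning nil for a skipped keyword keeps the calling loop simple: it can drop the nils with `compact` and carry on with the pages that did load.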
Upvotes: 1