Josh

Reputation: 17

How to bypass network errors while using Ruby Mechanize web crawling

I am using the Ruby Mechanize web crawler to pull data from popular real estate websites, using the home address as the keyword to scrape public data on Zillow, Redfin, etc. I'm basically trying to bypass any HTTP and network errors, but the following rescue logic doesn't seem to do the job.

def scrape_single(key_word)
    #setup agent
    agent = Mechanize.new{ |agent|
        agent.user_agent_alias = 'Mac Safari'
    }
    agent.ignore_bad_chunking = true
    agent.verify_mode = OpenSSL::SSL::VERIFY_NONE 
    agent.request_headers = { "Accept-Encoding" => ""}
    agent.follow_meta_refresh = true
    agent.keep_alive = false

    #page setup
    begin
      agent.get(@@search_engine) do |page|
        @@search_result = page.form('f') do |search|
          search.q = key_word
        end.submit
      end 
    rescue Timeout::Error
      puts "Timeout"
      retry
    rescue Net::HTTPGatewayTimeOut => e
      if e.response_code == '504' || '502'
        e.skip
        sleep 5
      end
    rescue Net::HTTPBadGateway  => e
      if e.response_code == '504' || '502'
        e.skip
        sleep 5
      end
    rescue Net::HTTPNotFound => e
      if e.response_code == '404'
        e.skip
        sleep 5
      end
    rescue Net::HTTPFatalError => e
      if e.response_code == '503'
        e.skip
      end
    rescue Mechanize::ResponseCodeError => e
      if e.response_code == '404'
        e.skip
        sleep 5
      elsif e.response_code == '502'
        e.skip
        sleep 5
      else
        retry
      end
    rescue Errno::ETIMEDOUT
      retry
    end

    return @@search_result      # returns Mechanize::Page
  end 

The following is an example of error message I get for a keyword with an address in MA.

/home/ec2-user/.gem/ruby/2.1/gems/mechanize-2.7.5/lib/mechanize/http/agent.rb:323:in `fetch': 404 => Net::HTTPNotFound for https://www.redfin.com/MA/WASHINGTON/306-WERDEN-RD-Unknown/home/134059623 -- unhandled response (Mechanize::ResponseCodeError)

The actual message you see when you input the above URL is:

Cannot GET /MA/WASHINGTON/306-WERDEN-RD-Unknown/home/134059623

My goal is simply to ignore and skip these sporadic errors and move on to the next keyword. I couldn't really find a working solution online, and any feedback would be greatly appreciated.

Upvotes: 1

Views: 943

Answers (1)

Makushimaru

Reputation: 111

If I understand correctly, the error raised is Mechanize::ResponseCodeError, and it clearly carries a '404' response_code, but your script isn't handling that case from Mechanize::ResponseCodeError the way you intend. You could group the response codes you want to skip into one list:

all_response_code = ['403', '404', '502']

rescue Mechanize::ResponseCodeError => e
  if all_response_code.include? e.response_code
    e.skip
    sleep 5
  else
    retry
  end

Maybe adding a condition for the 404 response_code will do the trick.

EDIT: I changed the code a little bit in order to have fewer lines.
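As a minimal sketch of the skip-vs-retry decision above (the helper name and the retry cap are illustrative, not part of Mechanize or the code above; note that Mechanize's `response_code` is a String, so the list holds strings):

```ruby
# Hypothetical helper sketching the idea: group the HTTP codes you want
# to skip, retry everything else, and cap retries so a dead URL can't
# loop forever. Codes are Strings, matching Mechanize's e.response_code.
SKIPPABLE_CODES = %w[403 404 502 504].freeze

def handle_response_code(code, attempts, max_retries: 3)
  if SKIPPABLE_CODES.include?(code)
    :skip      # ignore this keyword and move on to the next one
  elsif attempts < max_retries
    :retry     # possibly transient failure: try again
  else
    :give_up   # retry budget exhausted, stop looping
  end
end
```

Inside the rescue clause you would then call something like `handle_response_code(e.response_code, attempts)` and either `e.skip`/`next` or `retry` based on the result, incrementing `attempts` on each retry.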

Upvotes: 1
