Josh

Reputation: 17

How to bypass network errors while using Ruby Mechanize web crawling

I am using the Ruby Mechanize web crawler to pull data from popular real estate websites, using the home address as the keyword to scrape public data on Zillow, Redfin, etc. I'm basically trying to bypass any HTTP and network errors, but the following rescue logic doesn't seem to do the job.

def scrape_single(key_word)
    #setup agent
    agent = Mechanize.new{ |agent|
        agent.user_agent_alias = 'Mac Safari'
    }
    agent.ignore_bad_chunking = true
    agent.verify_mode = OpenSSL::SSL::VERIFY_NONE 
    agent.request_headers = { "Accept-Encoding" => ""}
    agent.follow_meta_refresh = true
    agent.keep_alive = false

    #page setup
    begin
      agent.get(@@search_engine) do |page|
        @@search_result = page.form('f') do |search|
          search.q = key_word
        end.submit
      end 
    rescue Timeout::Error
      puts "Timeout"
      retry
    rescue Net::HTTPGatewayTimeOut => e
      if e.response_code == '504' || '502'
        e.skip
        sleep 5
      end
    rescue Net::HTTPBadGateway  => e
      if e.response_code == '504' || '502'
        e.skip
        sleep 5
      end
    rescue Net::HTTPNotFound => e
      if e.response_code == '404'
        e.skip
        sleep 5
      end
    rescue Net::HTTPFatalError => e
      if e.response_code == '503'
        e.skip
      end
    rescue Mechanize::ResponseCodeError => e
      if e.response_code == '404'
        e.skip
        sleep 5
      elsif e.response_code == '502'
        e.skip
        sleep 5
      else
        retry
      end
    rescue Errno::ETIMEDOUT
      retry
    end

    return @@search_result      # returns Mechanize::Page
  end 

The following is an example of error message I get for a keyword with an address in MA.

/home/ec2-user/.gem/ruby/2.1/gems/mechanize-2.7.5/lib/mechanize/http/agent.rb:323:in `fetch': 404 => Net::HTTPNotFound for https://www.redfin.com/MA/WASHINGTON/306-WERDEN-RD-Unknown/home/134059623 -- unhandled response (Mechanize::ResponseCodeError)

The actual message you see when you input the above URL is:

Cannot GET /MA/WASHINGTON/306-WERDEN-RD-Unknown/home/134059623

My goal is simply to ignore and skip these sporadic errors and move on to the next keyword. I couldn't really find a working solution online, and any feedback would be greatly appreciated.

Upvotes: 1

Views: 943

Answers (1)

Makushimaru

Reputation: 111

If I understand correctly, the error raised is Mechanize::ResponseCodeError, and it clearly carries a '404' response_code, but your script isn't handling that case from Mechanize::ResponseCodeError the way you intend. You could group the response codes you want to skip into one list:

all_response_code = ['403', '404', '502']

rescue Mechanize::ResponseCodeError => e
  if all_response_code.include? e.response_code
    e.skip
    sleep 5
  else
    retry
  end

Maybe adding a condition for the 404 response_code will do the trick.

EDIT: I changed the code a little bit in order to have fewer lines.
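As a minimal sketch of the skip-vs-retry decision above (the helper name and the retry cap are illustrative, not part of Mechanize or the code above; note that Mechanize's `response_code` is a String, so the list holds strings):

```ruby
# Hypothetical helper sketching the idea: group the HTTP codes you want
# to skip, retry everything else, and cap retries so a dead URL can't
# loop forever. Codes are Strings, matching Mechanize's e.response_code.
SKIPPABLE_CODES = %w[403 404 502 504].freeze

def handle_response_code(code, attempts, max_retries: 3)
  if SKIPPABLE_CODES.include?(code)
    :skip      # ignore this keyword and move on to the next one
  elsif attempts < max_retries
    :retry     # possibly transient failure: try again
  else
    :give_up   # retry budget exhausted, stop looping
  end
end
```

Inside the rescue clause you would then call something like `handle_response_code(e.response_code, attempts)` and either `e.skip`/`next` or `retry` based on the result, incrementing `attempts` on each retry.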

Upvotes: 1
