barnett

Reputation: 1602

403 Error with Mechanize on Heroku

When using Mechanize to pull some data from Craigslist, I keep getting the following error on Heroku: status: Net::HTTPForbidden 1.1 403 Forbidden

What are some ways to prevent this from happening? My setup is below:

agent = Mechanize.new do |agent|
  agent.log              = @logger
  agent.user_agent_alias = 'Mac Safari'
  agent.robots           = false
end

Any ideas?

Upvotes: 1

Views: 2245

Answers (2)

bkunzi01

Reputation: 4561

Figured I'd make this a bit cleaner. I had the same issue which I was able to resolve by requesting new headers:

@agent = Mechanize.new { |agent| agent.user_agent_alias = 'Windows Chrome' }
@agent.request_headers

You should also include some error handling if you haven't already. I wrote the following to give an idea:

begin  # beginning of block for handling rescue
  @results_page = # getting some page and doing cool stuff

  # The following line puts Mechanize to sleep for 1/10 second whenever a new
  # page is reached. This keeps you from overloading the site you're scraping
  # and minimizes the chance of getting errors. If you start to get '503'
  # errors, you should increase this number a little!
  @agent.history_added = Proc.new { sleep 0.1 }
rescue Mechanize::ResponseCodeError => exception
  if exception.response_code == "503"
    @agent.history_added = Proc.new { sleep 0.2 }
    # The following line closes all active connections
    @agent.shutdown
    @agent = Mechanize.new { |agent| agent.user_agent_alias = 'Windows Chrome' }
    @agent.request_headers
    @page = @agent.get('the-webpage-i-wanted.com')
    @form = @page. # getting back to where I was
    redo
  else
    # more error handling if needed
  end
end
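The rescue-and-redo flow above can be sketched more generically. This is a minimal, self-contained sketch: fetch_with_retry and TemporaryError are hypothetical stand-ins for the Mechanize call and Mechanize::ResponseCodeError, not part of any library.

```ruby
# Hypothetical sketch of the retry-with-backoff pattern used above.
class TemporaryError < StandardError; end

def fetch_with_retry(max_retries: 3, base_delay: 0.1)
  attempts = 0
  begin
    attempts += 1
    yield attempts                 # stands in for @agent.get(...)
  rescue TemporaryError
    raise if attempts > max_retries
    sleep(base_delay * attempts)   # back off a little more each time
    retry
  end
end

# Example: the block fails twice (simulating 503s), then succeeds.
result = fetch_with_retry(base_delay: 0.0) do |n|
  raise TemporaryError, "503" if n < 3
  "page-#{n}"
end
```

The same idea applies whether the failure is a 503 or any other transient response code; only the rescued exception class changes.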

***NOTE: Consider running this as a background process to avoid timeout errors on Heroku, since it only allows a 15-30 second request-response cycle. I use RedisToGo (a Heroku add-on) and Sidekiq (a gem), if you're not doing so already!

Upvotes: 3

user2009750

Reputation: 3197

When working with Mechanize and other such browser emulators, you have to monitor your network traffic; I prefer Google Chrome's developer tools.

Inspect your URL with a normal browser and check these:

  1. Is this URL valid?
  2. Is this URL public?
  3. Is this URL browser restricted?
  4. Is this URL secured by login?
  5. What parameters does this URL expect in normal conditions?

Debug these points, because the URL you are accessing may be restricted:

  • It may not be intended for public use
  • It may be a directory path where indexing is not allowed
  • The server may have restricted it to certain user agents
  • You may not be replicating the request completely

I realize I am using too many "may be"s, but my point is that if you can't post your link publicly, I can only guess at your error. If your link hits a directory whose indexing is off, then you can't browse it in Mechanize either. If it is restricted to specific user agents, then you should initialize Mechanize with a specific user agent, like:

browser = Mechanize.new
browser.user_agent_alias = 'Windows IE 7'

In any other case, you are not replicating your request: some important parameters may be missing, you may be sending the wrong request type, or headers may be missing.
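To illustrate the "replicating the request" point, here is a minimal stdlib-only sketch (Net::HTTP rather than Mechanize) of building a request with browser-like headers. The URL and header values are illustrative assumptions, not values known to satisfy any particular site:

```ruby
require 'net/http'
require 'uri'

uri = URI('https://example.com/listing')  # placeholder URL

# Build the request with browser-like headers copied from what you observe
# in the developer tools; the values below are only examples.
req = Net::HTTP::Get.new(uri)
req['User-Agent']      = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
req['Accept']          = 'text/html,application/xhtml+xml'
req['Accept-Language'] = 'en-US,en;q=0.9'

# To actually send it (requires network access):
# res = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(req) }
```

Comparing the headers your script sends against the ones your browser sends is usually the fastest way to find what the server is rejecting.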

EDIT: Now that you've provided the link, here is what you should do when dealing with HTTPS:

agent = Mechanize.new { |a| a.ssl_version, a.verify_mode = 'SSLv3', OpenSSL::SSL::VERIFY_NONE }

Upvotes: 0
