Reputation: 1602
When using Mechanize to pull some data from Craigslist I keep getting the following error on Heroku: status: Net::HTTPForbidden 1.1 403 Forbidden
I am wondering what some ways are to prevent this from happening. My setup is below:
agent = Mechanize.new do |agent|
  agent.log = @logger
  agent.user_agent_alias = 'Mac Safari'
  agent.robots = false
end
Any ideas?
Upvotes: 1
Views: 2245
Reputation: 4561
Figured I'd make this a bit cleaner. I had the same issue, which I was able to resolve by creating a fresh agent and requesting new headers:
@agent = Mechanize.new { |agent|
  agent.user_agent_alias = 'Windows Chrome'
}
@agent.request_headers
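If a fresh agent alone doesn't help, you can also send browser-like headers yourself. Mechanize's `request_headers` is a Hash of extra headers added to every request, so you can assign your own; the header names and values below are illustrative assumptions, not anything the site requires — copy the real ones from your browser:

```ruby
# Illustrative browser-like headers (values are assumptions; copy the
# real ones your browser sends from its developer tools).
browser_headers = {
  'Accept'          => 'text/html,application/xhtml+xml',
  'Accept-Language' => 'en-US,en;q=0.9'
}

# With a Mechanize agent you would then assign them:
#   @agent.request_headers = browser_headers
```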
You should also include some error handling if you haven't already. I wrote the following to give an idea:
begin # beginning of block for handling rescue
  @results_page = # getting some page and doing cool stuff

  # The following line puts Mechanize to sleep for 1/10 of a second each
  # time a new page is reached. This keeps you from overloading the site
  # you're scraping and minimizes the chance of getting errors. If you
  # start to get '503' errors you should increase this number a little!
  @agent.history_added = Proc.new { sleep 0.1 }
rescue Mechanize::ResponseCodeError => exception
  if exception.response_code == "503"
    @agent.history_added = Proc.new { sleep 0.2 }

    # The following line closes all active connections
    @agent.shutdown

    @agent = Mechanize.new { |agent|
      agent.user_agent_alias = 'Windows Chrome'
    }
    @agent.request_headers
    @page = @agent.get('the-webpage-i-wanted.com')
    @form = @page.#GettingBackToWhereIWas
    retry # re-runs the begin block from the top
  else
    # more error handling if needed
  end
end
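The rescue-and-restart logic above can be sketched, Mechanize aside, as a small plain-Ruby helper — this is a hypothetical illustration of the pattern (back off and retry when the server throttles you), not part of any library:

```ruby
# Hypothetical helper showing the back-off-and-retry pattern, without
# any Mechanize dependency. The block receives the attempt number and
# is expected to raise RuntimeError('503') when throttled.
def fetch_with_backoff(max_retries: 3, delay: 0.1)
  attempts = 0
  begin
    yield attempts
  rescue RuntimeError => e
    raise unless e.message == '503' # only retry on the throttling error
    attempts += 1
    raise if attempts > max_retries
    sleep delay
    delay *= 2                      # wait a little longer each time
    retry                           # re-runs the begin block
  end
end

# Succeeds on the third attempt after two simulated 503s.
result = fetch_with_backoff { |n| n < 2 ? raise('503') : :ok }
```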
NOTE: Consider running this as a background process to avoid timeout errors on Heroku, since they only allow a 15-30 second request/response cycle. I use RedisToGo (a Heroku add-on) and Sidekiq (a gem) if you're not doing this already!
Upvotes: 3
Reputation: 3197
When working with Mechanize and other browser emulators you have to monitor your network traffic; I prefer Google Chrome's developer tools.
Inspect your URL with a normal browser and check these:
Debug these points, because the URL you are accessing may be restricted for:
I guess I am using too many "may be"s, but my point is that if you can't post your link publicly I can only guess at your error. If your link directly hits a directory and its indexing is off, then you can't browse it in Mechanize either. If it is restricted to specific user agents, then you should initialize Mechanize with a specific user agent, like:
browser = Mechanize.new
browser.user_agent_alias = 'Windows IE 7'
In any other case you are not replicating your request exactly: either some important parameters are missing, you are sending the wrong request type, or headers may be missing.
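A quick way to spot what's missing is to diff the header names your browser sends (from the developer tools' Network tab) against what your script sends. The two header lists below are made up for illustration:

```ruby
# Hypothetical header sets: copy the real ones from your browser's
# Network tab and from your script's request log.
browser_headers = ['Host', 'User-Agent', 'Accept', 'Referer', 'Cookie']
script_headers  = ['Host', 'User-Agent', 'Accept']

# Array difference gives the headers the script never sent.
missing = browser_headers - script_headers
puts "Headers your script is not sending: #{missing.join(', ')}"
# → Headers your script is not sending: Referer, Cookie
```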
EDIT: Now that you've provided the link, here is what you should do while dealing with HTTPS:
agent = Mechanize.new do |a|
  a.ssl_version = 'SSLv3'
  a.verify_mode = OpenSSL::SSL::VERIFY_NONE
end
Upvotes: 0