Reputation: 10317
I have been practicing writing a number of Ruby scrapers using Mechanize and Nokogiri. However, it seems that after making a certain number of requests (about 14,000 in this case) I get a connection timed out error:
/var/lib/gems/1.8/gems/net-http-persistent-2.5.1/lib/net/http/persistent/ssl_reuse.rb:90:in `initialize': Connection timed out - connect(2) (Errno::ETIMEDOUT)
I have Googled a lot, but the best answer I can find is that I am making too many requests to the server. Is there a way to fix this by throttling or some other method?
Upvotes: 0
Views: 934
Reputation: 10317
After some more programming experience, I realized that this was a simple error on my part: my code did not catch the error thrown and appropriately move to the next link when a link was corrupted.
For any novice Ruby programmers who encounter a similar problem:
The Connection timed out error is usually caused by an invalid link (or a similar issue) on the page being scraped.
You need to wrap the code that accesses each link in a begin/rescue block such as the one below:
begin
  # [1] your scraping code here
rescue
  # [2] code to move on to the next link/page/etc. that you are scraping
  #     instead of sticking to the invalid one
end
For instance, if you have a loop that iterates over links and extracts information from each one, the per-link scraping code (inside the loop) goes at [1], and the code that moves on to the next link (consider using Ruby's next) goes at [2]. You might also print something to the console to let the user know that a link was invalid, as in the sketch below.
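Here is a minimal sketch of that pattern with Mechanize; the links array and the extract_data helper are placeholders standing in for whatever your own scraper already defines, so adapt the names to your code.

require 'mechanize'

agent = Mechanize.new

links.each do |link|              # links: placeholder for your own list of URLs
  begin
    page = agent.get(link)        # [1] fetch and scrape the page
    extract_data(page)            # placeholder for your own extraction logic
  rescue StandardError => e
    # [2] report the bad link and move on instead of aborting the whole run
    puts "Skipping invalid link #{link}: #{e.message}"
    next
  end
end

Errno::ETIMEDOUT is a subclass of StandardError, so this rescue also catches the timeout from the question while letting the loop continue with the remaining links.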
Upvotes: 0