How to scrape pages originating from the same domain on Heroku

Question

I created a Facebook-style URL scraper for posting content.

When someone enters a URL, a request is sent to the backend, where I use Nokogiri to scrape the URL and pull the information needed to construct the post.

It works fine for other websites such as apple.com and sony.com, but when I use a link from my own domain ("mywebsite.com") the request times out. No error is displayed; Heroku simply times out the request after 30 seconds. If I scrape my domain from localhost on my own computer, it works.

Is there some kind of origin rule preventing Nokogiri from scraping pages originating from the same domain?

I'm using Ruby on Rails 3.1.10, Nokogiri 1.4.7 and the Heroku Cedar stack.

mind.blank · Accepted Answer

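Is the scraping run in a background job or inside the web request? Do you have only 1 dyno? If your app has only 1 web worker, that worker may be busy handling the scrape request and therefore can't serve the page being scraped. The scrape then waits on a response that only the same worker could produce, and the request hangs until Heroku's 30-second timeout. That would also explain why it works from localhost: there the scraper and the site run in separate processes.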
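To make a hang like this surface quickly instead of waiting for the platform timeout, the fetch can be given explicit timeouts. The following is only a sketch using Ruby's standard Net::HTTP; the helper name fetch_html and the timeout values are assumptions for illustration, not part of the original app, and the returned HTML would still be handed to Nokogiri as before.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: fetch a page with short timeouts so a stuck
# request fails fast rather than running into Heroku's 30-second limit.
# Returns the response body on a 2xx response, otherwise nil.
def fetch_html(url, open_timeout: 5, read_timeout: 10)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == "https")
  http.open_timeout = open_timeout   # seconds to wait for the connection
  http.read_timeout = read_timeout   # seconds to wait for each read
  response = http.request_get(uri.request_uri)
  response.is_a?(Net::HTTPSuccess) ? response.body : nil
rescue Timeout::Error, SocketError, SystemCallError
  # Net::OpenTimeout and Net::ReadTimeout are subclasses of Timeout::Error.
  nil
end
```

A nil result for your own domain, while other sites return HTML, would be consistent with the single-worker explanation above.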

Try scaling your dynos to 2 and see if the problem persists.

heroku ps:scale web=2
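If adding a dyno is not an option, another possibility is to avoid the HTTP round trip for your own pages altogether. This is a rough sketch, not something from the original app: own_host? and OWN_HOSTS are hypothetical names, and "mywebsite.com" is the placeholder domain from the question.

```ruby
require "uri"

# Assumption: the hosts this app is served from.
OWN_HOSTS = ["mywebsite.com"].freeze

# Hypothetical helper: true when the URL points at one of the app's own
# hosts (ignoring case and a leading "www."), false for anything else,
# including strings that do not parse as a URL with a host.
def own_host?(url, own_hosts = OWN_HOSTS)
  host = URI.parse(url).host
  return false if host.nil?
  own_hosts.include?(host.downcase.sub(/\Awww\./, ""))
rescue URI::InvalidURIError
  false
end
```

When own_host? returns true, the post could be built from data the app already has rather than by scraping itself over HTTP; everything else goes through the normal scraper.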
