Reputation: 64363
I want to restrict crawler access to my Rails app running on Heroku. This would have been a straightforward task if I were using Apache or nginx, but since the app is deployed on Heroku, I am not sure how I can restrict access at the HTTP server level.
I have tried using a robots.txt file, but the offending crawlers don't honor it.
These are the solutions I am considering:
1) A before_filter in the Rails layer to restrict access.
2) A Rack-based solution to restrict access.
I am wondering if there are any better ways to deal with this problem.
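For reference, here is a rough sketch of what I have in mind for option 1. The BLOCKED_UAS list is hypothetical and assumes the offending crawlers identify themselves via the User-Agent header:

```ruby
# app/controllers/application_controller.rb
class ApplicationController < ActionController::Base
  # Hypothetical list of User-Agent substrings to block.
  BLOCKED_UAS = ["BadBot", "EvilCrawler"].freeze

  before_filter :block_crawlers

  private

  # Renders a 403 before any action runs if the request's
  # User-Agent matches one of the blocked crawlers.
  def block_crawlers
    ua = request.user_agent.to_s
    if BLOCKED_UAS.any? { |bot| ua.include?(bot) }
      render text: "Forbidden", status: 403
    end
  end
end
```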
Upvotes: 4
Views: 2148
Reputation: 4176
I have read about honeypot solutions: you have one URI that must not be crawled (and that you disallow in robots.txt). If any IP requests this URI, block it. I'd implement it as Rack middleware so the hit never reaches the full Rails stack.
Sorry, I googled around but could not find the original article.
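A minimal sketch of the idea, using a hypothetical trap path /honeypot (which you would also list in robots.txt). The blocked-IP set here is in-memory only, so each Heroku dyno keeps its own list and it resets on restart; a real deployment would want a shared store such as Redis:

```ruby
# Enable it in config/application.rb (or an initializer):
#   config.middleware.insert_before 0, BotTrap
#
# public/robots.txt should contain:
#   User-agent: *
#   Disallow: /honeypot

require 'set'

class BotTrap
  TRAP_PATH = "/honeypot".freeze  # hypothetical honeypot URI

  def initialize(app)
    @app = app
    @blocked_ips = Set.new  # per-process only; use Redis or similar across dynos
  end

  def call(env)
    req = Rack::Request.new(env)

    # Any client that requests the trap URI gets blacklisted.
    @blocked_ips << req.ip if req.path == TRAP_PATH

    if @blocked_ips.include?(req.ip)
      # Short-circuit here so the hit never reaches the Rails stack.
      [403, { "Content-Type" => "text/plain" }, ["Forbidden"]]
    else
      @app.call(env)
    end
  end
end
```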
Upvotes: 9