Reputation: 7545
I'm trying to scrape Reddit with Nokogiri, but even a single run of the script below fails with a "too many requests" error.
require 'nokogiri'
require 'open-uri'
url = "https://www.reddit.com/r/all"
redditscrape = Nokogiri::HTML(URI.open(url)) # URI.open: plain open(url) no longer accepts URLs on Ruby 3.0+
OpenURI::HTTPError: 429 Too Many Requests
Isn't this only one request? If it's not, how do I create sleep intervals for Nokogiri?
Upvotes: 1
Views: 1887
Reputation: 141
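The real answer is that you need to set a User-Agent header. It is only one request, but Reddit appears to throttle clients that send open-uri's default user agent, so sleeping between requests won't help here.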
https://www.reddit.com/r/redditdev/comments/3qbll8/429_too_many_requests/
and
How to set a custom user agent in ruby
Setting a custom user agent let me use open-uri and Nokogiri without hitting the error.
So, to summarize:
redditscrape = Nokogiri::HTML(URI.open(url, 'User-Agent' => 'Nooby')) # use URI.open on Ruby 3.0+; on older versions open(url, ...) also works
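Since the question also asked about sleep intervals, here is a minimal sketch of how the user-agent fix could be combined with a pause-and-retry on 429. The helper name fetch_html, its parameters, and the delay values are my own assumptions for illustration, not part of open-uri or Nokogiri; the opener and sleeper arguments exist only so the helper can be exercised without touching the network.

```ruby
require 'open-uri'

# Hypothetical helper (not part of open-uri): fetches a page with a custom
# User-Agent and, if the server answers 429, pauses and tries again.
# The pause grows with each attempt: base_delay, then 2 * base_delay, ...
def fetch_html(url, user_agent:, max_attempts: 3, base_delay: 2,
               opener: URI.method(:open), sleeper: Kernel.method(:sleep))
  attempts = 0
  begin
    attempts += 1
    opener.call(url, 'User-Agent' => user_agent).read
  rescue OpenURI::HTTPError => e
    # open-uri puts the status line in the message, e.g. "429 Too Many Requests".
    # Anything other than a 429, or a 429 on the last attempt, is re-raised.
    raise unless e.message.start_with?('429') && attempts < max_attempts

    sleeper.call(base_delay * attempts)
    retry
  end
end

# Usage (performs a real HTTP request, so it is left commented out):
# require 'nokogiri'
# html = fetch_html('https://www.reddit.com/r/all', user_agent: 'Nooby')
# redditscrape = Nokogiri::HTML(html)
```

Nokogiri itself only parses the HTML it is handed, so any waiting has to happen around the HTTP call, which is why the sleep lives in the fetch helper rather than in the parsing step.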
Upvotes: 4
Reputation: 506
Reddit has an API
You could probably query the API for the particular subreddit(s) you want to scrape. Attempting to scrape all of Reddit seems like a nightmare waiting to happen, considering the high volume and the nested comments.
It looks like Reddit is blocking scraping of its HTML pages in favor of its public API.
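As a rough sketch of what querying the API for one subreddit might look like: this assumes Reddit still serves a JSON listing when ".json" is appended to a subreddit URL, and that posts sit under data -> children -> data in that listing. The helper names listing_url and post_titles are made up for this example, and the official API docs should be checked for current authentication and rate-limit rules.

```ruby
require 'json'
require 'open-uri'

# Assumption: appending ".json" to a subreddit URL returns a JSON listing.
# Builds that URL for one subreddit (hypothetical helper).
def listing_url(subreddit, limit: 25)
  "https://www.reddit.com/r/#{subreddit}.json?limit=#{limit}"
end

# Pulls the post titles out of a listing document. Assumes the shape
# {"data": {"children": [{"data": {"title": "..."}}, ...]}}.
def post_titles(json_text)
  listing = JSON.parse(json_text)
  listing.fetch('data').fetch('children').map do |child|
    child.fetch('data').fetch('title')
  end
end

# Usage (performs a real HTTP request, so it is left commented out):
# json = URI.open(listing_url('ruby'), 'User-Agent' => 'Nooby').read
# puts post_titles(json)
```

Working from JSON also sidesteps parsing the page markup, so Nokogiri is not needed for this route.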
Upvotes: 4