Reputation: 653
I'm using Nokogiri in Rails to scrape results from multiple pages of a subreddit, but I can only get the first page. Any ideas on how to achieve this? Here's my current RedditScraper class:
require 'nokogiri'
require 'open-uri'

class RedditScraper
  def initialize
    @headline = []
  end

  def fetch_reddit_headlines
    page = Nokogiri::HTML(open("http://www.reddit.com/r/ruby"))
    page.css('a.title').each do |link|
      if link['href'].include?('http')
        @headline << { content: link.content, href: link['href'] }
      else
        @headline << { content: link.content, href: "http://reddit.com" + link['href'] }
      end
    end
    @headline
  end
end
JUST ADDED
Controller method:
def index
  @fetch_reddit = RedditScraper.new.fetch_reddit_headlines
end
View code:
<ol>
  <% @fetch_reddit.each do |url| %>
    <li><%= link_to url[:content], url[:href], target: '_' %></li>
  <% end %>
</ol>
Upvotes: 0
Views: 1322
Reputation: 2005
If you use Mechanize with Nokogiri, you could click on the next page link by doing something like this:
Update: Fixed some bugs
require 'nokogiri'
require 'mechanize'

class RedditScraper
  attr_reader :headline

  def initialize
    @headline = []
    @agent = Mechanize.new
  end

  # Collects headlines from up to num_pages_to_scrape listing pages.
  def fetch_reddit_headlines(num_pages_to_scrape = 10)
    url = 'http://www.reddit.com/r/ruby'
    num_pages_to_scrape.times do
      # Mechanize pages expose their Nokogiri document via #parser.
      page = @agent.get(url).parser
      page.css('a.title').each do |link|
        if link['href'].include?('http')
          @headline << { content: link.content, href: link['href'] }
        else
          @headline << { content: link.content, href: "http://reddit.com" + link['href'] }
        end
      end
      # Assumes the last link inside .nextprev is the "next" link;
      # stop early if the page has no such link.
      next_link = page.css('.nextprev a').last
      break unless next_link
      url = next_link['href']
    end
    @headline
  end
end

r = RedditScraper.new
r.fetch_reddit_headlines
puts r.headline
puts r.headline.count
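As a side note, the `include?('http')` check in the scraper treats any href containing the text "http" as absolute, which would also match a relative path such as `/r/ruby/comments/1/http_clients/`. One possible alternative, sketched here with only the standard library, is to resolve every href against a base URL with `URI.join`. The helper name `absolute_href` and the sample paths are made up for illustration and are not part of the code above.

```ruby
require 'uri'

# Hypothetical helper (an assumption, not from the original answer):
# resolves an href against a base URL so that relative and absolute
# links go through the same code path.
def absolute_href(href, base = 'http://www.reddit.com/')
  URI.join(base, href).to_s
end

# A relative link is resolved against the base URL.
puts absolute_href('/r/ruby/comments/abc123/some_post/')
# An absolute link is returned unchanged.
puts absolute_href('https://example.com/article')
```

Inside the loop this could replace the `if`/`else` with a single `@headline << { content: link.content, href: absolute_href(link['href']) }`. Note that `URI.join` raises `URI::InvalidURIError` for hrefs containing characters such as unescaped spaces, so real scraped data may need a rescue or some escaping first.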
Upvotes: 1