Chip

Reputation: 653

Nokogiri in Rails - Scrape results of multiple pages

I'm using Nokogiri in Rails to scrape results from multiple pages of a subreddit, but I can only get the first page. Any ideas on how to achieve this? Here's my current RedditScraper class:

require 'nokogiri'
require 'open-uri'

class RedditScraper

  def initialize
    @headline = []
  end

  def fetch_reddit_headlines
    # URI.open is the open-uri entry point; bare open(url) no longer works on Ruby 3+
    page = Nokogiri::HTML(URI.open("http://www.reddit.com/r/ruby"))
    page.css('a.title').each do |link|
      if link['href'].include?('http')
        @headline << { content: link.content, href: link['href'] }
      else
        @headline << { content: link.content, href: "http://reddit.com" + link['href'] }
      end
    end
    @headline
  end
end

JUST ADDED

Controller method:

def index
 @fetch_reddit = RedditScraper.new.fetch_reddit_headlines    
end

View code:

<ol>
 <% @fetch_reddit.each do |url| %>
  <li><%= link_to url[:content], url[:href], target: '_blank' %></li>
 <% end %>
</ol>

Upvotes: 0

Views: 1322

Answers (1)

Matthew Leonard

Reputation: 2005

If you use Mechanize with Nokogiri, you could click on the next page link by doing something like this:

Update: Fixed some bugs

require 'nokogiri'
require 'mechanize' # Mechanize parses each fetched page with Nokogiri

class RedditScraper

  def initialize
    @headline = []
    @agent = Mechanize.new
  end

  # Scrapes up to num_pages_to_scrape pages, following the "next" link each time.
  def fetch_reddit_headlines(num_pages_to_scrape = 10)
    url = 'http://www.reddit.com/r/ruby'

    num_pages_to_scrape.times do
      break if url.nil? # no next link was found on the previous page

      page = @agent.get(url).parser

      page.css('a.title').each do |link|
        # start_with? avoids misreading relative paths that merely contain "http"
        if link['href'].start_with?('http')
          @headline << { content: link.content, href: link['href'] }
        else
          @headline << { content: link.content, href: "http://reddit.com" + link['href'] }
        end
      end

      # The last link inside .nextprev is taken to be the "next" link
      next_link = page.css('.nextprev').css('a').last
      url = next_link && next_link['href']
    end

    @headline
  end
end


r = RedditScraper.new
headlines = r.fetch_reddit_headlines
puts headlines
puts headlines.count
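
If you want to test the paging logic without hitting Reddit, one option is to pull the "follow the next link until there is none" loop out into its own method. The following is only a rough sketch using the standard library: `absolute_href`, `collect_pages`, and the `fetch_page` callable are hypothetical names I made up for illustration, and `BASE_URL` is just the host used above. In the real scraper, `fetch_page` would wrap `@agent.get` and the `a.title` / `.nextprev` selectors.

```ruby
require 'uri'

BASE_URL = 'http://www.reddit.com' # assumption: same host as in the scraper above

# Resolves an href against the base URL. Absolute hrefs come back unchanged.
# Note: URI.join raises URI::InvalidURIError for malformed hrefs.
def absolute_href(href, base = BASE_URL)
  URI.join(base, href).to_s
end

# Collects items from up to max_pages pages.
# fetch_page (hypothetical) is any callable that takes a URL and returns
# [items_on_that_page, next_url_or_nil].
def collect_pages(start_url, max_pages:, fetch_page:)
  results = []
  url = start_url

  max_pages.times do
    break if url.nil? # the previous page had no next link

    items, url = fetch_page.call(url)
    results.concat(items)
  end

  results
end

# Offline usage with made-up pages standing in for real HTTP responses.
fake_pages = {
  'page1' => [['first headline', 'second headline'], 'page2'],
  'page2' => [['third headline'], nil]
}
headlines = collect_pages('page1', max_pages: 10, fetch_page: ->(url) { fake_pages.fetch(url) })
```

Because the loop only stops on `max_pages` or a missing next URL, it never requests a page whose results would be thrown away.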

Upvotes: 1
