Rails beginner

Reputation: 14514

Mechanize - Simplest way to check if a page has been updated?

What is the simplest way to use Mechanize to see whether a page has been updated?

I was thinking about creating a table named pages.

That would have:

pagename - varchar
page - text
pageupdated - boolean

How should I create the screen scraper and save the data in the database? And how do I create a method that compares the HTML in the table with the scraped data, to check whether the page has been updated?

Upvotes: 2

Views: 856

Answers (1)

Adam Eberlin

Reputation: 14205

Answer updated and tested.

Here's an example using a Page model (and using retryable-rb):

rails generate scaffold Page name:string remote_url:string page:text digest:text page_updated:boolean

####### app/models/page.rb

require 'digest'
require 'mechanize'
require 'retryable'

class Page < ActiveRecord::Base
  include Retryable

  # Scrape page before validation
  before_validation :scrape_content, :if => :remote_url?

  # Will cause save to fail if page could not be retrieved
  validates_presence_of :page, :if => :remote_url?, :message => "URL provided is invalid or inaccessible."

  # Update digest if/when all validations have passed
  before_save :set_digest

  # ...

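  def update_page!
    # save! fires the before_validation and before_save callbacks,
    # which re-scrape the page and refresh the digest. Calling them
    # here as well would refresh the digest before the callbacks run,
    # so the second comparison would reset page_updated to false.
    self.save!
  end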

  def page_updated?
    self.page_updated
  end

  protected

  def scrape_content
    ua = 'Mozilla/5.0 (Macintosh; Intel Mac OS X) ' + 
         'AppleWebKit/535.1 (KHTML, like Gecko) ' + 
         'Chrome/14.0.835.186 Safari/535.1'

    # Using retryable, create scraper and get page
    scraper = Mechanize.new{ |i| i.user_agent = ua }
    scraped_page = retryable(:times => 3, :sleep => false) do
      scraper.get(URI.encode(self.remote_url))
    end
    self.page_updated = false
    self.page = scraped_page.content
    self.name ||= scraped_page.title
    # SHA-256 keeps the stored digest at a fixed 64 hex characters
    self.digest ||= Digest::SHA256.hexdigest(self.page)
  end

  def set_digest
    # Create new SHA-256 digest of page content
    new_digest = Digest::SHA256.hexdigest(self.page)

    # If digest has changed, update digest and set flag
    if (new_digest != self.digest) && !self.digest.nil?
      self.digest = new_digest
      self.page_updated = true
    else
      self.page_updated = false
    end

    true
  end

end
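
The comparison at the heart of set_digest does not depend on Rails or Mechanize. Here is a minimal sketch of it in plain Ruby, assuming SHA-256 as the hash. The helpers content_digest and content_changed? are hypothetical names used only for illustration and are not part of the model above.

```ruby
require 'digest'

# Hypothetical helper (not part of the Page model): returns the
# SHA-256 hex digest of a page body.
def content_digest(html)
  Digest::SHA256.hexdigest(html)
end

# Hypothetical helper: true only when a previous digest exists and the
# new content hashes to a different value. The nil check mirrors the
# one in set_digest, so a first scrape never counts as an update.
def content_changed?(stored_digest, new_html)
  !stored_digest.nil? && stored_digest != content_digest(new_html)
end

old_html = '<html><body>Version 1</body></html>'
new_html = '<html><body>Version 2</body></html>'
stored   = content_digest(old_html)

puts content_changed?(nil, old_html)     # => false (first scrape)
puts content_changed?(stored, old_html)  # => false (same content)
puts content_changed?(stored, new_html)  # => true  (content differs)
```

Storing only the digest is enough to detect a change; keeping the full HTML in the page column is only needed if you also want to show or diff the content.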

I'm fairly sure this is unrelated, but I'm getting a LoadError when I require 'mechanize' in the Rails console and in my test application. I'm not sure what's causing it, but I'll update my answer once I'm able to test this solution successfully.

Make sure you remember to add this to your application's Gemfile:

gem 'mechanize', '2.0.1'
gem 'retryable-rb', '1.1.0'

Usage Example:

p = Page.new(:remote_url => 'http://rubyonrails.org/')
p.save!
p.page_updated? # => false, since page hasn't been updated since creation
p.remote_url = 'http://www.google.com/' # for the sake of example
p.update_page!
p.page_updated? # => true
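
To re-check several pages in one pass, the same flow can be run in a loop. The sketch below is plain Ruby so that it runs without Rails or network access. WatchedPage and the fetcher lambda are hypothetical stand-ins for the Page model and the Mechanize call, and the example.com URLs are made up for the demo.

```ruby
require 'digest'

# Hypothetical stand-in for the Page model: holds a URL, the last
# digest, and a flag saying whether the last check saw new content.
WatchedPage = Struct.new(:remote_url, :digest, :page_updated) do
  # fetcher is any callable that returns the page body for a URL;
  # in the real model this is where Mechanize would be used.
  def check!(fetcher)
    new_digest = Digest::SHA256.hexdigest(fetcher.call(remote_url))
    self.page_updated = !digest.nil? && digest != new_digest
    self.digest = new_digest
    page_updated
  end
end

# Fake site content keyed by URL (assumption for the demo only).
site = {
  'http://example.com/a' => 'first',
  'http://example.com/b' => 'same'
}
fetcher = ->(url) { site.fetch(url) }

pages = site.keys.map { |url| WatchedPage.new(url) }
pages.each { |pg| pg.check!(fetcher) }   # initial scrape, nothing flagged

site['http://example.com/a'] = 'second'  # simulate a remote edit
pages.each { |pg| pg.check!(fetcher) }   # second pass detects the edit

pages.select(&:page_updated).each { |pg| puts "changed: #{pg.remote_url}" }
# prints "changed: http://example.com/a"
```

In the Rails app, the equivalent pass would presumably iterate over the Page records and call update_page! on each one, for example from a scheduled job.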

Upvotes: 1
