Reputation: 11244
I currently have the classes below for scraping products from a single retailer website using Nokogiri. The XPath and CSS path details are stored in MySQL.
require 'nokogiri'
require 'open-uri'

ActiveRecord::Base.establish_connection(
  :adapter => "mysql2",
  ...
)
class Site < ActiveRecord::Base
  has_many :site_details

  def create_product_links
    # url => http://www.example.com
    p = Nokogiri::HTML(open(url))
    p.xpath(total_products_path).each do |lnk|
      SiteDetail.find_or_create_by(url: url + "/" + lnk['href'], site_id: self.id)
    end
  end
end
class SiteDetail < ActiveRecord::Base
  belongs_to :site

  def get_product_data
    # url => http://www.example.com
    p = Nokogiri::HTML(open(url))
    title = p.css(site.title_path).text
    price = p.css(site.price_path).text
    description = p.css(site.description_path).text
    update_attributes!(title: title, price: price, description: description)
  end
end
# Execution
@s = Site.first
@s.site_details.each(&:get_product_data)
I will be adding more sites (around 700) in the future. Each site has a different page structure, so the get_product_data method cannot be used as is. I may have to use a case or if statement to jump to and execute the relevant code, and this class would soon become quite chunky and ugly (700 retailers); something like the sketch below.
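A hypothetical sketch (retailer names and selectors invented for illustration) of the dispatch I want to avoid:

def get_product_data
  p = Nokogiri::HTML(open(url))
  # one branch per retailer; this grows to 700 cases
  case site.name
  when "Example"
    title = p.css("h1.product-title").text
  when "FooBar"
    title = p.xpath("//foo/title").text
  # ... one branch for every other retailer
  end
  update_attributes!(title: title)
end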
What is the best design approach suitable in this scenario?
Upvotes: 3
Views: 233
Reputation: 15964
Like @James Woodward said, you're going to want to create a class for each retailer. The pattern I'm going to post has three parts:

1. ActiveRecord classes that implement a common interface for storing the data you want to record from each site
2. A set of scraper classes, one for each site you want to scrape, sharing a common mix-in
3. A cron job that runs each of the scrapers

ActiveRecord Interface

This step is pretty easy. You already have a Site and SiteDetail class. You can keep them for storing the data you scrape from websites in your database.
You told the Site and SiteDetail classes how to scrape data from websites. I would argue this is inappropriate. Now you've given the classes two responsibilities: storing records in the database and scraping data from the web.

We'll create new classes to handle the scraping responsibility in the second step. For now, you can strip down the Site and SiteDetail classes so that they only act as database records:
class Site < ActiveRecord::Base
  has_many :site_details
end

class SiteDetail < ActiveRecord::Base
  belongs_to :site
end
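For concreteness, here is a minimal migration sketch of the schema these two records imply (the column names come from the code in the question; everything else is assumed):

class CreateScraperTables < ActiveRecord::Migration
  def change
    create_table :sites do |t|
      t.string :url            # e.g. http://www.example.com
      t.timestamps
    end

    create_table :site_details do |t|
      t.references :site       # backs the belongs_to :site association
      t.string :url
      t.string :title
      t.string :price
      t.text :description
      t.timestamps
    end
  end
end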
Now, we'll create new classes that handle the scraping responsibility. If this were a language that supported abstract classes or interfaces like Java or C#, we would proceed like so:

1. Write an IScraper or AbstractScraper interface that handles the tasks common to scraping a website.
2. Write a FooScraper class for each of the sites you want to scrape, each one inheriting from AbstractScraper or implementing IScraper.

Ruby doesn't have abstract classes, though. What it does have is duck typing and module mix-ins. This means we'll use this very similar pattern:

1. Write a SiteScraper module that handles the tasks common to scraping a website. This module will assume that the classes that extend it have certain methods it can call.
2. Write a FooScraper class for each of the sites you want to scrape, each one mixing in the SiteScraper module and implementing the methods the module expects.

It looks like this:
module SiteScraper
  # Assumes that classes including this module implement
  # get_product_urls and get_product_details.
  #
  # get_product_urls should return a list of the URLs to
  # visit to get scraped data.
  #
  # get_product_details takes the URL of the product to
  # scrape as a string and returns a SiteDetail populated
  # with data scraped from the given URL.
  def get_data
    site = Site.new
    product_urls = get_product_urls

    product_urls.each do |product_url|
      site_detail = get_product_details product_url
      site_detail.site = site
      site_detail.save
    end
  end
end
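Ruby won't enforce this contract the way an interface would, so one idiom you can layer on (optional; not required by the pattern) is to give the module failing defaults, so a scraper that forgets a method raises a clear error instead of a NoMethodError deep inside get_data:

module SiteScraper
  # Failing defaults: every including class is expected to
  # override both of these with site-specific logic.
  def get_product_urls
    raise NotImplementedError, "#{self.class} must implement get_product_urls"
  end

  def get_product_details(product_url)
    raise NotImplementedError, "#{self.class} must implement get_product_details"
  end
end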
class ExampleScraper
  include SiteScraper

  def get_product_urls
    urls = []
    p = Nokogiri::HTML(open('http://www.example.com/products'))
    p.xpath('//products').each { |lnk| urls.push lnk }
    urls
  end

  def get_product_details(product_url)
    p = Nokogiri::HTML(open(product_url))
    title = p.xpath('//title').text
    price = p.xpath('//price').text
    description = p.xpath('//description').text

    site_detail = SiteDetail.new
    site_detail.title = title
    site_detail.price = price
    site_detail.description = description
    site_detail
  end
end
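A quick usage check, assuming the database connection from the question is already configured:

scraper = ExampleScraper.new
scraper.get_data  # scrapes every product URL and saves the results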
class FooBarScraper
  include SiteScraper

  def get_product_urls
    urls = []
    p = Nokogiri::HTML(open('http://www.foobar.com/foobars'))
    p.xpath('//foo/bar').each { |lnk| urls.push lnk }
    urls
  end

  def get_product_details(product_url)
    p = Nokogiri::HTML(open(product_url))
    title = p.xpath('//foo').text
    price = p.xpath('//bar').text
    description = p.xpath('//foo/bar/iption').text

    site_detail = SiteDetail.new
    site_detail.title = title
    site_detail.price = price
    site_detail.description = description
    site_detail
  end
end
... and so on, creating a class that mixes in SiteScraper and implements get_product_urls and get_product_details for each one of the 700 websites you need to scrape. Unfortunately, this is the tedious part of the pattern: there's no real way to get around writing a different scraping algorithm for all 700 sites.
The final step is to create the cron job that scrapes the sites. The schedule below uses the whenever gem's syntax; in a Rails app each scraper would be invoked through runner so it executes inside the app environment:

every :day, at: '12:00am' do
  runner 'ExampleScraper.new.get_data'
  runner 'FooBarScraper.new.get_data'
  # + 698 more lines
end
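If you'd rather not maintain 700 schedule entries by hand, one approach is to register every scraper class in one list and drive them from a single script. A minimal sketch (file name and structure assumed):

# scrape_all.rb -- a hypothetical single entry point
SCRAPERS = [
  ExampleScraper,
  FooBarScraper
  # register each new scraper class here
].freeze

SCRAPERS.each do |scraper_class|
  begin
    scraper_class.new.get_data
  rescue StandardError => e
    # one broken site shouldn't halt the other 699
    puts "#{scraper_class} failed: #{e.message}"
  end
end

The cron job then only needs one line that runs this script.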
Upvotes: 1