Reputation: 11244
I currently have the classes below for scraping products from a single retailer website using Nokogiri. The XPath and CSS path details are stored in MySQL.
require 'nokogiri'
require 'open-uri'

ActiveRecord::Base.establish_connection(
  :adapter => "mysql2",
  ...
)
class Site < ActiveRecord::Base
  has_many :site_details

  def create_product_links
    # url => http://www.example.com
    p = Nokogiri::HTML(open(url))
    p.xpath(total_products_path).each do |lnk|
      SiteDetail.find_or_create_by(url: url + "/" + lnk['href'], site_id: self.id)
    end
  end
end
class SiteDetail < ActiveRecord::Base
  belongs_to :site

  def get_product_data
    # url => http://www.example.com
    p = Nokogiri::HTML(open(url))
    title = p.css(site.title_path).text
    price = p.css(site.price_path).text
    description = p.css(site.description_path).text
    update_attributes!(title: title, price: price, description: description)
  end
end
# Execution
@s = Site.first
@s.site_details.each(&:get_product_data)
I will be adding more sites (around 700) in the future. Each site has a different page structure, so the get_product_data method cannot be used as is. I may have to use a case or if statement to jump to and execute the relevant code, and this class would soon become quite chunky and ugly (700 retailers); something like the sketch below.
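A hypothetical sketch (retailer names and selectors invented for illustration) of the dispatch I want to avoid:

def get_product_data
  p = Nokogiri::HTML(open(url))
  # one branch per retailer; this grows to 700 cases
  case site.name
  when "Example"
    title = p.css("h1.product-title").text
  when "FooBar"
    title = p.xpath("//foo/title").text
  # ... one branch for every other retailer
  end
  update_attributes!(title: title)
end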
What is the best design approach suitable in this scenario?
Upvotes: 3
Views: 233
Reputation: 15964
Like @James Woodward said, you're going to want to create a class for each retailer. The pattern I'm going to post has three parts:

1. ActiveRecord classes that implement a common interface for storing the data you want to record from each site
2. A set of scraper classes, one for each site you want to scrape, sharing a common mix-in
3. A cron job that runs each of the scrapers

ActiveRecord Interface

This step is pretty easy. You already have a Site and SiteDetail class. You can keep them for storing the data you scrape from websites in your database.
You told the Site and SiteDetail classes how to scrape data from websites. I would argue this is inappropriate. Now you've given the classes two responsibilities: storing records in the database and scraping data from the web.

We'll create new classes to handle the scraping responsibility in the second step. For now, you can strip down the Site and SiteDetail classes so that they only act as database records:
class Site < ActiveRecord::Base
  has_many :site_details
end

class SiteDetail < ActiveRecord::Base
  belongs_to :site
end
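For concreteness, here is a minimal migration sketch of the schema these two records imply (the column names come from the code in the question; everything else is assumed):

class CreateScraperTables < ActiveRecord::Migration
  def change
    create_table :sites do |t|
      t.string :url            # e.g. http://www.example.com
      t.timestamps
    end

    create_table :site_details do |t|
      t.references :site       # backs the belongs_to :site association
      t.string :url
      t.string :title
      t.string :price
      t.text :description
      t.timestamps
    end
  end
end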
Now, we'll create new classes that handle the scraping responsibility. If this were a language that supported abstract classes or interfaces like Java or C#, we would proceed like so:

1. Write an IScraper or AbstractScraper interface that handles the tasks common to scraping a website.
2. Write a FooScraper class for each of the sites you want to scrape, each one inheriting from AbstractScraper or implementing IScraper.

Ruby doesn't have abstract classes, though. What it does have is duck typing and module mix-ins. This means we'll use this very similar pattern:

1. Write a SiteScraper module that handles the tasks common to scraping a website. This module will assume that the classes that extend it have certain methods it can call.
2. Write a FooScraper class for each of the sites you want to scrape, each one mixing in the SiteScraper module and implementing the methods the module expects.

It looks like this:
module SiteScraper
  # Assumes that classes including this module implement
  # get_product_urls and get_product_details.
  #
  # get_product_urls should return a list of the URLs to
  # visit to get scraped data.
  #
  # get_product_details takes the URL of the product to
  # scrape as a string and returns a SiteDetail populated
  # with data scraped from the given URL.
  def get_data
    site = Site.new
    product_urls = get_product_urls

    product_urls.each do |product_url|
      site_detail = get_product_details product_url
      site_detail.site = site
      site_detail.save
    end
  end
end
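Ruby won't enforce this contract the way an interface would, so one idiom you can layer on (optional; not required by the pattern) is to give the module failing defaults, so a scraper that forgets a method raises a clear error instead of a NoMethodError deep inside get_data:

module SiteScraper
  # Failing defaults: every including class is expected to
  # override both of these with site-specific logic.
  def get_product_urls
    raise NotImplementedError, "#{self.class} must implement get_product_urls"
  end

  def get_product_details(product_url)
    raise NotImplementedError, "#{self.class} must implement get_product_details"
  end
end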
class ExampleScraper
  include SiteScraper

  def get_product_urls
    urls = []
    p = Nokogiri::HTML(open('http://www.example.com/products'))
    p.xpath('//products').each { |lnk| urls.push lnk }
    urls
  end

  def get_product_details(product_url)
    p = Nokogiri::HTML(open(product_url))
    title = p.xpath('//title').text
    price = p.xpath('//price').text
    description = p.xpath('//description').text

    site_detail = SiteDetail.new
    site_detail.title = title
    site_detail.price = price
    site_detail.description = description
    site_detail
  end
end
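A quick usage check, assuming the database connection from the question is already configured:

scraper = ExampleScraper.new
scraper.get_data  # scrapes every product URL and saves the results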
class FooBarScraper
  include SiteScraper

  def get_product_urls
    urls = []
    p = Nokogiri::HTML(open('http://www.foobar.com/foobars'))
    p.xpath('//foo/bar').each { |lnk| urls.push lnk }
    urls
  end

  def get_product_details(product_url)
    p = Nokogiri::HTML(open(product_url))
    title = p.xpath('//foo').text
    price = p.xpath('//bar').text
    description = p.xpath('//foo/bar/iption').text

    site_detail = SiteDetail.new
    site_detail.title = title
    site_detail.price = price
    site_detail.description = description
    site_detail
  end
end
... and so on, creating a class that mixes in SiteScraper and implements get_product_urls and get_product_details for each one of the 700 websites you need to scrape. Unfortunately, this is the tedious part of the pattern: there's no real way to get around writing a different scraping algorithm for all 700 sites.
The final step is to create the cron job that scrapes the sites. The schedule below uses the whenever gem's syntax; in a Rails app each scraper would be invoked through runner so it executes inside the app environment:

every :day, at: '12:00am' do
  runner 'ExampleScraper.new.get_data'
  runner 'FooBarScraper.new.get_data'
  # + 698 more lines
end
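If you'd rather not maintain 700 schedule entries by hand, one approach is to register every scraper class in one list and drive them from a single script. A minimal sketch (file name and structure assumed):

# scrape_all.rb -- a hypothetical single entry point
SCRAPERS = [
  ExampleScraper,
  FooBarScraper
  # register each new scraper class here
].freeze

SCRAPERS.each do |scraper_class|
  begin
    scraper_class.new.get_data
  rescue StandardError => e
    # one broken site shouldn't halt the other 699
    puts "#{scraper_class} failed: #{e.message}"
  end
end

The cron job then only needs one line that runs this script.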
Upvotes: 1