aboutaaron

Reputation: 5389

Scraping with Ruby and storing in a hash

I wrote a Ruby scraper to grab campaign finance data from the California senate and then save each candidate as a hash. Here's the code so far:

Here's the main website: http://cal-access.sos.ca.gov/Campaign/Candidates/

Here's an example of a candidate page: http://cal-access.sos.ca.gov/Campaign/Committees/Detail.aspx?id=1342974&session=2011&view=received

And here's the github repo in case you want to see my comments in the code: https://github.com/aboutaaron/Baugh-For-Senate-2012/blob/master/final-exam.rb

On to the code...

require 'nokogiri'
require 'open-uri'

campaign_data = Nokogiri::HTML(open('http://cal-access.sos.ca.gov/Campaign/Candidates/'))

class Candidate
  def initialize(url)
    @url = url
    @cal_access_url = "http://cal-access.sos.ca.gov"
    @nodes = Nokogiri::HTML(open(@cal_access_url + @url))
  end

  def get_summary
    candidate_page = @nodes

    {
      :political_party => candidate_page.css('span.hdr15').text,
      :current_status => candidate_page.css('td tr:nth-child(2) td:nth-child(2) .txt7')[0].text,
      :last_report_date => candidate_page.css('td tr:nth-child(3) td:nth-child(2) .txt7')[0].text,
      :reporting_period => candidate_page.css('td tr:nth-child(4) td:nth-child(2) .txt7')[0].text,
      :contributions_this_period => candidate_page.css('td tr:nth-child(5) td:nth-child(2) .txt7')[0].text.gsub(/[$,](?=\d)/, ''),
      :total_contributions_this_period => candidate_page.css('td tr:nth-child(6) td:nth-child(2) .txt7')[0].text.gsub(/[$,](?=\d)/, ''),
      :expenditures_this_period => candidate_page.css('td tr:nth-child(7) td:nth-child(2) .txt7')[0].text.gsub(/[$,](?=\d)/, ''),
      :total_expenditures_this_period => candidate_page.css('td tr:nth-child(8) td:nth-child(2) .txt7')[0].text.gsub(/[$,](?=\d)/, ''),
      :ending_cash => candidate_page.css('td tr:nth-child(9) td:nth-child(2) .txt7')[0].text.gsub(/[$,](?=\d)/, '')
    }
  end

  def get_contributors
    # follow the "contributions received" link off the candidate page
    grab_contributor_page = @nodes.css("a.sublink6")[0]['href']
    contributor_page = Nokogiri::HTML(open(@cal_access_url + grab_contributor_page))
    grab_contributions_page = contributor_page.css("a")[25]["href"]
    contributions_received = Nokogiri::HTML(open(@cal_access_url + grab_contributions_page))
    puts
    puts "#{@cal_access_url}#{grab_contributions_page}"
    puts

    # walk every table on the page, collecting contributor names
    contributions_received.css("table").reduce([]) do |memo, contributors|
      begin
        memo << {
          :name_of_contributor => contributions_received.css("table:nth-child(57) tr:nth-child(2) td:nth-child(1) .txt7").text
        }
      rescue NoMethodError => e
        puts e.message
        puts "Error on #{contributors}"
      end
      memo
    end
  end
end

campaign_data.css('a.sublink2').each do |candidates|
  puts "Just grabbed the page for " + candidates.text
  candidate = Candidate.new(candidates["href"])
  p candidate.get_summary
end

get_summary works as planned. get_contributors grabs the first contributor <td> as intended, but it prints that same value 20-plus times. I'm only grabbing the name for now until I figure out the repeated-output issue.

The end goal is to have a hash of the contributors with all of their required information, and possibly to move them into a SQL database/Rails app. But before that, I just want a working scraper.

Any advice or guidance? Sorry if the code isn't great; I'm a super newbie to programming.

Upvotes: 2

Views: 914

Answers (1)

Wayne Conrad

Reputation: 107979

You're doing great. Good job on providing a stand-alone sample. You'd be surprised how many don't do that.

I see two problems.

The first is that not all pages have the statistics you're looking for. This causes your parsing routines to get a bit upset. To guard against that, you can put this in get_summary:

return nil if candidate_page.text =~ /has not electronically filed/i

The caller should then do something intelligent when it sees a nil.
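For example, the loop at the bottom of your script could check for that nil and skip those candidates (reusing your existing variable names):

campaign_data.css('a.sublink2').each do |candidates|
  puts "Just grabbed the page for " + candidates.text
  candidate = Candidate.new(candidates["href"])
  summary = candidate.get_summary
  if summary
    p summary
  else
    puts "No electronic filings for " + candidates.text   # nothing to parse on this page
  end
end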

The other problem is that the server sometimes doesn't respond in a timely fashion, so the script times out. If you think the server is getting upset at the rate with which your script is making requests, you can try adding some sleeps to slow it down. Or, you could add a retry loop. Or, you could increase the amount of time it takes for your script to time out.
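If you go the retry route, a small helper along these lines could wrap each fetch. This is just a sketch: fetch_with_retry, the three attempts, and the five-second pause are illustrative choices, and the exact exceptions you need to rescue may vary with your Ruby version.

def fetch_with_retry(url, attempts = 3)
  Nokogiri::HTML(open(url))
rescue OpenURI::HTTPError, Timeout::Error
  attempts -= 1
  raise if attempts < 1
  sleep 5   # give the server a break before trying again
  retry     # re-runs the method body with the decremented attempts count
end

Then the constructor becomes @nodes = fetch_with_retry(@cal_access_url + @url).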

There is also some duplication of logic in get_summary. This function might benefit from a separation of policy from implementation. The policy is what data to retrieve from the page, and how to format it:

FORMAT_MONEY = proc do |s|
  s.gsub(/[$,](?=\d)/, '')
end

FIELDS = [
  [:political_party, 'span.hdr15'],
  [:current_status, 'td tr:nth-child(2) td:nth-child(2) .txt7'],
  [:last_report_date, 'td tr:nth-child(3) td:nth-child(2) .txt7'],
  [:reporting_period, 'td tr:nth-child(4) td:nth-child(2) .txt7'],
  [:contributions_this_period, 'td tr:nth-child(5) td:nth-child(2) .txt7', FORMAT_MONEY],
  [:total_contributions_this_period, 'td tr:nth-child(6) td:nth-child(2) .txt7', FORMAT_MONEY],
  [:expenditures_this_period, 'td tr:nth-child(7) td:nth-child(2) .txt7', FORMAT_MONEY],
  [:total_expenditures_this_period, 'td tr:nth-child(8) td:nth-child(2) .txt7', FORMAT_MONEY],
  [:ending_cash, 'td tr:nth-child(9) td:nth-child(2) .txt7', FORMAT_MONEY],
]

The implementation is how to apply that policy to the HTML page:

def get_summary
  candidate_page = @nodes
  return nil if candidate_page.text =~ /has not electronically filed/i
  keys_and_values = FIELDS.map do |key, css_selector, format|
    value = candidate_page.css(css_selector)[0].text
    value = format[value] if format
    [key, value]
  end
  Hash[keys_and_values]
end
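With the table-driven version, get_summary returns the same hash as before, but adding, removing, or reordering a field is a one-line change to FIELDS rather than another copy of the selector boilerplate. Hash[keys_and_values] simply turns the array of [key, value] pairs into a hash.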

Upvotes: 1
