Reputation: 5389
I wrote a Ruby scraper to grab campaign finance data from the California Senate and then save each individual as a hash. Here's the code so far:
Here's the main website: http://cal-access.sos.ca.gov/Campaign/Candidates/
here's an example of a candidate page: http://cal-access.sos.ca.gov/Campaign/Committees/Detail.aspx?id=1342974&session=2011&view=received
And here's the github repo in case you want to see my comments in the code: https://github.com/aboutaaron/Baugh-For-Senate-2012/blob/master/final-exam.rb
On to the code...
require 'nokogiri'
require 'open-uri'
campaign_data = Nokogiri::HTML(open('http://cal-access.sos.ca.gov/Campaign/Candidates/'))
class Candidate
  def initialize(url)
    @url = url
    @cal_access_url = "http://cal-access.sos.ca.gov"
    @nodes = Nokogiri::HTML(open(@cal_access_url + @url))
  end

  def get_summary
    candidate_page = @nodes
    {
      :political_party => candidate_page.css('span.hdr15').text,
      :current_status => candidate_page.css('td tr:nth-child(2) td:nth-child(2) .txt7')[0].text,
      :last_report_date => candidate_page.css('td tr:nth-child(3) td:nth-child(2) .txt7')[0].text,
      :reporting_period => candidate_page.css('td tr:nth-child(4) td:nth-child(2) .txt7')[0].text,
      :contributions_this_period => candidate_page.css('td tr:nth-child(5) td:nth-child(2) .txt7')[0].text.gsub(/[$,](?=\d)/, ''),
      :total_contributions_this_period => candidate_page.css('td tr:nth-child(6) td:nth-child(2) .txt7')[0].text.gsub(/[$,](?=\d)/, ''),
      :expenditures_this_period => candidate_page.css('td tr:nth-child(7) td:nth-child(2) .txt7')[0].text.gsub(/[$,](?=\d)/, ''),
      :total_expenditures_this_period => candidate_page.css('td tr:nth-child(8) td:nth-child(2) .txt7')[0].text.gsub(/[$,](?=\d)/, ''),
      :ending_cash => candidate_page.css('td tr:nth-child(9) td:nth-child(2) .txt7')[0].text.gsub(/[$,](?=\d)/, '')
    }
  end

  def get_contributors
    contributions_received = @nodes
    grab_contributor_page = @nodes.css("a.sublink6")[0]['href']
    contributor_page = Nokogiri::HTML(open(@cal_access_url + grab_contributor_page))
    grab_contributions_page = contributor_page.css("a")[25]["href"]
    contributions_received = Nokogiri::HTML(open(@cal_access_url + grab_contributions_page))

    puts
    puts "#{@cal_access_url}" + "#{grab_contributions_page}"
    puts

    contributions_received.css("table").reduce([]) do |memo, contributors|
      begin
        memo << {
          :name_of_contributor => contributions_received.css("table:nth-child(57) tr:nth-child(2) td:nth-child(1) .txt7").text
        }
      rescue NoMethodError => e
        puts e.message
        puts "Error on #{contributors}"
      end
      memo
    end
  end
end
campaign_data.css('a.sublink2').each do |candidates|
  puts "Just grabbed the page for " + candidates.text
  candidate = Candidate.new(candidates["href"])
  p candidate.get_summary
end
get_summary works as planned. get_contributors stores the first contributor <td> as planned, but does it 20-plus times. I'm only choosing to grab the name for now until I figure out the multiple printing issue.
The end goal is to have a hash of the contributors with all of their required information and possibly move them into a SQL database/Rails app. But, before, I just want a working scraper.
Any advice or guidance? Sorry if the code isn't super. Super newbie to programming.
Upvotes: 2
Views: 914
Reputation: 107979
You're doing great. Good job on providing a stand-alone sample. You'd be surprised how many don't do that.
I see two problems.
The first is that not all pages have the statistics you're looking for. This causes your parsing routines to get a bit upset. To guard against that, you can put this in get_summary:

return nil if candidate_page.text =~ /has not electronically filed/i
The caller should then do something intelligent when it sees a nil.
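For instance, the loop at the bottom of your script could collect the summaries and drop the nils before printing them. Here's a sketch of that pattern using made-up stand-in data (so it runs without hitting the site); the hashes and party names are hypothetical:

```ruby
# Hypothetical stand-in for a list of get_summary results, where
# some candidates returned nil because they have not filed
# electronically.
fake_summaries = [
  { :political_party => 'DEMOCRATIC' },
  nil,
  { :political_party => 'REPUBLICAN' }
]

# Drop the nils before printing or storing anything.
filed = fake_summaries.compact
filed.each { |summary| p summary }
```

In the real script that would mean something like calling `compact` on the collected results, or doing `next if summary.nil?` inside the each loop.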
The other problem is that the server sometimes doesn't respond in a timely fashion, so the script times out. If you think the server is getting upset at the rate with which your script is making requests, you can try adding some sleeps to slow it down. Or, you could add a retry loop. Or, you could increase the amount of time it takes for your script to time out.
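A retry loop could look something like this. This is just a sketch (the helper name and parameters are my own invention, not part of your script), and it rescues StandardError broadly, which you'd probably want to narrow to the timeout and HTTP errors you actually see:

```ruby
require 'timeout'

# Retry the block up to `attempts` times, sleeping `delay` seconds
# between tries; re-raise the error once the attempts run out.
def with_retries(attempts: 3, delay: 2)
  tries = 0
  begin
    yield
  rescue StandardError
    tries += 1
    raise if tries >= attempts
    sleep delay
    retry
  end
end

# In the scraper, page fetches could then be wrapped like:
#   @nodes = with_retries { Nokogiri::HTML(open(@cal_access_url + @url)) }
```

The sleep between retries doubles as a way to slow down your request rate, which helps with the server-getting-upset problem too.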
There is also some duplication of logic in get_summary. This function might benefit from a separation of policy from logic. The policy is what data to retrieve from the page, and how to format it:
FORMAT_MONEY = proc do |s|
  s.gsub(/[$,](?=\d)/, '')
end

FIELDS = [
  [:political_party, 'span.hdr15'],
  [:current_status, 'td tr:nth-child(2) td:nth-child(2) .txt7'],
  [:last_report_date, 'td tr:nth-child(3) td:nth-child(2) .txt7'],
  [:reporting_period, 'td tr:nth-child(4) td:nth-child(2) .txt7'],
  [:contributions_this_period, 'td tr:nth-child(5) td:nth-child(2) .txt7', FORMAT_MONEY],
  [:total_contributions_this_period, 'td tr:nth-child(6) td:nth-child(2) .txt7', FORMAT_MONEY],
  [:expenditures_this_period, 'td tr:nth-child(7) td:nth-child(2) .txt7', FORMAT_MONEY],
  [:total_expenditures_this_period, 'td tr:nth-child(8) td:nth-child(2) .txt7', FORMAT_MONEY],
  [:ending_cash, 'td tr:nth-child(9) td:nth-child(2) .txt7', FORMAT_MONEY],
]
The implementation is how to apply that policy to the HTML page:
def get_summary
  candidate_page = @nodes
  return nil if candidate_page.text =~ /has not electronically filed/i
  keys_and_values = FIELDS.map do |key, css_selector, format|
    value = candidate_page.css(css_selector)[0].text
    value = format[value] if format
    [key, value]
  end
  Hash[keys_and_values]
end
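To see the data-driven approach in miniature, here's a toy run of the same map-over-FIELDS logic against a fake page object instead of a live Nokogiri document. The selectors and values are made up for illustration; the only point is how one table drives both extraction and formatting:

```ruby
# Same money formatter as above.
FORMAT_MONEY = proc { |s| s.gsub(/[$,](?=\d)/, '') }

# Hypothetical simplified selectors, not the real Cal-Access ones.
FIELDS = [
  [:political_party, 'span.party'],
  [:ending_cash, 'span.cash', FORMAT_MONEY]
]

# Toy stand-in for a parsed page: css(selector) returns an array of
# node-like objects that respond to #text, mimicking Nokogiri.
Node = Struct.new(:text)
page_data = {
  'span.party' => [Node.new('DEMOCRATIC')],
  'span.cash'  => [Node.new('$1,234.56')]
}
candidate_page = Object.new
candidate_page.define_singleton_method(:css) { |sel| page_data[sel] }

# The generic extraction loop from get_summary, unchanged.
keys_and_values = FIELDS.map do |key, css_selector, format|
  value = candidate_page.css(css_selector)[0].text
  value = format[value] if format
  [key, value]
end
summary = Hash[keys_and_values]
```

Adding a new field to the summary then means adding one row to FIELDS, with no change to the extraction code.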
Upvotes: 1