Reputation: 53
I'm trying to scrape a page with financial data using Nokogiri and Ruby 1.9.3.
I'm having trouble getting the right XPath or CSS filter to get the table that holds the data, then iterate through the data and assemble it so the output can be put into a CSV file like this:
Date,Company,Symbol,Reported EPS,Consensus EPS
20130828,CDN WESTERN BANK,CWB.TO,0.60,0.59
I used Firebug to get the XPath and CSS data. What is the correct format for XPath or CSS to extract the table then iterate through the lines to assemble them for output to a file?
require 'rubygems'
require 'mechanize'
require 'nokogiri'
require 'uri'
@agent = Mechanize.new do |a|
  a.user_agent_alias = "Windows IE 6"
end
url = "http://biz.yahoo.com/z/20130828.html"
page = @agent.get(url)
doc = Nokogiri::HTML(page.body)
puts doc.inspect
#~ from firebug
#~ xpath /html/body/p[3]/table/tbody
#~ css html body p table tbody
Upvotes: 0
Views: 376
Reputation: 160551
I generally prefer CSS over XPath for readability. This is roughly what I'd use:
require 'open-uri'
require 'nokogiri'
URL = "http://biz.yahoo.com/z/20130828.html"
doc = Nokogiri::HTML(open(URL))
table = doc.css('table')[4]
data = table.search('tr')[2..-1].map { |row|
  row.search('td').map(&:text)
}
data
# => [["CDN WESTERN BANK",
# "CWB.TO",
# "1.69",
# "0.60",
# "0.59",
# "N/A",
# "Quote, Chart, News, ProfileReports, Research"],
# ["Casella Waste Systems, Inc.",
# "CWST",
# "71.43",
# "-0.02",
# "-0.07",
# "N/A",
# "Quote, Chart, News, ProfileReports, Research, Msgs, Insider, Analyst Ratings"],
# ["Culp, Inc. Common Stock",
# "CFI",
# "5.56",
# "0.38",
# "0.36",
# "Listen",
# "Quote, Chart, News, ProfileReports, Research, Msgs, Insider, Analyst Ratings"],
There's a lot more data returned, but that's sufficient to show what the code is grabbing.
It's not at all necessary to use Mechanize for this task. Unless you need to navigate through a site, Mechanize isn't helping you very much, so I'd go with OpenURI.
See "How to avoid joining all text from Nodes when scraping" also.
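To get from that array to the CSV layout the question asks for, something like the following sketch might work. It uses Ruby's standard csv library. The column positions (company, symbol, an unused third column, reported EPS, consensus EPS) are an assumption, inferred by matching the sample line in the question against the first row of output above. Taking the date from the URL's file name is also an assumption. The two sample rows stand in for the real `data` array.

```ruby
require 'csv'

url = "http://biz.yahoo.com/z/20130828.html"

# Stand-in for the `data` array built above (two rows copied from its output).
data = [
  ["CDN WESTERN BANK", "CWB.TO", "1.69", "0.60", "0.59", "N/A",
   "Quote, Chart, News, ProfileReports, Research"],
  ["Casella Waste Systems, Inc.", "CWST", "71.43", "-0.02", "-0.07", "N/A",
   "Quote, Chart, News, ProfileReports, Research, Msgs, Insider, Analyst Ratings"]
]

# Assumption: the report date is the file name of the URL, minus ".html".
date = File.basename(url, ".html")

csv_text = CSV.generate do |csv|
  csv << ["Date", "Company", "Symbol", "Reported EPS", "Consensus EPS"]
  data.each do |row|
    # Assumed column order; the third column is skipped.
    company, symbol, _unused, reported, consensus = row
    csv << [date, company, symbol, reported, consensus]
  end
end

puts csv_text
```

CSV.generate quotes any field that contains a comma, so company names such as "Casella Waste Systems, Inc." stay in one column. To write a file instead of a string, CSV.open with a file name and "w" accepts the same block.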
Upvotes: 1
Reputation: 434685
Some browsers will add a <tbody> to a <table> while they're parsing/validating/fixing the incoming HTML. Firefox is one of those browsers. The XPath and CSS expressions that you're getting out of Firefox are for the HTML as Firefox sees it and that's not necessarily the HTML as Nokogiri will see it.
Drop the <tbody> and try this XPath:
/html/body/p[3]/table
to locate the table. You can also look at the raw HTML and see if there is an id attribute or class attribute on the table that you can use with CSS id (#the-id) or class (.the-class) selectors instead of a large path of elements.
Upvotes: 2