How do I specify XPATH or CSS in Nokogiri to scrape a page's table data?

Question

I'm trying to scrape a page with financial data using Nokogiri and Ruby 1.9.3.

I'm having trouble getting the right XPath or CSS filter to get the table that holds the data, then iterate through the data and assemble it so the output can be put into a CSV file like this:

Date,Company,Symbol,Reported EPS,Consensus EPS
20130828,CDN WESTERN BANK,CWB.TO,0.60,0.59

I used Firebug to get the XPath and CSS data. What is the correct format for XPath or CSS to extract the table then iterate through the lines to assemble them for output to a file?

require 'rubygems'
require 'mechanize'
require 'nokogiri'
require 'uri'

@agent = Mechanize.new do |a|
  a.user_agent_alias = "Windows IE 6"
end

url = "http://biz.yahoo.com/z/20130828.html"
page = @agent.get(url)
doc = Nokogiri::HTML(page.body)
puts doc.inspect 

#~ from firebug
#~ xpath        /html/body/p[3]/table/tbody
#~ css      html body p table tbody

the Tin Man · Accepted Answer

I generally prefer CSS over XPath for readability. This is roughly what I'd use:

require 'open-uri'
require 'nokogiri'

URL = "http://biz.yahoo.com/z/20130828.html"
doc = Nokogiri::HTML(open(URL))  # on Ruby 2.5+ use URI.open(URL); plain open stopped accepting URLs in Ruby 3.0
table = doc.css('table')[4]

data = table.search('tr')[2..-1].map { |row|
  row.search('td').map(&:text)
}

data
# => [["CDN WESTERN BANK",
#      "CWB.TO",
#      "1.69",
#      "0.60",
#      "0.59",
#      "N/A",
#      "Quote, Chart, News, ProfileReports, Research"],
#     ["Casella Waste Systems, Inc.",
#      "CWST",
#      "71.43",
#      "-0.02",
#      "-0.07",
#      "N/A",
#      "Quote, Chart, News, ProfileReports, Research, Msgs, Insider, Analyst Ratings"],
#     ["Culp, Inc. Common Stock",
#      "CFI",
#      "5.56",
#      "0.38",
#      "0.36",
#      "Listen",
#      "Quote, Chart, News, ProfileReports, Research, Msgs, Insider, Analyst Ratings"],

There's a lot more data returned, but that's sufficient to show what the code is grabbing.
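To get from those rows to the CSV layout asked for in the question, something along these lines might work. It is only a sketch: it assumes each row is ordered as company, symbol, surprise percentage, reported EPS, consensus EPS (which is how the sample output above appears to line up), and it takes the date from the file name in the URL. The helper name `rows_to_csv` and the `sample` rows are made up for illustration; in practice you would pass in `data` from the code above.

```ruby
require 'csv'
require 'uri'

# Hypothetical helper: turn the scraped rows into CSV text.
# Assumes row layout: [company, symbol, surprise %, reported EPS, consensus EPS, ...]
def rows_to_csv(rows, date)
  CSV.generate do |csv|
    csv << ['Date', 'Company', 'Symbol', 'Reported EPS', 'Consensus EPS']
    rows.each do |row|
      next if row.size < 5 # skip spacer or malformed rows
      csv << [date, row[0], row[1], row[3], row[4]]
    end
  end
end

url  = 'http://biz.yahoo.com/z/20130828.html'
date = File.basename(URI.parse(url).path, '.html') # file name without the extension

# Stand-in for the `data` array produced by the scraping code.
sample = [
  ['CDN WESTERN BANK', 'CWB.TO', '1.69', '0.60', '0.59', 'N/A', 'Quote, Chart'],
  ['Casella Waste Systems, Inc.', 'CWST', '71.43', '-0.02', '-0.07', 'N/A', 'Quote']
]

csv_text = rows_to_csv(sample, date)
puts csv_text
```

Using the CSV library rather than joining with commas matters here because some company names contain commas, and CSV quotes those fields for you. Writing the result out is then a matter of `File.open('earnings.csv', 'w') { |f| f.write(csv_text) }`, where the file name is just a placeholder.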

It's not at all necessary to use Mechanize for this task. Unless you need to navigate through a site, Mechanize isn't helping you very much, so I'd go with OpenURI.

See also "How to avoid joining all text from Nodes when scraping".
