Reputation: 524
I'm scraping a website using Ruby with Nokogiri.
This script creates a local text file, opens a URL, and writes the matched text to the file wherever the XPath expression //tr/td matches.
It is working fine.
require 'rubygems'
require 'nokogiri'
require 'open-uri'

DOC_URL_FILE = "doc.csv"
url = "http://www.SuperSecretWebSite.com"

data = Nokogiri::HTML(open(url))
all_data = data.xpath('//tr/td').text
File.open(DOC_URL_FILE, 'w') { |file| file.write all_data }
Each row has five fields, which I would like to run horizontally, moving to the next line after five cells are filled. The data is all there but isn't usable as-is.
I was hoping to learn, or get code from someone who knows, how to format this output as CSV.
The layout of the HTML is:
<tr>
<td>John Smith</td>
<td>I live here 123</td>
<td>phone ###</td>
<td>Birthday</td>
<td>Other Data</td>
</tr>
What the final product should look like:
http://picpaste.com/pics/Screenshot-KRnqRGrP.1361813552.png
Current output:
john Smith I live here 123 phone ### Birthday Other Data,
Upvotes: 0
Views: 1999
Reputation: 160631
This is pretty standard code to walk a table and extract its cells into an array of arrays. What you do with the data at that point is up to you, but it's very easy to pass it to CSV.
require 'nokogiri'
require 'pp'
doc = Nokogiri::HTML(<<EOT)
<table>
<tr>
<td>John Smith</td>
<td>I live here 123</td>
<td>phone ###</td>
<td>Birthday</td>
<td>Other Data</td>
</tr>
<tr>
<td>John Smyth</td>
<td>I live here 456</td>
<td>phone ###</td>
<td>Birthday</td>
<td>Other Data</td>
</tr>
</table>
EOT
data = []
doc.at('table').search('tr').each do |tr|
  data << tr.search('td').map(&:text)
end
pp data
Which outputs:
[["John Smith", "I live here 123", "phone ###", "Birthday", "Other Data"],
["John Smyth", "I live here 456", "phone ###", "Birthday", "Other Data"]]
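As a side note (not in the original answer), that array-of-arrays maps straight onto Ruby's standard csv library. A minimal sketch, with the data hard-coded for illustration:

```ruby
require 'csv'

# The same array-of-arrays the Nokogiri loop produces.
data = [["John Smith", "I live here 123", "phone ###", "Birthday", "Other Data"],
        ["John Smyth", "I live here 456", "phone ###", "Birthday", "Other Data"]]

# CSV.generate builds a CSV-formatted string in memory;
# << appends one row per array.
csv_text = CSV.generate do |csv|
  data.each { |row| csv << row }
end

puts csv_text
# John Smith,I live here 123,phone ###,Birthday,Other Data
# John Smyth,I live here 456,phone ###,Birthday,Other Data
```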
The code uses at to locate the first <table>, then iterates over each <tr> using search. For each row, it iterates over the cells and extracts their text.
Nokogiri's at finds the first occurrence of something and returns a Node. search finds all occurrences and returns a NodeSet, which acts like an array. I'm using CSS accessors, instead of XPath, for simplicity.
As an FYI:
File.open(DOC_URL_FILE, 'w'){|file| file.write all_data}
can be written more succinctly as:
File.write(DOC_URL_FILE, all_data)
I've been working on this problem for a while. Can you give me any more help?
Sigh...
Did you read the CSV documentation, especially the examples? What happens if, instead of defining data = [], we replace it with:
CSV.open("path/to/file.csv", "wb") do |data|
and wrap the loop with the CSV block, like:
CSV.open("path/to/file.csv", "wb") do |data|
  doc.at('table').search('tr').each do |tr|
    data << tr.search('td').map(&:text)
  end
end
That's not tested, but it's really that simple. Go and fiddle with that.
Upvotes: 5