Duck1337

Reputation: 524

Formatting HTML into CSV

I'm scraping a website using Ruby with Nokogiri.

This script creates a local text file, opens a URL, and writes the text of every `tr/td` match to the file. It works fine.

require 'rubygems'
require 'nokogiri'
require 'open-uri'

DOC_URL_FILE = "doc.csv" 

url = "http://www.SuperSecretWebSite.com"

data = Nokogiri::HTML(open(url))


all_data = data.xpath('//tr/td').text

File.open(DOC_URL_FILE, 'w'){|file| file.write all_data} 

Each row has five fields, which I would like written horizontally, moving to the next line after five cells are filled. The data is all there, but it isn't usable.

I was hoping to learn how to write CSV-formatting code that:

  1. While reading the document, dumps each group of five <td>…</td> cells into its own cells horizontally.
  2. Moves to the next line, and so on.
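For reference, the regrouping described above can be sketched with Ruby's each_slice, assuming exactly five cells per row (the cell values below are stand-ins for the scraped text):

```ruby
require 'csv'

# Flat list of cell text, as produced by data.xpath('//tr/td').text-style
# extraction once the row structure has been discarded; stand-in values.
cells = [
  "John Smith", "I live here 123", "phone ###", "Birthday", "Other Data",
  "Jane Doe",   "I live here 456", "phone ###", "Birthday", "Other Data"
]

# Group every five cells into one CSV row.
csv = CSV.generate do |out|
  cells.each_slice(5) { |row| out << row }
end

puts csv
```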

The layout of the HTML is:

<tr>
    <td>John Smith</td>
    <td>I live here 123</td>
    <td>phone ###</td>
    <td>Birthday</td>
    <td>Other Data</td>
</tr>

What the final product should look like:

http://picpaste.com/pics/Screenshot-KRnqRGrP.1361813552.png

Current output:

    john Smith      I live here 123  phone ### Birthday Other Data,

Upvotes: 0

Views: 1999

Answers (1)

the Tin Man

Reputation: 160631

This is pretty standard code to walk a table and extract its cells into an array of arrays. What you do with the data at that point is up to you, but it's very easy to pass it to CSV.

require 'nokogiri'
require 'pp'

doc = Nokogiri::HTML(<<EOT)
<table>
  <tr>
    <td>John Smith</td>
    <td>I live here 123</td>
    <td>phone ###</td>
    <td>Birthday</td>
    <td>Other Data</td>
  </tr>
  <tr>
    <td>John Smyth</td>
    <td>I live here 456</td>
    <td>phone ###</td>
    <td>Birthday</td>
    <td>Other Data</td>
  </tr>
</table>
EOT

data = []
doc.at('table').search('tr').each do |tr|
  data << tr.search('td').map(&:text)
end

pp data

Which outputs:

[["John Smith", "I live here 123", "phone ###", "Birthday", "Other Data"],
["John Smyth", "I live here 456", "phone ###", "Birthday", "Other Data"]]

The code uses at to locate the first <table>, then iterates over each <tr> using search. For each row, it iterates over the cells and extracts their text.

Nokogiri's at finds the first occurrence of something, and returns a Node. search finds all occurrences and returns a NodeSet, which acts like an array. I'm using CSS accessors, instead of XPath, for simplicity.


As an FYI:

File.open(DOC_URL_FILE, 'w'){|file| file.write all_data} 

can be written more succinctly as:

File.write(DOC_URL_FILE, all_data)

I've been working on this problem for a while. Can you give me any more help?

Sigh...

Did you read the CSV documentation, especially the examples? What happens if, instead of defining data = [], we replace it with:

CSV.open("path/to/file.csv", "wb") do |data|

and wrap the loop with the CSV block, like:

CSV.open("path/to/file.csv", "wb") do |data|
  doc.at('table').search('tr').each do |tr|
    data << tr.search('td').map(&:text)
  end
end

That's not tested, but it's really that simple. Go and fiddle with that.
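A quick stdlib-only sanity check of that CSV.open pattern, with hard-coded rows standing in for the Nokogiri loop and a hypothetical people.csv path:

```ruby
require 'csv'

# Rows as the Nokogiri loop would produce them; hard-coded for illustration.
rows = [
  ["John Smith", "I live here 123", "phone ###", "Birthday", "Other Data"],
  ["John Smyth", "I live here 456", "phone ###", "Birthday", "Other Data"]
]

# CSV.open yields a writer; shoveling an array in appends one formatted line.
CSV.open("people.csv", "wb") do |csv|
  rows.each { |row| csv << row }
end

puts File.read("people.csv")
```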

Upvotes: 5
