Reputation: 89313
I'd like to parse a simple table into a Ruby data structure. The table looks like this:
(Screenshot of the table: http://img232.imageshack.us/img232/446/picture5cls.png)
Edit: Here is the HTML
I'd like to parse it into an array of hashes, for example:
schedule[0]['NEW HAVEN'] == '4:12AM'
schedule[0]['Travel Time In Minutes'] == '95'
Any thoughts on how to do this? Perl has HTML::TableExtract, which I think would do the job, but I can't find any similar library for Ruby.
Upvotes: 3
Views: 2772
Reputation: 52326
You might like to try Hpricot (gem install hpricot, prefixed with the usual sudo on *nix systems).
I placed your HTML into input.html, then ran this:
require 'hpricot'
doc = Hpricot.XML(open('input.html'))
table = doc/:table
(table/:tr).each do |row|
  (row/:td).each do |cell|
    puts cell.inner_html
  end
end
which, for the first row, gives me
<span class="black">12:17AM </span>
<span class="black">
<a href="http://www.mta.info/mnr/html/planning/schedules/ref.htm"></a></span>
<span class="black">1:22AM </span>
<span class="black">
<a href="http://www.mta.info/mnr/html/planning/schedules/ref.htm"></a></span>
<span class="black">65</span>
<span class="black">TRANSFER AT STAMFORD (AR 1:01AM & LV 1:05AM) </span>
<span class="black">
N
</span>
So already we're down to the content of the TD tags. A little more work and you're about there.
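As a rough idea of what that "little more work" might look like, here is a minimal sketch that reduces each cell's inner HTML to plain text. The helper name cell_text is made up for illustration, and the regex-based tag stripping is an assumption that only suits simple markup like the output above. I believe Hpricot elements also offer inner_text, which may be the cleaner route.

```ruby
# Hypothetical helper (not part of Hpricot): reduce a cell's inner HTML to plain text.
# Assumes simple markup; a regex is not a general-purpose HTML parser.
def cell_text(inner_html)
  inner_html
    .gsub(/<[^>]*>/, ' ') # replace each tag with a space
    .gsub(/\s+/, ' ')     # collapse whitespace runs, including newlines
    .strip
end

# Three cells copied from the first row of output above.
cells = [
  '<span class="black">12:17AM </span>',
  "<span class=\"black\">\n<a href=\"http://www.mta.info/mnr/html/planning/schedules/ref.htm\"></a></span>",
  '<span class="black">65</span>'
]

texts = cells.map { |c| cell_text(c) }
p texts
```

Cells that contain only an empty link come out as empty strings, so you may want to decide whether to keep or drop them before pairing values with column headers.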
(BTW, the HTML looks a little malformed: you have <th> tags in <tbody>, which seems a bit perverse: <tbody> is fairly pointless if it's just going to be another level within <table>. It makes much more sense if your <tr><th>...</th></tr> stuff is in a separate <thead> section within the table. But it may not be "your" HTML, of course!)
Upvotes: 5
Reputation: 370357
In case there isn't a library that does this for Ruby, here's some code to get you started on writing it yourself:
require 'nokogiri'
doc = Nokogiri("<table><tr><th>la</th><th><b>lu</b></th></tr><tr><td>lala</td><td>lulu</td></tr><tr><td><b>lila</b></td><td>lolu</td></tr></table>")
# Collect the text of every cell, row by row; the first row is the header.
header, *rest = (doc/"tr").map do |row|
  row.children.map do |c|
    c.text
  end
end
# Turn the header names into symbols and use them as Struct members.
header.map!(&:to_sym)
item_struct = Struct.new(*header)
table = rest.map do |row|
  item_struct.new(*row)
end
table[1].lu #=> "lolu"
This code is far from perfect, obviously, but it should get you started.
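If you want the array of hashes from the question rather than Structs (handy when headers contain spaces, like 'Travel Time In Minutes'), a minimal sketch is to zip the header row with each data row. The rows below are hand-written sample data standing in for whatever the parser returns; the second data row is invented purely for illustration.

```ruby
# Sample rows as an array of arrays of cell text; the first row is the header.
# '4:12AM' and '95' come from the question, the last row is made up.
rows = [
  ['NEW HAVEN', 'Travel Time In Minutes'],
  ['4:12AM', '95'],
  ['5:00AM', '101']
]

header, *rest = rows

# Pair each header with the cell in the same column, then build a hash per row.
schedule = rest.map { |row| header.zip(row).to_h }

puts schedule[0]['NEW HAVEN']
puts schedule[0]['Travel Time In Minutes']
```

Note that zip pads short rows with nil, so a row with missing cells still produces a hash with every header as a key.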
Upvotes: 2