Reputation: 103
I have a file that I have got using the command page.css("table.vc_result span a"), I am not able to get the second and third Span element of the file:
File
<table border="0" bgcolor="#FFFFFF" onmouseout="resDef(this)" onmouseover="resEmp(this)" class="vc_result">
<tbody>
<tr>
<td width="260" valign="top">
<table>
<tbody>
<tr>
<td width="40%" valign="top"><span><a class="cAddName" href="/USA/Illinois/Chicago/Yellow+Page+Advertising+And+Telephone+Directory+Publica/gateway-megatech_13478733">
Gateway Megatech</a></span><br>
<span class="cAddText">P.O. BOX 99682, Chicago IL 60696</span></td>
</tr>
<tr>
<td><span class="cAddText">Cook County Illinois</span></td>
</tr>
<tr>
<td><span class="cAddCategory">Yellow Page Advertising And Telephone
Directory Publica Chicago</span></td>
</tr>
</tbody>
</table>
</td>
<td width="260">
<table align="center">
<tbody>
<tr>
<td>
<table>
<tbody>
<tr>
<td>
<div style=
"background: url('images/listings.png');background-position: -0px -0px; width: 16px; height: 16px">
</div>
</td>
<td><font style="font-weight:bold">847-506-7800</font></td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td>
<table>
<tbody>
<tr>
<td>
<div style=
"background: url('images/listings.png');background-position: -0px -78px; width: 16px; height: 16px">
</div>
</td>
<td><a href=
"/USA/Illinois/Chicago/Yellow+Page+Advertising+And+Telephone+Directory+Publica/gateway-megatech_13478733"
class="cAddNearby">Businesses near 60696</a></td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td>
<table>
<tbody>
<tr>
<td></td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
...This is not the complete file there are plenty more span entries in that file.
The code that I am using is able to locate the exact text but not able to associate it with the text of the nested element Span A.
require 'rubygems'
require 'nokogiri'
require 'open-uri'
name="yellow"
city="Chicago"
state="IL"
burl="http://www.sitename.com/"
url="#{burl}Business_Listings.php?name=#{name}&city=#{city}&state=#{state}¤t=1&Submit=Search"
page = Nokogiri::HTML(open(url))
rows = page.css("table.vc_result span a")
rows.each do |arow|
if arow.text == "Gateway Megatech"
puts(arow.next_element.text)
puts("Capturing the next span text")
found="Got it"
break
else
puts("Found nothing")
found="None"
end
end
Upvotes: 1
Views: 93
Reputation: 303178
Assuming that each business is a new <tr>
inside the top table you have supplied, the following code gives you an array of Hashes with the values:
require 'nokogiri'
doc = Nokogiri.HTML(html)
business_rows = doc.css('table.vc_result > tbody > tr')
details = business_rows.map do |tr|
# Inside the first <td> of the row, find a <td> with a.cAddName in it
business = tr.at_xpath('td[1]//td[//a[@class="cAddName"]]')
name = business.at_css('a.cAddName').text.strip
address = business.at_css('.cAddText').text.strip
# Inside the second <td> of the row, find the first <font> tag
phone = tr.at_xpath('td[2]//font').text.strip
# Return a hash of values for this row, using the capitalization requested
{ Name:name, Address:address, Phone:phone }
end
p details
#=> [
#=> {
#=> :Name=>"Gateway Megatech",
#=> :Address=>"P.O. BOX 99682, Chicago IL 60696",
#=> :Phone=>"847-506-7800"
#=> }
#=> ]
This is pretty fragile, but works for what you've given, and there do not seem to be very many semantic items to hang onto in this insane, horrorific abuse of HTML.
Upvotes: 2
Reputation: 1426
Parsing HTML with regular expressions is a bad idea, because HTML is not a regular language. Ideally, you want to parse the DOM / XML to a tree structure.
http://nokogiri.org/ is pretty popular.
Upvotes: 0