cage

Reputation: 103

Need help locating the text of an element with a class

I have a page that I query with the command page.css("table.vc_result span a"), but I am not able to get the second and third span elements in it:

File

<table border="0" bgcolor="#FFFFFF" onmouseout="resDef(this)" onmouseover="resEmp(this)" class="vc_result">
<tbody>
  <tr>
    <td width="260" valign="top">
      <table>
        <tbody>
          <tr>
            <td width="40%" valign="top"><span><a class="cAddName" href="/USA/Illinois/Chicago/Yellow+Page+Advertising+And+Telephone+Directory+Publica/gateway-megatech_13478733">
            Gateway Megatech</a></span><br>
            <span class="cAddText">P.O. BOX 99682, Chicago IL 60696</span></td>
          </tr>

          <tr>
            <td><span class="cAddText">Cook County Illinois</span></td>
          </tr>

          <tr>
            <td><span class="cAddCategory">Yellow Page Advertising And Telephone
            Directory Publica Chicago</span></td>
          </tr>
        </tbody>
      </table>
    </td>

    <td width="260">
      <table align="center">
        <tbody>
          <tr>
            <td>
              <table>
                <tbody>
                  <tr>
                    <td>
                      <div style=
                      "background: url('images/listings.png');background-position: -0px -0px; width: 16px; height: 16px">
                      </div>
                    </td>

                    <td><font style="font-weight:bold">847-506-7800</font></td>
                  </tr>
                </tbody>
              </table>
            </td>
          </tr>

          <tr>
            <td>
              <table>
                <tbody>
                  <tr>
                    <td>
                      <div style=
                      "background: url('images/listings.png');background-position: -0px -78px; width: 16px; height: 16px">
                      </div>
                    </td>

                    <td><a href=
                    "/USA/Illinois/Chicago/Yellow+Page+Advertising+And+Telephone+Directory+Publica/gateway-megatech_13478733"
                    class="cAddNearby">Businesses near 60696</a></td>
                  </tr>
                </tbody>
              </table>
            </td>
          </tr>

          <tr>
            <td>
              <table>
                <tbody>
                  <tr>
                    <td></td>
                  </tr>
                </tbody>
              </table>
            </td>
          </tr>
        </tbody>
      </table>
    </td>
  </tr>
</tbody>
</table>

...This is not the complete file; there are plenty more span entries in it.

The code I am using can locate the exact text of the nested span > a element, but it cannot retrieve the text of the spans associated with it.

require 'rubygems'
require 'nokogiri'
require 'open-uri'
name="yellow"
city="Chicago"
state="IL"

burl="http://www.sitename.com/"
url="#{burl}Business_Listings.php?name=#{name}&city=#{city}&state=#{state}&current=1&Submit=Search"
page = Nokogiri::HTML(open(url)) 

rows = page.css("table.vc_result span a")
rows.each do |arow|

  if arow.text == "Gateway Megatech"
    puts(arow.next_element.text)
    puts("Capturing the next span text")
    found="Got it"
    break       
  else
    puts("Found nothing")
    found="None"
  end
end

Upvotes: 1

Views: 93

Answers (2)

Phrogz

Reputation: 303178

Assuming that each business is a new <tr> inside the top table you have supplied, the following code gives you an array of Hashes with the values:

require 'nokogiri'
doc = Nokogiri.HTML(html)

business_rows = doc.css('table.vc_result > tbody > tr')
details = business_rows.map do |tr|
  # Inside the first <td> of the row, find a <td> with a.cAddName in it
  # (the leading dot keeps the inner search relative to that <td>)
  business = tr.at_xpath('td[1]//td[.//a[@class="cAddName"]]')
  name     = business.at_css('a.cAddName').text.strip
  address  = business.at_css('.cAddText').text.strip

  # Inside the second <td> of the row, find the first <font> tag
  phone    = tr.at_xpath('td[2]//font').text.strip

  # Return a hash of values for this row, using the capitalization requested
  { Name:name, Address:address, Phone:phone }
end

p details
#=> [
#=>   {
#=>     :Name=>"Gateway Megatech",
#=>     :Address=>"P.O. BOX 99682, Chicago IL 60696",
#=>     :Phone=>"847-506-7800"
#=>   }
#=> ]

This is pretty fragile, but it works for what you've given, and there do not seem to be very many semantic items to hang onto in this insane, horrific abuse of HTML.

Upvotes: 2

nTraum

Reputation: 1426

Parsing HTML with regular expressions is a bad idea, because HTML is not a regular language. Ideally, you want to parse the HTML / XML into a tree structure (a DOM).

http://nokogiri.org/ is pretty popular.

Upvotes: 0
