Slicekick

Reputation: 2159

Only parsing outer element

I am writing a scraper with Nokogiri, and I want to scrape a large HTML file.

Currently, I am scraping a large table; here is a small fragment:

<table id="rptBidTypes__ctl0_dgResults">
    <tr>
      <td align="left">S24327</td>

      <td>
        Airfield Lighting

        <div>
          <div>
          <table cellpadding="5px" border="2" cellspacing="1px" width="100%" bgcolor=
          "black">
              <tr>
                <td bgcolor="white">Abstract:<br />
                This project is for the purchase and delivery, of various airfield
                lighting, for a period of 36 months, with two optional 1 year renewals,
                in accordance with the specifications, terms and conditions specified in
                the solicitation.</td>
              </tr>
            </table>
          </div>
        </div>
      </td>
    </tr>
</table>

And here is the Ruby code I am using to scrape:

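document = doc.search("table#rptBidTypes__ctl0_dgResults tr")
# Skip the first row, then process each remaining row.
document[1..-1].each do |v|
  cells = v.search 'td'
  if cells.inner_html.length > 0
    data = {
      number: cells[0].text
    }
    # Save inside the `if`, so `data` is never nil when it is written.
    ScraperWiki::save_sqlite(['number'], data)
  end
end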

Unfortunately, this isn't working for me. I only want to extract S24327, but I am getting the content of every table cell. How do I extract only the content of the first td?

Keep in mind that under this table, there are many table rows following the same format.

Upvotes: 0

Views: 178

Answers (3)

Mark Thomas

Reputation: 37507

In CSS, table tr means tr anywhere underneath the table, including nested tables. But table > tr means the tr must be a direct child of the table.

Also, it appears you only want the cell values, so you don't need to iterate. This will give you all such cells (the first in each row):

doc.search("table#rptBidTypes__ctl0_dgResults > tr > td[1]").map(&:text)

Upvotes: 1

Chris Salzberg

Reputation: 27374

The problem is that your search is matching two different things: the <tr> tag nested directly within the table with id rptBidTypes__ctl0_dgResults, and the <tr> tag within the table nested inside that parent table. When you loop through document[1..-1] you're actually selecting the second <tr> tag rather than the first one.

To select just the direct child <tr> tag, use:

document = doc.search("table#rptBidTypes__ctl0_dgResults > tr")

Then you can get the text for the <td> tag with:

document.css('td')[0].text   #=> "S24327"

Upvotes: 1

pguardiario

Reputation: 54984

The content of the first td would be:

doc.at("table#rptBidTypes__ctl0_dgResults td").text

Upvotes: 1
