Reputation: 2159
I am writing a scraper with Nokogiri, and I want to scrape a large HTML file.
Currently, I am scraping a large table; here is a small fragment:
<table id="rptBidTypes__ctl0_dgResults">
<tr>
<td align="left">S24327</td>
<td>
Airfield Lighting
<div>
<div>
<table cellpadding="5px" border="2" cellspacing="1px" width="100%" bgcolor=
"black">
<tr>
<td bgcolor="white">Abstract:<br />
This project is for the purchase and delivery, of various airfield
lighting, for a period of 36 months, with two optional 1 year renewals,
in accordance with the specifications, terms and conditions specified in
the solicitation.</td>
</tr>
</table>
</div>
</div>
</td>
</tr>
</table>
And here is the Ruby code I am using to scrape:
document = doc.search("table#rptBidTypes__ctl0_dgResults tr")
document[1..-1].each do |v|
  cells = v.search 'td'
  if cells.inner_html.length > 0
    data = {
      number: cells[0].text,
    }
  end
  ScraperWiki::save_sqlite(['number'], data)
end
Unfortunately, this isn't working for me. I only want to extract S24327, but I am getting the content of every table cell. How do I extract only the content of the first td?
Keep in mind that under this table, there are many table rows following the same format.
Upvotes: 0
Views: 178
Reputation: 37507
In CSS, table tr means a tr anywhere underneath the table, including inside nested tables. But table > tr means the tr must be a direct child of the table.
Also, it appears you only want the cell values, so you don't need to iterate. This will give you all such cells (the first in each row):
doc.search("table#rptBidTypes__ctl0_dgResults > tr > td[1]").map(&:text)
Upvotes: 1
Reputation: 27374
The problem is that your search matches two different things: the <tr> tag nested directly within the table with id rptBidTypes__ctl0_dgResults, and the <tr> tag within the table nested inside that parent table. When you loop through document[1..-1], you skip the first match, so you're actually selecting the second <tr> tag (the nested one) rather than the first one.
To select just the direct child <tr> tag, use:
document = doc.search("table#rptBidTypes__ctl0_dgResults > tr")
Then you can get the text for the first <td> tag with:
document.css('td')[0].text #=> "S24327"
Upvotes: 1
Reputation: 54984
The content of the first td would be:
doc.at("table#rptBidTypes__ctl0_dgResults td").text
Upvotes: 1