Reputation: 44958
I would like to parse a table using Nokogiri. I'm doing it this way:
def parse_table_nokogiri(html)
  doc = Nokogiri::HTML(html)
  doc.search('table > tr').each do |row|
    row.search('td/font/text()').each do |col|
      p col.to_s
    end
  end
end
Some of the tables I have contain rows like this:
<tr>
<td>
Some text
</td>
</tr>
...and some have this:
<tr>
<td>
<font> Some text </font>
</td>
</tr>
My XPath expression works for the second scenario but not the first. Is there an XPath expression that I could use that would give me the text from the innermost node of the cell so that I can handle both scenarios?
I've incorporated the changes into my snippet:
require 'cgi'
require 'nokogiri'

def parse_table_nokogiri(html)
  doc = Nokogiri::HTML(html)
  # Pick the table with the most rows, then skip its first row.
  table = doc.xpath('//table').max_by { |t| t.xpath('.//tr').length }
  rows = table.search('tr')[1..-1]
  rows.each do |row|
    # Stripped, unescaped text of every text node inside the row's cells.
    cells = row.search('td//text()').collect { |text| CGI.unescapeHTML(text.to_s.strip) }
    cells.each do |col|
      puts col
      puts "_____________"
    end
  end
end
Upvotes: 3
Views: 6631
Reputation: 243459
Use:
td//text()[normalize-space()]
This selects every text-node descendant that is not whitespace-only, for any td child of the current node (the tr already selected in your code).
Or, if you want to select all text-node descendants, regardless of whether they are whitespace-only or not:
td//text()
UPDATE:
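The OP has indicated in a comment that he is getting an unwanted td whose content is just a non-breaking space (&nbsp;, U+00A0).
To also exclude td cells whose content consists only of (one or more) nbsp characters, use:
td//text()[translate(normalize-space(), '&#xA0;', '')]
Here &#xA0; stands for the literal U+00A0 character. normalize-space() does not treat it as whitespace, so translate() is used to remove it. When the expression is passed as a plain Ruby string rather than embedded in XML, write that character as "\u00A0" in a double-quoted string.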
Upvotes: 6
Reputation: 303136
Simple (but not DRY) way of using alternation:
require 'nokogiri'
doc = Nokogiri::HTML <<ENDHTML
<body><table><thead><tr><td>NOT THIS</td></tr></thead><tr>
<td>foo</td>
<td><font>bar</font></td>
</tr></table></body>
ENDHTML
p doc.xpath( '//table/tr/td/text()|//table/tr/td/font/text()' )
#=> [#<Nokogiri::XML::Text:0x80428814 "foo">,
#=> #<Nokogiri::XML::Text:0x804286fc "bar">]
See XPath with optional element in hierarchy for a more DRY answer.
In this case, however, you can simply do:
p doc.xpath( '//table/tr/td//text()' )
#=> [#<Nokogiri::XML::Text:0x80428814 "foo">,
#=> #<Nokogiri::XML::Text:0x804286fc "bar">]
Note that your table structure (and mine above), which does not have an explicit tbody element, is invalid for XHTML. Given your explicit table > tr above, however, I assume that you have a reason for this.
Upvotes: 1
Reputation: 37507
Simple:
doc.search('//td').each do |cell|
  puts cell.content
end
Upvotes: 3