tengee

Reputation: 103

Ruby Nokogiri HTML table scraping using XPath

I am trying to get the "cell4" value from an HTML table like the following, using Ruby, XPath, and Nokogiri:

<html>
<body>

<h1>Heading</h1>

<p>paragraph.</p>

<h4>Two rows and three columns:</h4>
<table border="0">
<tr>
  <td>cell1</td>
  <td>cell2</td>
</tr>
<tr>
  <td>cell3</td>
  <td>cell4</td>
</tr>

</table>

</body>
</html>

I have the following simple code, but it returns []. This question must be simple enough, but I couldn't find anything on Google that hits the spot.

#!/usr/bin/ruby -w

require 'rubygems'
require 'nokogiri'

page1 = Nokogiri::HTML('test_simple.html')

a = page1.xpath("//html/body/table/tr[2]/td[2]")
p a

The XPath works as intended with REXML, so the expression itself is correct, but it does not work with Nokogiri. Since this is going to be used on larger HTML files, REXML cannot be used. The problem does not seem to be limited to tables; the contents of other tags cannot be scraped either.

Upvotes: 3

Views: 4156

Answers (2)

tengee

Reputation: 103

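Thanks to taro's comment, I was able to solve the issue with a little effort. Nokogiri::HTML parses the string it is given as markup, so passing the file name parsed the text 'test_simple.html' itself rather than the file's contents. Passing an open file handle fixes that.

Here is the corrected code: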

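#!/usr/bin/ruby -w
require 'rubygems'
require 'nokogiri'
# Pass the opened file (an IO object), not the file name string
page1 = Nokogiri::HTML(open('test_simple.html'))
# .text extracts the cell's text instead of returning the NodeSet
a = page1.xpath("/html/body/table/tr[2]/td[2]").text
p a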

Upvotes: 4

Matt

Reputation: 17629

IMHO it is a lot easier to work with the CSS API in Nokogiri (XPath is always giving me headaches):

page.css('td') # returns a NodeSet of the 4 table cell nodes
page.css('td')[3] # returns the 4th 'td' node; indexing starts at 0

Upvotes: 7
