Reputation: 1048

How to parse inner_html inside for loop using XPath with nokogiri

I'm having trouble parsing, inside a for loop, only the inner_html that I have found. I want to use XPath again on just that content. I'm new to Ruby, so better solutions are welcome.

#!/usr/bin/ruby -w

require 'rubygems'
require 'nokogiri'

page1 = Nokogiri::HTML(open('mycontacts.html'))


# Search for nodes by xpath
page1.xpath('//html/body/form/div[2]/span/table/tbody/tr').each do |row|
  #puts a_tag.content
  puts "new row"
  row_html = row.inner_html

  puts row_html
  puts ""

  name = row_html.xpath("/td[1]").text
  puts "name is " + name

end

My output of each row in the for loop is something like:

new row
<th>First Name</th>
<th>Last Name</th>
<th>Phone</th>

Here's the error that I'm getting:

screen-scraper.rb:20:in `block in <main>': undefined method `xpath' for # (NoMethodError)

I want to parse each tr and get data like: Barney Rubble, Fred Flintstone

<table>
    <tbody>
        <tr>
            <th>First Name</th>
            <th>Last Name</th>
        </tr>
        <tr>
            <td>Fred</td>
            <td>Flintstone</td>
        </tr>
        <tr>
            <td>Barney</td>
            <td>Rubble</td>
        </tr>
    </tbody>
</table>

I'm open to suggestions. I was thinking it's easier to parse only the inner_html inside the for loop, but if there's an easier way to get at the node within the for loop, that would work as well.

Thanks....

Upvotes: 2

Answers (3)

the Tin Man

Reputation: 160551

...I've noticed that Firebug produces some xpath expressions that don't work well with Nokogiri (or its dependency). I'm having better luck with Chrome's Debug XPath output.

The problem with Firebug, or many other XPath outputs from a browser, is they follow the HTML spec when generating the XPath and synthesize a <tbody> tag, even if the original source doesn't have one. The XPath reflects that.

We pass the raw HTML to Nokogiri for parsing, along with the erroneous XPath, and Nokogiri can't find the <table><tbody><tr> chain.

Here's a for instance. Starting with this HTML:

<html>
  <body>
    <table>
      <tr>
        <td>
          foo
        </td>
      </tr>
    </table>
  </body>
</html>

Save it to a file and open it in Firefox, Chrome or Safari, then view the source, and look at it in Firebug or its equivalent.

You'll see something like this, which came from Firefox:

<table>
  <tbody><tr>
    <td>
      foo
    </td>
  </tr>
</tbody></table>

To fix this, don't rely on the XPath generated by the browser, and confirm the table's structure by looking at only the RAW HTML in a text editor. The "view source" option is useful for some things, but if you see any <tbody> tags be suspicious and revert to checking with the editor.

Also, you don't need the entire chain of tags to reach an inner tag. Instead, look for some landmarks along the way that will help you find your target node(s). Most HTML pages these days have class and id attributes on important tags. id attributes in particular are great because they have to be unique within the page. If other attributes exist that are unique, those can work too.

Sometimes you won't find an identifying tag immediately prior to the one you want, but there is something embedded in it. Then, locate that embedded tag and step up the chain until you find what you want. Using XPath you can use the .. (parent), but with CSS you have to rely on Nokogiri::XML::Node's parent method because Nokogiri and CSS don't support a selector for the parent (yet).

Upvotes: 1

robertjlooby

Reputation: 7220

The problem is that row_html, obtained by Nokogiri::XML::Node#inner_html, is just a String. To call xpath on it again, you must first parse the string again with Nokogiri using Nokogiri::HTML(row_html).

A better way though would be to never call inner_html in the first place, leave row as a Nokogiri::XML::Node, and then call row.xpath(...).

For example, with a table like you provided and output you wanted:

page1.xpath('//html/body/form/div[2]/span/table/tbody/tr').each do |row|
    # element_children skips the whitespace text nodes that row.children includes
    cells = row.element_children
    puts "#{cells[0].text} #{cells[1].text}"
end

Upvotes: 1

Arup Rakshit

Reputation: 118261

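You could fix it by replacing name = row_html.xpath("/td[1]").text with name = Nokogiri::HTML(row_html).xpath("//td[1]").text. Note the //: a path that starts with a single / is anchored at the document root, so /td[1] would match nothing in the re-parsed document. There may be a better technique if you share the full HTML you have.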

Nokogiri::HTML(row_html) returns an instance of Nokogiri::HTML::Document. #xpath, #css and #search are all instance methods of that class, so you can call them on the result.

Assuming your document contains the HTML table you provided, you could approach it as shown below.

I tested the code, and it should give you the result you want:

require "nokogiri"

doc = Nokogiri::HTML(<<-eohl)
<table>
    <tbody>
        <tr>
            <th>First Name</th>
            <th>Last Name</th>
        </tr>
        <tr>
            <td>Fred</td>
            <td>Flintstone</td>
        </tr>
        <tr>
            <td>Barney</td>
            <td>Rubble</td>
        </tr>
    </tbody>
</table>
eohl

doc.css("table > tbody > tr").each do |nd|
  nd.children.each { |i| print i.text.strip, "  " unless i.text.strip == "" }
  print "\n"
end
# >> First Name  Last Name  
# >> Fred  Flintstone  
# >> Barney  Rubble

Now see what #inner_html returns, which in turn explains why you got the NoMethodError:

require "nokogiri"

doc = Nokogiri::HTML(<<-eohl)
<table>
    <tbody>
        <tr>
            <th>First Name</th>
            <th>Last Name</th>
        </tr>
        <tr>
            <td>Fred</td>
            <td>Flintstone</td>
        </tr>
        <tr>
            <td>Barney</td>
            <td>Rubble</td>
        </tr>
    </tbody>
</table>
eohl

doc.search("table > tbody > tr").each do |nd|
  p nd.inner_html.class
end

# >> String
# >> String
# >> String

Upvotes: 1
