Alan W. Smith

Reputation: 25425

Extract text between nodes with Nokogiri in a Ruby script

I'm using Nokogiri (v 1.6.6) in Ruby (v 2.2) to scrape data from HTML files. The target data is in <p> elements as shown below. I'm able to slurp all the text content with:

require 'nokogiri'

doc = Nokogiri::HTML(DATA.read)

doc.css("div.listing > p").each do |p|
  puts p.text
end

__END__
<div class="listing">
  <p><span>1</span> Details1 <span>info1</span></p>
  <p><span>2</span> Details2 <span>info2</span></p>
  <p><span>3</span> Details3 <span>info3</span></p>
</div>

Which returns:

1 Details1 info1
2 Details2 info2
3 Details3 info3

While I can easily parse out the text inside the <span> tags, I haven't figured out how to get the "Details#" text between them. It's easy enough to do via a regex, but I'd like to see if there's a way to do it directly from Nokogiri. The goal is to return:

Details1
Details2
Details3

Is that possible using Nokogiri's built-in functionality?

Upvotes: 0

Views: 1103

Answers (2)

Alan W. Smith

Reputation: 25425

Here's what I ended up with:

doc.css("div.listing > p").each do |p|
  puts p.at_xpath('./text()').text.strip
end

According to "Get text directly inside a tag in Nokogiri", the XPath text() node test will

get all the direct children with text, but not any further sub-children

That's the behavior I'm seeing and it produced the expected results.

Upvotes: 0

nisevi

Reputation: 626

I think you will find the answer if you dive a little into "Getting Mugged by Nokogiri". In the meantime, here is my approach to your question:

irb(main):061:0> doc = Nokogiri::HTML("<div class='listing'> <p><span>1</span> Details1 <span>info1</span></p> <p><span>2</span> Details2 <span>info2</span></p> <p><span>3</span> Details3 <span>info3</span></p> </div>")

That will give you a Nokogiri::HTML::Document object called doc:

=> #<Nokogiri::HTML::Document:0x2ab03653f26c name="document" children=[#<Nokogiri::XML::DTD:0x2ab03653ef4c name="html">, #<Nokogiri::XML::Element:0x2ab03653ece0 name="html" children=[#<Nokogiri::XML::Element:0x2ab03653eb00 name="body" children=[#<Nokogiri::XML::Element:0x2ab03653e920 name="div" attributes=[#<Nokogiri::XML::Attr:0x2ab03653e8bc name="class" value="listing">] children=[#<Nokogiri::XML::Text:0x2ab03653e484 " ">, #<Nokogiri::XML::Element:0x2ab03653e3d0 name="p" children=[#<Nokogiri::XML::Element:0x2ab03653e1f0 name="span" children=[#<Nokogiri::XML::Text:0x2ab03653e010 "1">]>, #<Nokogiri::XML::Text:0x2ab03653de58 " Details1 ">, #<Nokogiri::XML::Element:0x2ab03653dda4 name="span" children=[#<Nokogiri::XML::Text:0x2ab03653db9c "info1">]>]>, #<Nokogiri::XML::Text:0x2ab03653d8f4 " ">, #<Nokogiri::XML::Element:0x2ab03653d840 name="p" children=[#<Nokogiri::XML::Element:0x2ab03653d660 name="span" children=[#<Nokogiri::XML::Text:0x2ab03653d480 "2">]>, #<Nokogiri::XML::Text:0x2ab03653d2dc " Details2 ">, #<Nokogiri::XML::Element:0x2ab03653d228 name="span" children=[#<Nokogiri::XML::Text:0x2ab03653d048 "info2">]>]>, #<Nokogiri::XML::Text:0x2ab03653cdb4 " ">, #<Nokogiri::XML::Element:0x2ab03653cd00 name="p" children=[#<Nokogiri::XML::Element:0x2ab03653cb20 name="span" children=[#<Nokogiri::XML::Text:0x2ab03653c940 "3">]>, #<Nokogiri::XML::Text:0x2ab03653c79c " Details3 ">, #<Nokogiri::XML::Element:0x2ab03653c6e8 name="span" children=[#<Nokogiri::XML::Text:0x2ab03653c508 "info3">]>]>, #<Nokogiri::XML::Text:0x2ab03653c274 " ">]>]>]>]>

And then you will be able to iterate over the object:

"Traverse method walks through all of a node’s children recursively. We check whether the node is a text node and if its parent node is a paragraph."

irb(main):068:0> doc.at_css("body").traverse do |node|
irb(main):069:1*   if node.text? && (node.parent.name == "p")
irb(main):070:2>     puts node.content
irb(main):071:2>   end
irb(main):072:1> end
Details1 
Details2 
Details3 
=> nil
irb(main):073:0>

I have to say that I didn't know about traverse so your question was really helpful for me as I use Nokogiri daily. I hope you find this answer useful.

Upvotes: 1
