Reputation: 25425
I'm using Nokogiri (v 1.6.6) in Ruby (v 2.2) to scrape data from HTML files. The target data is in <p> elements as shown below. I'm able to slurp all the text content with:
require 'nokogiri'
doc = Nokogiri::HTML(DATA.read)
doc.css("div.listing > p").each do |p|
  puts p.text
end
__END__
<div class="listing">
<p><span>1</span> Details1 <span>info1</span></p>
<p><span>2</span> Details2 <span>info2</span></p>
<p><span>3</span> Details3 <span>info3</span></p>
</div>
Which returns:
1 Details1 info1
2 Details2 info2
3 Details3 info3
While I can easily parse out the text inside the <span> tags, I haven't figured out how to get the "Details#" text between them. It's easy enough to do via a regex, but I'd like to see if there's a way to do it directly from Nokogiri. The goal is to return:
Details1
Details2
Details3
Is that possible using Nokogiri's built-in functionality?
Upvotes: 0
Views: 1103
Reputation: 25425
Here's what I ended up with:
doc.css("div.listing > p").each do |p|
  puts p.at_xpath('./text()').text.strip
end
According to "Get text directly inside a tag in Nokogiri", the XPath text() node test will "get all the direct children with text, but not any further sub-children". That's the behavior I'm seeing, and it produced the expected results.
Upvotes: 0
Reputation: 626
I think that if you dive a little into "Getting Mugged by Nokogiri" you will find the answer. Still, here is my approach to your question:
irb(main):061:0> doc = Nokogiri::HTML("<div class='listing'> <p><span>1</span> Details1 <span>info1</span></p> <p><span>2</span> Details2 <span>info2</span></p> <p><span>3</span> Details3 <span>info3</span></p> </div>")
That will give you a Nokogiri object called doc:
=> #<Nokogiri::HTML::Document:0x2ab03653f26c name="document" children=[#<Nokogiri::XML::DTD:0x2ab03653ef4c name="html">, #<Nokogiri::XML::Element:0x2ab03653ece0 name="html" children=[#<Nokogiri::XML::Element:0x2ab03653eb00 name="body" children=[#<Nokogiri::XML::Element:0x2ab03653e920 name="div" attributes=[#<Nokogiri::XML::Attr:0x2ab03653e8bc name="class" value="listing">] children=[#<Nokogiri::XML::Text:0x2ab03653e484 " ">, #<Nokogiri::XML::Element:0x2ab03653e3d0 name="p" children=[#<Nokogiri::XML::Element:0x2ab03653e1f0 name="span" children=[#<Nokogiri::XML::Text:0x2ab03653e010 "1">]>, #<Nokogiri::XML::Text:0x2ab03653de58 " Details1 ">, #<Nokogiri::XML::Element:0x2ab03653dda4 name="span" children=[#<Nokogiri::XML::Text:0x2ab03653db9c "info1">]>]>, #<Nokogiri::XML::Text:0x2ab03653d8f4 " ">, #<Nokogiri::XML::Element:0x2ab03653d840 name="p" children=[#<Nokogiri::XML::Element:0x2ab03653d660 name="span" children=[#<Nokogiri::XML::Text:0x2ab03653d480 "2">]>, #<Nokogiri::XML::Text:0x2ab03653d2dc " Details2 ">, #<Nokogiri::XML::Element:0x2ab03653d228 name="span" children=[#<Nokogiri::XML::Text:0x2ab03653d048 "info2">]>]>, #<Nokogiri::XML::Text:0x2ab03653cdb4 " ">, #<Nokogiri::XML::Element:0x2ab03653cd00 name="p" children=[#<Nokogiri::XML::Element:0x2ab03653cb20 name="span" children=[#<Nokogiri::XML::Text:0x2ab03653c940 "3">]>, #<Nokogiri::XML::Text:0x2ab03653c79c " Details3 ">, #<Nokogiri::XML::Element:0x2ab03653c6e8 name="span" children=[#<Nokogiri::XML::Text:0x2ab03653c508 "info3">]>]>, #<Nokogiri::XML::Text:0x2ab03653c274 " ">]>]>]>]>
And then you will be able to iterate over the object:
"Traverse method walks through all of a node’s children recursively. We check whether the node is a text node and if its parent node is a paragraph."
irb(main):068:0> doc.at_css("body").traverse do |node|
irb(main):069:1* if node.text? && (node.parent.name == "p")
irb(main):070:2> puts node.content
irb(main):071:2> end
irb(main):072:1> end
Details1
Details2
Details3
=> nil
I have to say that I didn't know about traverse, so your question was really helpful for me, as I use Nokogiri daily. I hope you find this answer useful.
Upvotes: 1