Reputation: 2283
Hpricot + Ruby XML parsing and logical selection.
Objective: Find all title written by author Bob.
My XML file:
<rss>
<channel>
<item>
<title>Book1</title>
<pubDate>march 1 2010</pubDate>
<author>Bob</author>
</item>
<item>
<title>book2</title>
<pubDate>october 4 2009</pubDate>
<author>Bill</author>
</item>
<item>
<title>book3</title>
<pubDate>June 5 2010</pubDate>
<author>Steve</author>
</item>
</channel>
</rss>
#my Hpricot, running this code returns no output, however the search pattern works on its own.
(doc % :rss % :channel / :item).each do |item|
a=item.search("author[text()*='Bob']")
#puts "FOUND" if a.include?"Bob"
puts item.at("title") if a.include?"Bob"
end
Upvotes: 1
Views: 562
Reputation: 160551
One of the ideas behind XPath is it allows us to navigate a DOM similarly to a disk directory:
require 'hpricot'
xml = <<EOT
<rss>
<channel>
<item>
<title>Book1</title>
<pubDate>march 1 2010</pubDate>
<author>Bob</author>
</item>
<item>
<title>book2</title>
<pubDate>october 4 2009</pubDate>
<author>Bill</author>
</item>
<item>
<title>book3</title>
<pubDate>June 5 2010</pubDate>
<author>Steve</author>
</item>
<item>
<title>Book4</title>
<pubDate>march 1 2010</pubDate>
<author>Bob</author>
</item>
</channel>
</rss>
EOT
doc = Hpricot(xml)
titles = (doc / '//author[text()="Bob"]/../title' )
titles # => #<Hpricot::Elements[{elem <title> "Book1" </title>}, {elem <title> "Book4" </title>}]>
That means: "find all the books by Bob, then look up one level and find the title tag".
I added an extra book by "Bob" to test getting all occurrences.
To get the item containing a book by Bob, just move back up a level:
items = (doc / '//author[text()="Bob"]/..' )
puts items # => nil
# >> <item>
# >> <title>Book1</title>
# >> <pubdate>march 1 2010</pubdate>
# >> <author>Bob</author>
# >> </item>
# >> <item>
# >> <title>Book4</title>
# >> <pubdate>march 1 2010</pubdate>
# >> <author>Bob</author>
# >> </item>
I also figured out what (doc % :rss % :channel / :item)
is doing. It's equivalent to nesting the searches, minus the wrapping parenthesis, and these should all be the same in Hpricot-ese:
(doc % :rss % :channel / :item).size # => 4
(((doc % :rss) % :channel) / :item).size # => 4
(doc / '//rss/channel/item').size # => 4
(doc / 'rss channel item').size # => 4
Because '//rss/channel/item'
is how you'd normally see an XPath accessor, and 'rss channel item'
is a CSS accessor, I'd recommend using those formats for maintenance and clarity.
Upvotes: 2
Reputation: 303261
If you're not set on Hpricot, here's one way to do this with XPath in Nokogiri:
require 'nokogiri'
doc = Nokogiri::XML( my_rss_string )
bobs_titles = doc.xpath("//title[parent::item/author[text()='Bob']]")
p bobs_titles.map{ |node| node.text }
#=> ["Book1"]
Edit: @theTinMan's XPath also works well, is more readable, and may very well be faster:
bobs_titles = doc.xpath("//author[text()='Bob']/../title")
Upvotes: 3