Hpricot XML text search

Question

Hpricot + Ruby XML parsing and logical selection.

Objective: Find all titles written by the author Bob.

My XML file:




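<rss>
<channel>
<item>
<title>Book1</title>
<pubDate>march 1 2010</pubDate>
<author>Bob</author>
</item>

<item>
<title>book2</title>
<pubDate>october 4 2009</pubDate>
<author>Bill</author>
</item>

<item>
<title>book3</title>
<pubDate>June 5 2010</pubDate>
<author>Steve</author>
</item>
</channel>
</rss>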



# My Hpricot code. Running it produces no output, although the search pattern works on its own.
 (doc % :rss % :channel / :item).each do |item|

        a=item.search("author[text()*='Bob']")

        #puts "FOUND" if a.include?"Bob"
        puts item.at("title") if a.include?"Bob"

  end

the Tin Man · Accepted Answer

One of the ideas behind XPath is that it lets us navigate a DOM much like a directory tree on disk:

require 'hpricot'

xml = <<EOT
    
        
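<rss>
    <channel>
        <item>
            <title>Book1</title>
            <pubDate>march 1 2010</pubDate>
            <author>Bob</author>
        </item>

        <item>
            <title>book2</title>
            <pubDate>october 4 2009</pubDate>
            <author>Bill</author>
        </item>

        <item>
            <title>book3</title>
            <pubDate>June 5 2010</pubDate>
            <author>Steve</author>
        </item>

        <item>
            <title>Book4</title>
            <pubDate>march 1 2010</pubDate>
            <author>Bob</author>
        </item>

    </channel>
</rss>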
        

    

EOT

doc = Hpricot(xml)

titles = (doc / '//author[text()="Bob"]/../title' )
titles # => #<Hpricot::Elements[{elem <title> "Book1" </title>}, {elem <title> "Book4" </title>}]>

That means: "find every author tag whose text is Bob, then go up one level to its parent and find the title tag".
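For anyone without Hpricot installed, here is a minimal sketch of the same XPath expression using REXML, which ships with Ruby, as a stand-in. The `SAMPLE_XML` constant and the `titles_by` helper are names I made up for illustration, and the sample feed leaves out the date element to stay short. REXML's XPath support is not identical to Hpricot's, so treat this as a way to experiment with the expression rather than a drop-in replacement.

```ruby
require 'rexml/document'

# Cut-down sample feed (hypothetical data modelled on the XML above).
SAMPLE_XML = <<~XML
  <rss>
    <channel>
      <item><title>Book1</title><author>Bob</author></item>
      <item><title>book2</title><author>Bill</author></item>
      <item><title>book3</title><author>Steve</author></item>
      <item><title>Book4</title><author>Bob</author></item>
    </channel>
  </rss>
XML

# Hypothetical helper: returns the title text of every item whose
# author element has exactly the given text.
# Note: the author name is interpolated as-is, so a name containing a
# double quote would break the expression.
def titles_by(xml, author)
  doc = REXML::Document.new(xml)
  # Find the matching author, step up to its parent item, then down to title.
  path = %(//author[text()="#{author}"]/../title)
  REXML::XPath.match(doc, path).map(&:text)
end

puts titles_by(SAMPLE_XML, 'Bob')
```

Dropping the trailing `/title` from the path returns the parent item elements instead of their titles.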

I added an extra book by "Bob" to test getting all occurrences.

To get the item containing a book by Bob, just move back up a level:

items = (doc / '//author[text()="Bob"]/..' )
puts items # => nil
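# >> <item>
# >>             <title>Book1</title>
# >>             <pubDate>march 1 2010</pubDate>
# >>             <author>Bob</author>
# >>         </item>
# >> <item>
# >>             <title>Book4</title>
# >>             <pubDate>march 1 2010</pubDate>
# >>             <author>Bob</author>
# >>         </item>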

I also figured out what (doc % :rss % :channel / :item) is doing. In Hpricot, % is an alias for at (first match) and / is an alias for search (all matches), so it's equivalent to nesting the searches, minus the wrapping parentheses. These should all be the same in Hpricot-ese:

(doc % :rss % :channel / :item).size # => 4
(((doc % :rss) % :channel) / :item).size # => 4
(doc / '//rss/channel/item').size # => 4
(doc / 'rss channel item').size # => 4

Because '//rss/channel/item' is how you'd normally see an XPath accessor, and 'rss channel item' is a CSS accessor, I'd recommend using those formats for maintenance and clarity.
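To see why the nested form and the single path form agree, here is a rough sketch of the same idea in REXML, used as a stand-in because it ships with Ruby. I'm assuming `XPath.first` plays the role of Hpricot's `%` and `XPath.match` the role of `/`. The `FEED_XML`, `items_nested` and `items_by_path` names are made up for this example.

```ruby
require 'rexml/document'

# Cut-down sample feed (hypothetical data) with four items.
FEED_XML = <<~XML
  <rss>
    <channel>
      <item><title>Book1</title><author>Bob</author></item>
      <item><title>book2</title><author>Bill</author></item>
      <item><title>book3</title><author>Steve</author></item>
      <item><title>Book4</title><author>Bob</author></item>
    </channel>
  </rss>
XML

# Step by step: first rss, then its first channel, then all of its items.
def items_nested(xml)
  doc = REXML::Document.new(xml)
  rss = REXML::XPath.first(doc, '/rss')
  channel = REXML::XPath.first(rss, 'channel')
  REXML::XPath.match(channel, 'item')
end

# One path expression that walks the same route in a single search.
def items_by_path(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//rss/channel/item')
end

puts items_nested(FEED_XML).size
puts items_by_path(FEED_XML).size
```

Both calls should report the same count for a feed with a single rss and a single channel. With more than one channel, the nested form would only look inside the first one, which is another reason to prefer the explicit path.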
