XML searching line by line

Question

I have an XML doc with the following format:

I want to search for a string in the XML, but a match might span multiple line tags, multiple blocks, and/or multiple page tags:

I need to search for "Hello World What's Up?" and know that it's on line 1 of column 1, line 1 of column 2, and lines 1-2 of block 3 (page 3 block 1).

I have metadata on the lines to tell me what line number it is, along with what column number it belongs to, for example:

World

What would be the best way to search for that term across different columns, and be able to know the details of what lines and columns they belong to?

I can get all instances of the first word, iterate over each one, and check whether the following words match the rest of the search words (word by word). If there aren't any more words in that line, go to the next line. If there aren't any more lines, go to the next block. Thoughts?
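A minimal sketch of that walk-forward idea, assuming the lines have already been pulled out of the XML into plain records in reading order. The Line struct, its page/block/number fields, and the find_phrase helper are hypothetical names for illustration, not part of the actual document format:

```ruby
# Hypothetical stand-in for the parsed XML: one record per line element.
Line = Struct.new(:page, :block, :number, :text)

# Returns every run of lines whose words spell out the phrase, comparing
# words case-insensitively. Each match is the array of lines it touches.
def find_phrase(lines, phrase)
  target = phrase.split.map(&:upcase)
  return [] if target.empty?

  # Flatten the document into [word, line] pairs in reading order, so
  # "go to the next line / next block" is just "go to the next pair".
  pairs = lines.flat_map { |line| line.text.split.map { |w| [w.upcase, line] } }

  matches = []
  pairs.each_index do |start|
    next unless pairs[start][0] == target[0]    # candidate first word
    window = pairs[start, target.size]
    next unless window.map(&:first) == target   # do the remaining words follow?
    matches << window.map(&:last).uniq          # the lines this match spans
  end
  matches
end

lines = [
  Line.new(1, 1, 1, "Hello World What's Up?"),
  Line.new(3, 1, 1, "Hello World"),
  Line.new(3, 1, 2, "What's Up? Goodbye")
]

find_phrase(lines, "Hello World What's Up?").each do |hit|
  puts hit.map { |l| "page #{l.page} block #{l.block} line #{l.number}" }.join(", ")
end
```

Because the words are flattened first, crossing a line or block boundary needs no special handling. Punctuation stays attached to its word here, so "Up?" only matches "Up?".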

Here's a real snippet of the XML, and what the script is returning:


  
    
      
        (12) United States Patent
      
    
    
      
        Kar-Roy et al.
      
    
  


  
    
      
        US007078310B1
      
    
  


  
    
      
        (io) Patent No.: US 7,078,310 B1
      
    
    
      
        (45) Date of Patent: Jul. 18,2006
      
    
  


  
    
      
        (54) METHOD FOR FABRICATING A HIGH
      
      
        DENSITY COMPOSITE MIM CAPACITOR

When I search for "METHOD FOR FABRICATING A HIGH", map{|f| f.text} returns:

["Kar-Roy et al.", "US007078310B1", "(io) Patent No.: US 7,078,310 B1", "(45) Date of Patent: Jul. 18,2006", "(54) METHOD FOR FABRICATING A HIGH"]

It looks like it's taking the five-word length of the search and, for some reason, returning the four lines before the actual match along with it.

Robert Nubel · Accepted Answer

Here's my thought: first, parse your document with an XML parser like Nokogiri, and then use an XPath search to extract all the line elements. Then, break each element into the words contained in that node, so a phrase can match only part of a node. Then, keep the words in document order, use each_cons(n) (where n is the number of words you're searching for) to look at every consecutive run of n words, and return the first run whose words match your search words. Here's my code to do so:

require 'nokogiri'

xml = Nokogiri::XML.parse(doc) # doc is the XML source (a String or IO)

search = "HIGH DENSITY"

# 1. break down all the lines into words tagged with their nodes
# 2. find matching subsequence
# 3. build up from nodes

nodes = xml.xpath('//line')
words = nodes.map do |n|
  words_in_node = n.text.split(' ').map(&:upcase) # split into words and normalize
  words_in_node.map { |word| { word: word, node: n } }
end
words = words.flatten # at this point we have a single, ordered list like [ {word: "foo", node: ...}, {word: "bar", node: ...} ]

keywords = search.split(' ').map(&:upcase)
result = words.each_cons(keywords.size).find do |sample|
  # Extract just the :word key from each hash, then compare to our search string
  sample_words = sample.map { |w| w[:word] }
  sample_words == keywords
end

if result
  puts "Found in these nodes:"
  puts result.map { |w| w[:node] }.uniq.inspect
  # you can find where each node was located via Nokogiri
else
  puts "No match"
end

Which produces:

Found in these nodes:
[#]>,
 #]>]
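The code above stops at the first match because it uses find. A sketch of one way to collect every occurrence along with its location, assuming each line carries its column and line number as metadata. Plain hashes stand in for the Nokogiri nodes so the example runs on its own, and the :column and :line keys are hypothetical names for that metadata:

```ruby
# Stand-ins for parsed line nodes; with Nokogiri these would be the
# elements returned by xml.xpath('//line'). The :column and :line keys
# are hypothetical names for the metadata attributes.
nodes = [
  { column: 1, line: 1, text: "(54) METHOD FOR FABRICATING A HIGH" },
  { column: 1, line: 2, text: "DENSITY COMPOSITE MIM CAPACITOR" },
  { column: 2, line: 1, text: "a high density capacitor" }
]

# Same shape as the list built above: one { word:, node: } hash per word.
words = nodes.flat_map do |n|
  n[:text].split.map { |w| { word: w.upcase, node: n } }
end

keywords = "HIGH DENSITY".split.map(&:upcase)

# select keeps every matching window, where find stops at the first one.
results = words.each_cons(keywords.size).select do |sample|
  sample.map { |w| w[:word] } == keywords
end

# For each match, list the [column, line] pair of every node it touches.
locations = results.map do |sample|
  sample.map { |w| w[:node] }.uniq.map { |n| [n[:column], n[:line]] }
end

locations.each { |loc| puts loc.inspect }
```

With real Nokogiri nodes, the metadata would be read from the element's attributes (for example node['...'] with whatever attribute names the document uses) instead of hash keys.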
