Reputation: 568
I have an XML doc with the following format:
<document>
<page>
<column>
<text>
<par>
<line></line>
</par>
</text>
</column>
</page>
</document>
I want to search for a string in the XML, but it might span multiple line tags, multiple column tags, and/or multiple page tags:
<document>
<page>
<column>
<text>
<par>
<line>Hello</line>
</par>
</text>
</column>
<column>
<text>
<par>
<line>World</line>
</par>
</text>
</column>
</page>
<page>
<column>
<text>
<par>
<line>What's</line>
<line>Up?</line>
</par>
</text>
</column>
</page>
</document>
I need to search for "Hello World What's Up?" and know that it's on line 1 of column 1, line 1 of column 2, and lines 1-2 of column 3 (page 2, column 1).
I have metadata on the lines to tell me what line number it is, along with what column number it belongs to, for example:
<line linenum="1" columnnum="2">World</line>
What would be the best way to search for that term across different columns, and be able to know the details of what lines and columns they belong to?
I can get all instances of the first word, iterate over each one, and check whether the following words match the search words (word by word). If there aren't any more words in that line, go to the next line. If there aren't any more lines, go to the next block. Thoughts?
Here's a real snippet of an example XML code, and what the script is returning:
<block>
<text>
<par>
<line colnum="1" linenum="1">
(12) United States Patent
</line>
</par>
<par>
<line colnum="1" linenum="2">
Kar-Roy et al.
</line>
</par>
</text>
</block>
<block>
<text>
<par>
<line colnum="2" linenum="3">
US007078310B1
</line>
</par>
</text>
</block>
<block>
<text>
<par>
<line colnum="3" linenum="4">
(io) Patent No.: US 7,078,310 B1
</line>
</par>
<par>
<line colnum="3" linenum="5">
(45) Date of Patent: Jul. 18,2006
</line>
</par>
</text>
</block>
<block>
<text>
<par>
<line>
(54) METHOD FOR FABRICATING A HIGH
</line>
<line>
DENSITY COMPOSITE MIM CAPACITOR
</line>
</par>
</text>
</block>
When I search for "METHOD FOR FABRICATING A HIGH", map{|f| f.text} returns:
["Kar-Roy et al.", "US007078310B1", "(io) Patent No.: US 7,078,310 B1", "(45) Date of Patent: Jul. 18,2006", "(54) METHOD FOR FABRICATING A HIGH"]
It looks like it's taking the five-word length, and getting the four lines before the actual result for some reason.
Upvotes: 4
Views: 1587
Reputation: 7522
Here's my thought: first, parse your document with an XML parser such as Nokogiri, and then use an XPath search to extract all the line elements. Then split each element's text into words, tagging each word with its node, so a phrase can match only part of a node. Finally, flatten the words into one ordered list, use each_cons(n) (where n is the number of words you're searching for) to look at every run of n consecutive words, and return the first run that equals your search words. Here's my code to do so:
require 'nokogiri'

# doc is assumed to hold the XML from the question as a String
xml = Nokogiri::XML.parse(doc)
search = "HIGH DENSITY"

# 1. break down all the lines into words tagged with their nodes
# 2. find matching subsequence
# 3. build up from nodes
nodes = xml.xpath('//line')
words = nodes.map do |n|
  words_in_node = n.text.split(' ').map(&:upcase) # split into words and normalize
  words_in_node.map { |word| { word: word, node: n } }
end
words = words.flatten # at this point we have a single, ordered list like [ {word: "foo", node: ...}, {word: "bar", node: ...} ]

keywords = search.split(' ').map(&:upcase)
result = words.each_cons(keywords.size).find do |sample|
  # Extract just the :word key from each hash, then compare to our search words
  sample_words = sample.map { |w| w[:word] }
  sample_words == keywords
end

if result
  puts "Found in these nodes:"
  puts result.map { |w| w[:node] }.uniq.inspect
  # you can find where each node was located via Nokogiri
else
  puts "No match"
end
Which produces:
Found in these nodes:
[#<Nokogiri::XML::Element:0x4ea323e name="line" children=[#<Nokogiri::XML::Text:0x4ea294c "\n (54) METHOD FOR FABRICATING A HIGH\n ">]>,
#<Nokogiri::XML::Element:0x4ea3018 name="line" children=[#<Nokogiri::XML::Text:0x4ea2654 "\n DENSITY COMPOSITE MIM CAPACITOR\n ">]>]
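Note that find stops at the first hit. If the phrase can occur more than once, swapping find for select collects every occurrence. The sketch below shows that idea with plain hashes standing in for the Nokogiri nodes so it runs on its own; SAMPLE_LINES, find_all_phrases, and the line/col values are made up for illustration (with Nokogiri you would read them from each node, e.g. n['linenum'] and n['colnum']).

```ruby
# Hypothetical stand-in for the <line> nodes: each entry carries the text plus
# the line/column numbers that would come from the node's attributes.
SAMPLE_LINES = [
  { line: 1, col: 1, text: 'METHOD FOR FABRICATING A HIGH' },
  { line: 2, col: 1, text: 'DENSITY COMPOSITE MIM CAPACITOR' },
  { line: 3, col: 2, text: 'A high density capacitor' }
].freeze

# Returns one entry per occurrence of the phrase; each entry is the list of
# unique [col, line] pairs that the occurrence touches, in document order.
def find_all_phrases(lines, phrase)
  keywords = phrase.split.map(&:upcase)
  return [] if keywords.empty? # each_cons(0) would raise ArgumentError

  # Flatten every line into words tagged with where they came from.
  words = lines.flat_map do |l|
    l[:text].split.map { |w| { word: w.upcase, line: l[:line], col: l[:col] } }
  end

  words.each_cons(keywords.size)
       .select { |sample| sample.map { |w| w[:word] } == keywords }
       .map { |sample| sample.map { |w| [w[:col], w[:line]] }.uniq }
end

find_all_phrases(SAMPLE_LINES, 'high density').each do |locations|
  puts locations.map { |col, line| "col #{col} line #{line}" }.join(', ')
end
```

With the sample data this prints one row for the match that spans lines 1 and 2 and another for the match inside line 3.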
Upvotes: 2
Reputation: 160551
If I understand what you want, I'd go about it like this:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<document>
<page>
<column>
<text>
<par>
<line linenum="1" columnnum="1">Hello</line>
</par>
</text>
</column>
<column>
<text>
<par>
<line linenum="1" columnnum="2">World</line>
</par>
</text>
</column>
</page>
<page>
<column>
<text>
<par>
<line linenum="1" columnnum="3">What's</line>
<line linenum="2" columnnum="3">Up?</line>
</par>
</text>
</column>
</page>
</document>
EOT
line_text = doc.search('column').map { |column|
  column.search('line').map { |line|
    {
      line: line['linenum'],
      column: line['columnnum'],
      text: line.text
    }
  }
}
At this point line_text contains:
line_text
# => [[{:line=>"1", :column=>"1", :text=>"Hello"}],
# [{:line=>"1", :column=>"2", :text=>"World"}],
# [{:line=>"1", :column=>"3", :text=>"What's"},
# {:line=>"2", :column=>"3", :text=>"Up?"}]]
This is grouping by <column>. The metadata isn't necessary, but it's convenient if it exists in the XML. If it doesn't, remove the lines to capture those parameters and only return the text:
line_text = doc.search('column').map { |column|
  column.search('line').map { |line|
    line.text
  }
}
line_text
# => [["Hello"], ["World"], ["What's", "Up?"]]
line_text is now an array of arrays. Each element in the outer array represents a column, and the elements inside each sub-array are that column's lines, so you could keep track of positions that way with a much smaller returned array and a bit of extra code:
line_text.each.with_index(1) do |column, column_num|
  column.each.with_index(1) do |text, line_num|
    puts "column: #{column_num} line: #{line_num} text: #{text}"
  end
end
# >> column: 1 line: 1 text: Hello
# >> column: 2 line: 1 text: World
# >> column: 3 line: 1 text: What's
# >> column: 3 line: 2 text: Up?
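To get from that structure to the phrase search in the question, one option is to tag every word with its column and line position and then look for the run of words that equals the phrase. This is only a sketch under the assumption that positions are simply the 1-based indexes into line_text; the locate helper is hypothetical, and the match is exact and case-sensitive.

```ruby
# Same shape as the line_text produced above: columns containing lines.
line_text = [["Hello"], ["World"], ["What's", "Up?"]]

# Returns a hash of column number => line numbers covered by the first
# occurrence of the phrase, or nil when the phrase is not found.
def locate(line_text, phrase)
  keywords = phrase.split
  return nil if keywords.empty? # each_cons(0) would raise ArgumentError

  # Tag each word with its 1-based column and line position.
  words = line_text.each_with_index.flat_map do |column, col_idx|
    column.each_with_index.flat_map do |text, line_idx|
      text.split.map { |w| { word: w, column: col_idx + 1, line: line_idx + 1 } }
    end
  end

  hit = words.each_cons(keywords.size).find do |sample|
    sample.map { |w| w[:word] } == keywords
  end
  return nil unless hit

  # Collapse the matched words into column => [line numbers].
  hit.group_by { |w| w[:column] }
     .transform_values { |ws| ws.map { |w| w[:line] }.uniq }
end

result = locate(line_text, "Hello World What's Up?")
result.each do |column, lines|
  puts "column #{column}: line(s) #{lines.join(', ')}"
end
# >> column 1: line(s) 1
# >> column 2: line(s) 1
# >> column 3: line(s) 1, 2
```

If the XML carries linenum and columnnum attributes, the same grouping works on the hash version of line_text by using those values instead of the indexes.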
Upvotes: 1