oort

Reputation: 1870

Exclude Nested Elements While Doing Tree Traversal

I am using Nokogiri to parse an XML file that has (roughly) the following structure:

<diag>
  <name>A00</name>
  <desc>Cholera</desc>
  <diag>
    <name>A00.0</name>
    <desc>Cholera due to Vibrio cholerae 01, biovar cholerae</desc>
  </diag>
  <diag>
    ...
  </diag>
  ...
</diag>

As you can see, diag nodes can be nested arbitrarily deep, and each nested node is a more specific description of its parent.

I want to "flatten" this tree so that, rather than having A00.0 nested within A00, I just have a list like this:

A00
A00.0
A00.1
...
A00.34
...
A01
...

What I have so far looks like this:

require 'nokogiri'
icd10 = File.new("icd10.xml", "r")
doc = Nokogiri::XML(icd10.read) do |config|
  config.strict.noblanks
end
icd10.close

@diags = {}
@diag_count = 0

def get_diags(node)
  node.children.each do |n|
    if n.name == "diag"
      @diags[@diag_count] = n
      @diag_count += 1
      get_diags(n)
    end
  end
end

# The xml file has sections but what I really want are the contents of the sections
doc.xpath('.//section').each do |n|
  get_diags(n)
end

This works in that I do get every diag element in the file. The problem is that each parent node still contains the content of the nodes nested inside it (e.g. @diags[0] contains the A00, A00.0, A00.1, etc. nodes, while @diags[1] contains just the A00.0 content).

How can I exclude nested elements from the parent element while traversing the XML content in get_diags? Thanks in advance!

== EDIT ==

I changed my get_diags method to this:

def get_diags(node)
  node.children.each do |n|
    if n.name == "diag"
      f = Nokogiri::XML.fragment(n.to_s)
      f.search('.//diag').children.each do |d|
        if d.name == "diag"
          d.remove
        end
      end
      @diags[@diag_count] = f
      @diag_count += 1
      get_diags(n)
    end
  end
end

Now @diags holds XML fragments from which all nested <diag>...</diag> elements have been removed. In one sense that is what I want, but it is very ugly, and I was wondering if anyone could share a better way to go about this. Thanks.

Upvotes: 0

Views: 202

Answers (1)

Wayne Conrad

Reputation: 108049

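The XPath '//diag' selects every <diag> node in document order, no matter how deeply nested. You can then extract the text of each node's own name and desc children. The relative paths 'name' and 'desc' match only direct children, so nothing from nested <diag> nodes is included: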

diags = doc.xpath('//diag').map do |diag|
  Hash[
    %w(name desc).map do |key|
      [key, diag.xpath(key).text]
    end
  ]
end
pp diags
# => [{"desc"=>"Cholera", "name"=>"A00"},
# =>  {"desc"=>"Cholera due to Vibrio cholerae 01, biovar cholerae",
# =>   "name"=>"A00.0"}]

If you wish to create a new XML tree with a different structure, I wouldn't bother trying to transform the original. Just take the extracted data and use it to create the new tree:

builder = Nokogiri::XML::Builder.new do |xml|
  xml.diagnoses do
    diags.each do |diag|
      xml.diag do
        # Pass the text as an argument; "xml.name = ..." would create
        # an element literally named "name=".
        xml.name diag['name']
        xml.desc diag['desc']
      end
    end
  end
end
puts builder.to_xml
# => <?xml version="1.0"?>
# => <diagnoses>
# =>   <diag>
# =>     <name>A00</name>
# =>     <desc>Cholera</desc>
# =>   </diag>
# =>   <diag>
# =>     <name>A00.0</name>
# =>     <desc>Cholera due to Vibrio cholerae 01, biovar cholerae</desc>
# =>   </diag>
# => </diagnoses>

Upvotes: 2
