abn

Reputation: 1373

Merge XML files wherever the attribute ID is the same (Python)

I have two XML files which I'm trying to merge.

XML1:

<hierachyAttributes>
    <attribute>
        <displayOrder>2</displayOrder>
        <attributeID>Demographics</attributeID>
        <children>
            <attribute>
                <displayOrder>1</displayOrder>
                <attributeID>age</attributeID>
        </children>
    </attribute>
</hierachyAttributes>

XML2:

<diseaseAttributes>
    <diseaseName>Cancer</diseaseName>
    <diseaseID>1322843</diseaseID>
    <metaAttributes>
        <attribute>
            <description>Age</description>
            <displayName>Age (years)</displayName>
            <attributeID>age</attributeID>
            <type>Double</type>
            <attributeCategory>Clinical</attributeCategory>
            <displayInSummary>TRUE</displayInSummary>
            <valueGroups>
                <group>
                    <displayOrder>1</displayOrder>
                    <displayName>0 - &lt; 10</displayName>
                    <minValue>0</minValue>
                    <minInclusive>TRUE</minInclusive>
                    <maxValue>10</maxValue>
                    <maxInclusive>FALSE</maxInclusive>
                </group>
            </valueGroups>
        </attribute>
    </metaAttributes>
</diseaseAttributes>

Is there a way to merge them like below even with different root tags, in this case hierachyAttributes and diseaseAttributes? CombinedXML:

<hierachyAttributes>
<diseaseAttributes>
    <diseaseName>Cancer</diseaseName>
    <diseaseID>1322843</diseaseID>
    <metaAttributes>
        <attribute>
        <displayOrder>2</displayOrder>
        <attributeID>Demographics</attributeID>
        <children>
            <attribute>
                <displayOrder>1</displayOrder>
                <attributeID>age</attributeID>
                <description>Age</description>
                <displayName>Age (years)</displayName>
                <type>Double</type>
                <attributeCategory>Clinical</attributeCategory>
                <displayInSummary>TRUE</displayInSummary>
                <valueGroups>
                    <group>
                        <displayOrder>1</displayOrder>
                        <displayName>0 - &lt; 10</displayName>
                        <minValue>0</minValue>
                        <minInclusive>TRUE</minInclusive>
                        <maxValue>10</maxValue>
                        <maxInclusive>FALSE</maxInclusive>
                    </group>
                </valueGroups>
            </attribute>
        </children>
        </attribute>
    </metaAttributes>
</diseaseAttributes>
</hierachyAttributes>

i.e., merge them wherever the attributeID is the same. I tried the following, but it just concatenated one XML file after the other.

#!/usr/bin/env python3
import sys
from xml.etree import ElementTree

def run(files):
    first = None
    for filename in files:
        data = ElementTree.parse(filename).getroot()
        if first is None:
            first = data
        else:
            # extend() only appends the children of each later root,
            # so the files end up concatenated rather than merged.
            first.extend(data)
    if first is not None:
        print(ElementTree.tostring(first, encoding="unicode"))

if __name__ == "__main__":
    run(sys.argv[1:])

Or, if the root tag is replaced and I want the same output but under one root node, i.e., diseaseAttributes, how can I achieve that?

Upvotes: 4

Views: 816

Answers (2)

spiralx

Reputation: 1065

I think what you want to do is best done by installing the lxml module using

pip install lxml

and using it for any XML-related code, as it's so much better to use than the built-in stuff. Have a look at the tutorial; there are plenty of ways to load, parse and process the attribute elements in each file all in one pass.

There's more useful information at

Python XML processing with lxml
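
As a rough sketch of that idea: index the detailed attribute elements from the second file by their attributeID text, then walk the hierarchy and copy the missing children across. This uses the standard-library xml.etree.ElementTree so it runs without extra installs; as far as I know the calls used here (fromstring, iter, findtext, append, tostring) behave the same way in lxml.etree. The two XML strings are trimmed, well-formed stand-ins for the files in the question, and merge_by_attribute_id is a made-up name, not something from either library. It only enriches the hierarchy tree; it does not wrap the result in diseaseAttributes.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-ins for the two files in the question.
HIERARCHY_XML = """
<hierachyAttributes>
    <attribute>
        <displayOrder>2</displayOrder>
        <attributeID>Demographics</attributeID>
        <children>
            <attribute>
                <displayOrder>1</displayOrder>
                <attributeID>age</attributeID>
            </attribute>
        </children>
    </attribute>
</hierachyAttributes>
"""

DETAILS_XML = """
<diseaseAttributes>
    <diseaseName>Cancer</diseaseName>
    <metaAttributes>
        <attribute>
            <description>Age</description>
            <attributeID>age</attributeID>
            <type>Double</type>
        </attribute>
    </metaAttributes>
</diseaseAttributes>
"""


def merge_by_attribute_id(hierarchy_root, details_root):
    """Copy detail children into hierarchy <attribute> nodes with the same attributeID."""
    # Index every detailed <attribute> by the text of its direct <attributeID> child.
    details = {}
    for attr in details_root.iter("attribute"):
        attr_id = attr.findtext("attributeID")
        if attr_id:
            details[attr_id.strip()] = attr

    # Snapshot the nodes first, because the tree is modified inside the loop.
    for attr in list(hierarchy_root.iter("attribute")):
        attr_id = attr.findtext("attributeID")
        source = details.get(attr_id.strip()) if attr_id else None
        if source is None:
            continue  # no matching definition in the second file
        existing = {child.tag for child in attr}
        for child in source:
            # Skip tags the hierarchy node already has (e.g. attributeID).
            if child.tag not in existing:
                attr.append(copy.deepcopy(child))
    return hierarchy_root


merged = merge_by_attribute_id(ET.fromstring(HIERARCHY_XML), ET.fromstring(DETAILS_XML))
print(ET.tostring(merged, encoding="unicode"))
```

The deepcopy is there so the second tree is left untouched; with lxml, appending an element that already lives in another tree moves it instead of copying it.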

Upvotes: 0

spiralx

Reputation: 1065

Your first XML file is missing a closing </attribute> tag under <children>. Both files are also just awful in terms of structure: ridiculously verbose and confusingly named, to the point that I don't think I can tell what you're trying to do.

The first file looks as though it's just expressing a tree of relationships between "attributes". It's the second that I don't get: it appears to contain the definition of an attribute "Age" with its name and data type, but it's nested underneath "Cancer". Why? My guess is that you're going to display results broken down by age, but why is Age tied to Cancer? What happens if you have Age data for, e.g., winter deaths from influenza? Does that have its own unique Age attribute?

Actually, my first question... would this be how XML2 should work:

<disease-definitions>
  <disease-definition id="1322843">
    <name>Cancer</name>

    <attribute-definitions>
      <attribute id="age" category="Clinical">
        <description>Age</description>
        <displayName>Age (years)</displayName>
        <type>Double</type>

        <attribute-summary displayed="true">
          <group>
            <displayName>&lt; 10</displayName>
            <range type="half-open">
              <min>0</min>
              <max>10</max>
            </range>
          </group>
          <group>
            <displayName>10 - 20</displayName>
            <range type="half-open">
              <min>10</min>
              <max>20</max>
            </range>
          </group>
        </attribute-summary>
      </attribute>
    </attribute-definitions>
  </disease-definition>

  <disease-definition id="1322844">
    <name>Influenza</name>

    <attribute-definitions>
      <attribute id="age" category="Clinical">
        <description>Age</description>
        <displayName>Age (years)</displayName>
        <type>Double</type>

        <attribute-summary displayed="true">
          <group>
            <displayName>Children</displayName>
            <range type="half-open">
              <min>0</min>
              <max>18</max>
            </range>
          </group>
          <group>
            <displayName>Adults</displayName>
            <range type="half-open">
              <min>18</min>
              <max>60</max>
            </range>
          </group>
          <group>
            <displayName>Elderly</displayName>
            <range type="half-open">
              <min>60</min>
            </range>
          </group>
        </attribute-summary>
      </attribute>
    </attribute-definitions>
  </disease-definition>
</disease-definitions>

Because that seems to be what you're implying, as horrible as it is even when I trim it down a bit. And I'm not sure how the hierarchical info fits in there.

Are attributes and their hierarchy just about displaying data? Even then, this seems better:

<attribute id="demographics">
  <title>Demographics</title>
  <children>
    <child id="age" />
    <child id="gender" />
  </children>
</attribute>

<attribute id="epidemiology">
  <title>Epidemiology</title>
  <children>
    <child id="reported-date" />
    <child id="variant-strains" />
  </children>
</attribute>

<attribute id="age">
  <title>Age</title>
  <description>Age in years</description>
  <category>Clinical</category>

  <data type="double">
    <min-value>0</min-value>
  </data>
</attribute>

<attribute id="gender">
  <title>Gender</title>

  <data type="options">
    <one-of>
      <option id="M">
        <title>Male</title>
      </option>
      <option id="F">
        <title>Female</title>
      </option>
    </one-of>
  </data>
</attribute>

and then

<disease-definitions>
  <disease id="1322843">
    <displayName>Cancer</displayName>

    <disease-attributes>
      <attribute ref-id="age">
        <displayName>Age of death</displayName>

        <displayed-in-summary>true</displayed-in-summary>
        <display-data format="histogram">
          <range max="10">Up to 10</range>
          <range min="10" max="25">Teenagers &amp; young adults</range>
          <range min="25" max="55">Adults</range>
          <range min="55">Elderly</range>
        </display-data>
      </attribute>

      <attribute ref-id="gender">
        <displayName>Gender of death</displayName>

        <displayed-in-summary>true</displayed-in-summary>
        <display-data format="pie">
          <slice option-id="M" background="#44F">Male deaths</slice>
          <slice option-id="F" background="#F44">Female deaths</slice>
        </display-data>
      </attribute>
    </disease-attributes>
  </disease>

  <disease id="1322844">
    <displayName>Influenza</displayName>

    <disease-attributes>
      <attribute ref-id="age">
        <displayName>Age of death</displayName>

        <displayed-in-summary>true</displayed-in-summary>
        <display-data format="grouped">
          <range max="10">Up to 10</range>
          <range min="10" max="25">Teenagers &amp; young adults</range>
          <range min="25" max="55">Adults</range>
          <range min="55">Elderly</range>
        </display-data>
      </attribute>
    </disease-attributes>
  </disease>

</disease-definitions>
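
With a layout like that, "merging" becomes a plain lookup: build a dictionary of attribute definitions keyed by id, then follow each ref-id from the disease file. A minimal sketch, assuming the standalone attribute definitions are wrapped in a single root element (called attributes here, which is my own placeholder since the fragments above have no root) and using a made-up helper name resolve_references:

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down documents following the layout proposed above.
# The <attributes> wrapper is an assumption; XML needs a single root element.
DEFINITIONS_XML = """
<attributes>
  <attribute id="age">
    <title>Age</title>
    <data type="double"/>
  </attribute>
  <attribute id="gender">
    <title>Gender</title>
    <data type="options"/>
  </attribute>
</attributes>
"""

DISEASES_XML = """
<disease-definitions>
  <disease id="1322843">
    <displayName>Cancer</displayName>
    <disease-attributes>
      <attribute ref-id="age">
        <displayName>Age of death</displayName>
      </attribute>
      <attribute ref-id="gender">
        <displayName>Gender of death</displayName>
      </attribute>
    </disease-attributes>
  </disease>
</disease-definitions>
"""


def resolve_references(definitions_root, diseases_root):
    """Map each disease name to [(display name, data type), ...] by following ref-id."""
    # Index the shared attribute definitions by their id attribute.
    by_id = {a.get("id"): a for a in definitions_root.iter("attribute")}
    resolved = {}
    for disease in diseases_root.iter("disease"):
        rows = []
        for ref in disease.iter("attribute"):
            definition = by_id.get(ref.get("ref-id"))
            if definition is None:
                continue  # dangling reference; a real tool might raise instead
            data = definition.find("data")
            data_type = data.get("type") if data is not None else None
            rows.append((ref.findtext("displayName"), data_type))
        resolved[disease.findtext("displayName")] = rows
    return resolved


result = resolve_references(ET.fromstring(DEFINITIONS_XML), ET.fromstring(DISEASES_XML))
print(result)
```

Each disease then only carries what is specific to it (labels, ranges, chart type), while the type information lives in one place.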

Upvotes: 3
