Alberto Valdivia

Reputation: 25

Change an XML structure to a new one with Python

I want to convert an XML document from its current structure to a different, standard structure that I have been given. I believe I can achieve that through the following steps:

  1. Get all the tags and their attributes, so I know what to modify, remove or add.
  2. Change the tag names (e.g. informaltable to table, or sect1 to section).
  3. Establish standard attributes for certain tags and keep them in a dictionary (e.g. every section, title and table tag must have these attributes: section:{"xmlns:xsi","id","type","xsi:noNamespaceSchemaLocation"}, title:{"id"}, table:{"frame","id"}).
  4. Give a random alphanumeric id to every tag that has an id attribute, and make sure no id is ever repeated (e.g. id=id-824fc56b-431b-4ad3-e933-f0fc222e50d3).
  5. Modify, add and remove attribute values for certain tags (e.g. change frame=all to frame=any, or delete the rowsep attribute from the colspec tag).
  6. Remove specific tags (e.g. remove the anchor tags together with all of their attributes), hopefully without affecting the rest of the hierarchy.
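
Steps 5 and 6 can be sketched with lxml as shown here. One detail matters for step 6: lxml stores the text that follows an element as that element's tail, so removing the anchors with parent.remove() would also delete the title text "Abbreviations of Terms". `etree.strip_elements(..., with_tail=False)` removes the elements but leaves that text in place. The `SAMPLE` string is a hypothetical, trimmed-down stand-in for the real document.

```python
from lxml import etree

# Hypothetical, trimmed-down stand-in for the real document.
SAMPLE = (
    '<section>'
    '<title><anchor id="a1"/><anchor id="a2"/>Abbreviations of Terms</title>'
    '<table frame="all"><tgroup cols="1">'
    '<colspec colname="column-0" rowsep="1"/>'
    '</tgroup></table>'
    '</section>'
)
root = etree.fromstring(SAMPLE)

# Step 6: drop every <anchor> element (and its attributes).
# with_tail=False keeps the text that follows each removed element.
etree.strip_elements(root, "anchor", with_tail=False)

# Step 5: change one attribute value ...
for table in root.iter("table"):
    if table.get("frame") == "all":
        table.set("frame", "any")

# ... and delete another attribute if it is present.
for colspec in root.iter("colspec"):
    colspec.attrib.pop("rowsep", None)

print(etree.tostring(root, encoding="unicode"))
```

Removing an element this way does not disturb its siblings or ancestors, so the rest of the hierarchy stays as it was.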

I have this XML example:

<section xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" id="id-c3ee53e4-e2ef-441b-8f3b-7320c4e32ef8" type="policy" xsi:noNamespaceSchemaLocation="urn:fontoxml:cpa.xsd:5.0">
  <title id="id-f0497441-5ecb-47ee-b7c0-263832a9e402">
    <anchor id="_Toc493170182"/>
    <anchor id="__RefHeading___Toc3574_3674829928"/>
    <anchor id="_Toc72503731"/>
    <anchor id="_Toc69390724"/>
    <anchor id="_Toc493496869"/>
    Abbreviations of Terms
  </title>
  <table frame="all" id="id-6837f232-02e3-4e7a-ce8d-cb2df48256ac">
    <tgroup cols="2" id="id-437c0d54-7257-4d34-a73d-351d533f0460">
      <colspec colname="column-0" colnum="1" colsep="1" rowsep="1" colwidth="0.2*" id="id-c87e1040-c2d7-4b15-fb0c-86557d201235" />
      <colspec colname="column-1" colnum="2" colsep="1" rowsep="1" colwidth="0.8*" id="id-5bebcf85-440b-416e-b2f9-72e47d5bb4f7" />
      <thead id="id-ff67f8a7-5baf-4a42-ac31-09c0f99cceed">
        <row id="id-542df999-7736-4cc2-e725-1b7b106e08d6">
          <entry rowsep="1" colsep="1" colname="column-0" id="id-54a7d605-21ff-44db-c1f6-03111db180c7">
            <para id="id-f43f7fb1-cd40-4b4a-88f2-02e55e786a5e">
              <emphasis style="bold">Abbreviation
              </emphasis>
            </para>
          </entry>
          <entry rowsep="1" colsep="1" colname="column-1" id="id-aecec4c6-f85b-490e-9b72-99c6764b49cf">
            <para id="id-4d89100a-4e4c-419a-d081-f776bcf9083e">
              <emphasis style="bold">Definition
              </emphasis>
            </para>
          </entry>
        </row>
      </thead>
      <tbody id="id-824fc56b-431b-4ad3-e933-f0fc222e50d3">
        <row id="id-620a8ff6-0189-41c7-e9af-dc9498ce703e">
          <entry rowsep="1" colsep="1" colname="column-0" id="id-fb941cc0-287d-4760-a5a0-87419fa66d68">
            <para id="id-127a8a37-9705-496b-87ee-303bcfd52a25">A/C</para>
          </entry>
          <entry rowsep="1" colsep="1" colname="column-1" id="id-317ad682-6e02-43c3-b724-5d50683c8f79">
            <para id="id-c7c2fac5-f286-4802-b8d6-2e54fa2cad3c">AirCraft</para>
          </entry>
        </row>
      </tbody>
    </tgroup>
  </table>
</section>

And this is the code that I have so far:

from lxml import etree
import uuid

XSI = "http://www.w3.org/2001/XMLSchema-instance"

def random_id():
    # uuid4 values are random; a collision is extremely unlikely
    return f"id-{uuid.uuid4()}"

# Parse the XML file
tree = etree.parse("InitialFile")
root = tree.getroot()

# Get the unique tag names (comments and processing instructions have a non-string tag)
tags = sorted({element.tag for element in root.iter() if isinstance(element.tag, str)})

# Show the unique tag[attributes] pairs, using the first element found for each tag
for tag in tags:
    print(tag, next(root.iter(tag)).attrib.keys())

# Change the tag names to the required names (root.iter() also covers the root element)
for p in list(root.iter("sect1")):
    p.tag = "section"
for p in list(root.iter("informaltable")):
    p.tag = "table"

# Give the section, title and table tags their standard attributes
for cy in root.iter("section"):
    # "xmlns:xsi" is a namespace declaration, not an attribute, so it cannot be set through attrib.
    # Setting a namespaced attribute in {namespace}name form makes lxml declare the namespace itself.
    cy.set(f"{{{XSI}}}noNamespaceSchemaLocation", "urn:fontoxml:cpa.xsd:1.0")
    cy.set("id", random_id())
    cy.set("type", "policy")

for t in root.iter("title"):
    t.set("id", random_id())

for p in root.iter("table"):
    p.set("id", random_id())

# Remove the rowsep attribute from every colspec
for ct in root.iter("colspec"):
    ct.attrib.pop("rowsep", None)

# Print the new XML to make sure it worked
print(etree.tostring(root).decode())

tree.write("Final file.xml", encoding="utf-8", xml_declaration=True)
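
Steps 3 and 4 could also be driven from a single dictionary, which keeps the per-tag rules in one place. The following is only a sketch under some assumptions: the `REQUIRED` mapping, the `new_id` and `apply_required` helpers and the inline sample document are hypothetical names made up for illustration, and the default values are taken from the question. Ids are built from `uuid.uuid4()` and additionally checked against a set, so a repeat within one run is ruled out.

```python
import uuid
from lxml import etree

XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Hypothetical rule table: tag -> {attribute: value}; None means "generate a fresh id".
REQUIRED = {
    "section": {
        "id": None,
        "type": "policy",
        # Namespaced attributes are written as {namespace-uri}local-name.
        f"{{{XSI}}}noNamespaceSchemaLocation": "urn:fontoxml:cpa.xsd:1.0",
    },
    "title": {"id": None},
    "table": {"id": None, "frame": "any"},
}


def new_id(seen):
    """Return an 'id-<uuid4>' string that is not already in `seen`."""
    while True:
        candidate = f"id-{uuid.uuid4()}"
        if candidate not in seen:
            seen.add(candidate)
            return candidate


def apply_required(root):
    """Set the attributes listed in REQUIRED on every matching element."""
    seen = set()
    for element in root.iter(*REQUIRED):
        for name, value in REQUIRED[element.tag].items():
            element.set(name, new_id(seen) if value is None else value)
    return seen


# Hypothetical miniature document, used only to exercise the helpers.
root = etree.fromstring("<section><title>T</title><table/><table/></section>")
ids = apply_required(root)
print(etree.tostring(root, pretty_print=True, encoding="unicode"))
```

Adding a rule for another tag then only means adding one entry to the dictionary.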

If you have any other ideas please feel free to share.

Upvotes: 0

Views: 120

Answers (1)

Martin Honnen

Reputation: 167446

I agree that this is a task for XSLT, which lxml supports. Here is an example stylesheet that implements some of your requirements in a modular way, delegating each change to a template of its own:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    version="1.0">

  <xsl:output method="xml"/>

  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="sect1">
      <section>
          <xsl:apply-templates select="@* | node()"/>
      </section>
  </xsl:template>
  
  <xsl:template match="informaltable">
      <table>
          <xsl:apply-templates select="@* | node()"/>
      </table>
  </xsl:template>
  
  <xsl:template match="@id">
      <xsl:attribute name="{name()}">
          <xsl:value-of select="generate-id()"/>
      </xsl:attribute>
  </xsl:template>
  
  <xsl:template match="@xsi:noNamespaceSchemaLocation">
      <xsl:attribute name="{name()}" namespace="{namespace-uri()}">urn:fontoxml:cpa.xsd:1.0</xsl:attribute>
  </xsl:template>
  
  <xsl:template match="colspec/@rowsep"/>

</xsl:stylesheet>

https://xsltfiddle.liberty-development.net/bET2rXs

I hope that, with this as a starting point and any XSLT tutorial or introduction, you can work out the rest.
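
For completeness, here is a minimal sketch of running such a stylesheet from Python with lxml's `etree.XSLT`. The embedded stylesheet is a shortened version of the one above (identity copy, the sect1 rename and the rowsep removal), and the `STYLESHEET` and `SOURCE` names and the one-line input are made up for the illustration. Note that `generate-id()` in the full stylesheet yields ids that are unique within the document but in a processor-specific format, so the `id-<uuid>` style ids would likely still need a Python step.

```python
from lxml import etree

# Shortened stylesheet: identity copy plus two of the rules from the answer.
STYLESHEET = b"""<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="sect1">
    <section>
      <xsl:apply-templates select="@* | node()"/>
    </section>
  </xsl:template>
  <xsl:template match="colspec/@rowsep"/>
</xsl:stylesheet>"""

# Hypothetical one-line input document.
SOURCE = b'<sect1 id="x"><colspec colname="column-0" rowsep="1"/></sect1>'

# Compile the stylesheet once; the resulting object is callable.
transform = etree.XSLT(etree.fromstring(STYLESHEET))

# Apply it to a parsed document; the result behaves like an ElementTree.
result = transform(etree.fromstring(SOURCE))
print(str(result))
```

With files on disk, `etree.parse(path)` can be used for both the stylesheet and the input, and the result can be saved with `result.write(path)`.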

Upvotes: 1
