Tim
Tim

Reputation: 105

Converting complex XML to CSV using Python or XSLT

Using Python or XSLT, I would like to know how to convert highly complex, hierarchical nested XML file to CSV including all the sub-elements and without hard coding as few element nodes as possible or is rational/effective?

Please find attached simplified XML example and the output CSV to get a better understanding of what I’m trying to achieve.

The actual XML file has much more elements but the data hierarchy and the nesting is like in the example. <InvoiceRow> element and its sub-elements are the only repeating elements in the XML file, all the other elements are static that are repeated in the output CSV as many times as there are <InvoiceRow> elements in the XML file.

It’s the repeating <InvoiceRow> element that is causing trouble for me. Elements that don’t repeat are easy to convert to CSV without hard coding any elements.

Complex XML scenarios, with hierarchical data structures and multiple one-to-many relationships all being stored in a single XML file. Structured text file.

Example XML input:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<Invoice>
    <SellerDetails>
        <Identifier>1234-1</Identifier>
        <SellerAddress>
            <SellerStreet>Street1</SellerStreet>
            <SellerTown>Town1</SellerTown>
        </SellerAddress>
    </SellerDetails>
    <BuyerDetails>
        <BuyerIdentifier>1234-2</BuyerIdentifier>
        <BuyerAddress>
            <BuyerStreet>Street2</BuyerStreet>
            <BuyerTown>Town2</BuyerTown>
        </BuyerAddress>
    </BuyerDetails>
    <BuyerNumber>001234</BuyerNumber>
    <InvoiceDetails>
        <InvoiceNumber>0001</InvoiceNumber>
    </InvoiceDetails>
    <InvoiceRow>
        <ArticleName>Article1</ArticleName>
        <RowText>Product Text1</RowText>
        <RowText>Product Text2</RowText>
        <RowAmount AmountCurrencyIdentifier="EUR">10.00</RowAmount>
    </InvoiceRow>
    <InvoiceRow>
        <ArticleName>Article2</ArticleName>
        <RowText>Product Text11</RowText>
        <RowText>Product Text22</RowText>
        <RowAmount AmountCurrencyIdentifier="EUR">20.00</RowAmount>
    </InvoiceRow>
    <InvoiceRow>
        <ArticleName>Article3</ArticleName>
        <RowText>Product Text111</RowText>
        <RowText>Product Text222</RowText>
        <RowAmount AmountCurrencyIdentifier="EUR">30.00</RowAmount>
    </InvoiceRow>
    <EpiDetails>
        <EpiPartyDetails>
            <EpiBfiPartyDetails>
                <EpiBfiIdentifier IdentificationSchemeName="BIC">XXXXX</EpiBfiIdentifier>
            </EpiBfiPartyDetails>
        </EpiPartyDetails>
    </EpiDetails>
    <InvoiceUrlText>Some text</InvoiceUrlText>
</Invoice>


Example CSV output:

Identifier,SellerStreet,SellerTown,BuyerIdentifier,BuyerStreet,BuyerTown,BuyerNumber,InvoiceNumber,ArticleName,RowText,RowText,RowAmount,EpiBfiIdentifier,InvoiceUrlText
1234-1,Street1,Town1,1234-2,Street2,Town2,1234,1,Article1,Product Text1,Product Text2,10,XXXXX,Some text
1234-1,Street1,Town1,1234-2,Street2,Town2,1234,1,Article2,Product Text11,Product Text22,20,XXXXX,Some text
1234-1,Street1,Town1,1234-2,Street2,Town2,1234,1,Article3,Product Text111,Product Text222,30,XXXXX,Some text

Upvotes: 0

Views: 844

Answers (2)

panda912
panda912

Reputation: 366

I have done similar case like your requirements, I have created one package base on untangle, a package which can parse your XML to pure python objects like:

<?xml version="1.0"?>
<root>
    <child name="child1"/>
</root>

to

obj.root.child['name'] # u'child1'

then you can easily write some code to traverse the object to get what you want. For example, you can do something like get_items_by_tag(InvoiceRow). Hope it helps!

Upvotes: 0

michael.hor257k
michael.hor257k

Reputation: 116992

Consider the following example:

XML

<Invoice>
    <SellerDetails>
        <Identifier>1234-1</Identifier>
        <SellerAddress>
            <SellerStreet>Street1</SellerStreet>
            <SellerTown>Town1</SellerTown>
        </SellerAddress>
    </SellerDetails>
    <BuyerDetails>
        <BuyerIdentifier>1234-2</BuyerIdentifier>
        <BuyerAddress>
            <BuyerStreet>Street2</BuyerStreet>
            <BuyerTown>Town2</BuyerTown>
        </BuyerAddress>
    </BuyerDetails>
    <BuyerNumber>001234</BuyerNumber>
    <InvoiceDetails>
        <InvoiceNumber>0001</InvoiceNumber>
    </InvoiceDetails>
    <InvoiceRow>
        <ArticleName>Article1</ArticleName>
        <RowText>Product Text1</RowText>
        <RowText>Product Text2</RowText>
        <RowAmount AmountCurrencyIdentifier="EUR">10.00</RowAmount>
    </InvoiceRow>
    <InvoiceRow>
        <ArticleName>Article2</ArticleName>
        <RowText>Product Text11</RowText>
        <RowText>Product Text22</RowText>
        <RowAmount AmountCurrencyIdentifier="EUR">20.00</RowAmount>
    </InvoiceRow>
    <InvoiceRow>
        <ArticleName>Article3</ArticleName>
        <RowText>Product Text111</RowText>
        <RowText>Product Text222</RowText>
        <RowAmount AmountCurrencyIdentifier="EUR">30.00</RowAmount>
    </InvoiceRow>
    <EpiDetails>
        <EpiPartyDetails>
            <EpiBfiPartyDetails>
                <EpiBfiIdentifier IdentificationSchemeName="BIC">XXXXX</EpiBfiIdentifier>
            </EpiBfiPartyDetails>
        </EpiPartyDetails>
    </EpiDetails>
    <InvoiceUrlText>Some text</InvoiceUrlText>
</Invoice>

XSLT 1.0

<xsl:stylesheet version="1.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>

<xsl:template match="Invoice">
    <xsl:variable name="common-head">
        <xsl:value-of select="SellerDetails/Identifier"/>
        <xsl:text>,</xsl:text>
        <xsl:value-of select="BuyerDetails/BuyerIdentifier"/>
        <xsl:text>,</xsl:text>
        <xsl:value-of select="InvoiceDetails/InvoiceNumber"/>
        <xsl:text>,</xsl:text>
        <!-- add more here -->
    </xsl:variable>
    <xsl:variable name="common-tail">
        <xsl:value-of select="EpiDetails/EpiPartyDetails/EpiBfiPartyDetails/EpiBfiIdentifier"/>
        <xsl:text>,</xsl:text>
        <!-- add more here -->
        <xsl:value-of select="InvoiceUrlText"/>
    </xsl:variable>
    <!-- header -->
    <xsl:text>SellerIdentifier,BuyerIdentifier,InvoiceNumber,ArticleName,RowText,RowText,RowAmount,EpiBfiIdentifier,InvoiceUrlText&#10;</xsl:text>
    <!-- data -->
    <xsl:for-each select="InvoiceRow">
        <xsl:copy-of select="$common-head"/>
        <xsl:value-of select="ArticleName"/>
        <xsl:text>,</xsl:text>  
        <xsl:value-of select="RowAmount"/>
        <xsl:text>,</xsl:text>  
        <!-- add more here -->
        <xsl:copy-of select="$common-tail"/>
        <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
</xsl:template>

</xsl:stylesheet>

Result

SellerIdentifier,BuyerIdentifier,InvoiceNumber,ArticleName,RowText,RowText,RowAmount,EpiBfiIdentifier,InvoiceUrlText
1234-1,1234-2,0001,Article1,10.00,XXXXX,Some text
1234-1,1234-2,0001,Article2,20.00,XXXXX,Some text
1234-1,1234-2,0001,Article3,30.00,XXXXX,Some text

Added in response to:

Is there a way in XSLT to get the same results using loop? For example loop through and output all the elements and the sub-elements except the InvoiceRow elements and then vice versa?

If you prefer, you could try something like:

XSLT 1.0

<xsl:stylesheet version="1.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>

<xsl:template match="Invoice">
    <xsl:variable name="invoice-fields" select="//*[not(*) and not(ancestor::InvoiceRow)]" />
    <xsl:variable name="common-data">
        <xsl:for-each select="$invoice-fields">
            <xsl:value-of select="."/>
            <xsl:text>,</xsl:text>  
        </xsl:for-each> 
    </xsl:variable>
    <!-- header -->
    <xsl:for-each select="$invoice-fields">
        <xsl:value-of select="name()"/>
        <xsl:text>,</xsl:text>  
    </xsl:for-each>
    <xsl:for-each select="InvoiceRow[1]/*">
        <xsl:value-of select="name()"/>
        <xsl:if test="position()!=last()">,</xsl:if>
    </xsl:for-each>
    <xsl:text>&#10;</xsl:text>
    <!-- data -->
    <xsl:for-each select="InvoiceRow">
        <xsl:copy-of select="$common-data"/>
        <xsl:for-each select="*">
            <xsl:value-of select="."/>
            <xsl:if test="position()!=last()">,</xsl:if>
        </xsl:for-each> 
        <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
</xsl:template>

</xsl:stylesheet>

The result here would be:

Identifier,SellerStreet,SellerTown,BuyerIdentifier,BuyerStreet,BuyerTown,BuyerNumber,InvoiceNumber,EpiBfiIdentifier,InvoiceUrlText,ArticleName,RowText,RowText,RowAmount
1234-1,Street1,Town1,1234-2,Street2,Town2,001234,0001,XXXXX,Some text,Article1,Product Text1,Product Text2,10.00
1234-1,Street1,Town1,1234-2,Street2,Town2,001234,0001,XXXXX,Some text,Article2,Product Text11,Product Text22,20.00
1234-1,Street1,Town1,1234-2,Street2,Town2,001234,0001,XXXXX,Some text,Article3,Product Text111,Product Text222,30.00

i.e. listing all invoice fields before the row fields.

Upvotes: 2

Related Questions