BärenHund1

Reputation: 163

Transforming xml to json with python lxml

Basically, I want to transform an XML document to JSON using Python 3 and the lxml library. The important thing is that I want to preserve all text, tails, tags and the order of the XML. Below is an example of what my program should be able to do:

What I have

<root>
   <tag>
      Some tag-text<subtag>Some subtag-text</subtag> Some tail-text
   </tag>
</root>

What I want (python dict/json)

{
  "root":{
    "tag":[
        {"text":"Some tag-text"},
        {"subtag":{"text":"Some subtag-text"}},
        {"text":"Some tail-text"}
      ]
  }
}

This is just a very simplified example. The files I need to transform are way bigger and have more nestings.

Also, I can't use the xmltodict library for this, only lxml.

I'm almost 99% sure there is some elegant way to do this recursively, but so far I haven't been able to write a solution that works the way I want it to.

Thanks a lot for the help

EDIT: Why is this question not a duplicate of "Converting XML to JSON using Python?"

I understand there is no such thing as a one-to-one mapping from XML to JSON. I'm specifically asking for a way that preserves the text order like in the example above.

Also, using xmltodict doesn't achieve that goal. For example, transforming the XML from the example above with xmltodict results in the following structure:

root:
    tag:
        text: 'Some tag-text Some tail-text'
        subtag: 'Some subtag-text'

You can see that the tail part "Some tail-text" was concatenated with "Some tag-text", so its position relative to the subtag is lost.
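For comparison, lxml itself keeps these pieces apart: text before the first child is stored on the parent as `.text`, and text after a child's closing tag is stored on that child as `.tail`. A minimal sketch, assuming the example is written on one line so that no indentation whitespace gets in the way (the `try`/`except` fallback to the standard library's ElementTree, which exposes the same two attributes, is only there so the snippet runs without lxml installed):

```python
try:
    from lxml import etree  # the library the question is about
except ImportError:
    # Assumption: stdlib fallback with the same .text/.tail attributes.
    import xml.etree.ElementTree as etree

XML = "<root><tag>Some tag-text<subtag>Some subtag-text</subtag> Some tail-text</tag></root>"

root = etree.fromstring(XML)
tag = root[0]      # <tag>
subtag = tag[0]    # <subtag>

# Text before the first child belongs to the parent element.
print(repr(tag.text))     # 'Some tag-text'
# Text inside the child belongs to the child.
print(repr(subtag.text))  # 'Some subtag-text'
# Text after the child's closing tag is the child's tail.
print(repr(subtag.tail))  # ' Some tail-text'
```

Because the tail is attached to the child it follows, walking the children in order and emitting each tail right after its element is enough to reproduce the document order.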

thanks

Upvotes: 2

Views: 7683

Answers (2)

BärenHund1

Reputation: 163

Here's an alternative to Daniel Haley's XSLT solution, using plain recursion over the lxml tree:

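def recu(root):
    """Return {tag: content}, keeping text, children and tails in document order."""
    my = []
    # Text that precedes the first child element.
    if root.text:
        my.append({"text": root.text})
    for elem in root:
        my.append(recu(elem))
        # Text that follows this child's closing tag.
        if elem.tail:
            my.append({"text": elem.tail})
    # A single item is unwrapped instead of being returned as a one-element list.
    if len(my) == 1:
        my = my[0]
    return {root.tag: my}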
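Note that on indented input the function above also records the whitespace between tags as text entries. The following is a sketch of a variant that strips surrounding whitespace and drops whitespace-only pieces, applied to the XML from the question. The name `to_ordered` is made up for this illustration, skipping comments and processing instructions is an added assumption, and the ElementTree fallback is only there so the snippet runs without lxml installed:

```python
import json

try:
    from lxml import etree
except ImportError:
    # Assumption: stdlib fallback with the same .text/.tail attributes.
    import xml.etree.ElementTree as etree


def to_ordered(elem):
    """Hypothetical helper: like recu(), but ignores whitespace-only text."""
    parts = []
    text = (elem.text or "").strip()
    if text:
        parts.append({"text": text})
    for child in elem:
        # In lxml, comments and processing instructions have a non-string tag.
        if isinstance(child.tag, str):
            parts.append(to_ordered(child))
        tail = (child.tail or "").strip()
        if tail:
            parts.append({"text": tail})
    # Unwrap a single item, as in the original function.
    return {elem.tag: parts[0] if len(parts) == 1 else parts}


XML = """<root>
   <tag>
      Some tag-text<subtag>Some subtag-text</subtag> Some tail-text
   </tag>
</root>"""

result = to_ordered(etree.fromstring(XML))
print(json.dumps(result, indent=2))
```

For this input the result has the structure asked for in the question. Unlike `normalize-space()` in the XSLT answer, `strip()` only trims the ends and leaves runs of whitespace inside the text untouched.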

Upvotes: 1

Daniel Haley

Reputation: 52858

I think if you need to preserve document order (what you referenced as "text-order"), XSLT is a good option. XSLT can output plain text which can be loaded as json. Luckily lxml supports XSLT 1.0.

Example...

XML Input (input.xml)

<root>
    <tag>
        Some tag-text<subtag>Some subtag-text</subtag> Some tail-text
    </tag>
</root>

XSLT 1.0 (xml2json.xsl)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="*">
    <xsl:if test="position() != 1">, </xsl:if>
    <xsl:value-of select="concat('{&quot;',
      local-name(),
      '&quot;: ')"/>
    <xsl:choose>
      <xsl:when test="count(node()) > 1">
        <xsl:text>[</xsl:text>
        <xsl:apply-templates/>
        <xsl:text>]</xsl:text>
      </xsl:when>
      <xsl:otherwise>
        <xsl:apply-templates/>
      </xsl:otherwise>
    </xsl:choose>
    <xsl:text>}</xsl:text>
  </xsl:template>

  <xsl:template match="text()">
    <xsl:if test="position() != 1">, </xsl:if>
    <xsl:value-of select="concat('{&quot;text&quot;: &quot;', 
      normalize-space(), 
      '&quot;}')"/>
  </xsl:template>

</xsl:stylesheet>

Python

import json
from lxml import etree

tree = etree.parse("input.xml")

xslt_root = etree.parse("xml2json.xsl")
transform = etree.XSLT(xslt_root)

result = transform(tree)

json_load = json.loads(str(result))

json_dump = json.dumps(json_load, indent=2)

print(json_dump)

For informational purposes, the output of the xslt (result) is:

{"root": {"tag": [{"text": "Some tag-text"}, {"subtag": {"text": "Some subtag-text"}}, {"text": "Some tail-text"}]}}

The printed output from Python (after loads()/dumps()) is:

{
  "root": {
    "tag": [
      {
        "text": "Some tag-text"
      },
      {
        "subtag": {
          "text": "Some subtag-text"
        }
      },
      {
        "text": "Some tail-text"
      }
    ]
  }
}
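To try the transformation without creating input.xml and xml2json.xsl on disk, both documents can be held in strings. This is only a sketch and assumes lxml is installed; the names `XML_INPUT`, `XSLT_SOURCE` and `xml_to_dict` are made up for this illustration, while the stylesheet is the one shown above.

```python
import io
import json

from lxml import etree

XML_INPUT = """<root>
    <tag>
        Some tag-text<subtag>Some subtag-text</subtag> Some tail-text
    </tag>
</root>"""

XSLT_SOURCE = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="*">
    <xsl:if test="position() != 1">, </xsl:if>
    <xsl:value-of select="concat('{&quot;', local-name(), '&quot;: ')"/>
    <xsl:choose>
      <xsl:when test="count(node()) > 1">
        <xsl:text>[</xsl:text>
        <xsl:apply-templates/>
        <xsl:text>]</xsl:text>
      </xsl:when>
      <xsl:otherwise>
        <xsl:apply-templates/>
      </xsl:otherwise>
    </xsl:choose>
    <xsl:text>}</xsl:text>
  </xsl:template>

  <xsl:template match="text()">
    <xsl:if test="position() != 1">, </xsl:if>
    <xsl:value-of select="concat('{&quot;text&quot;: &quot;', normalize-space(), '&quot;}')"/>
  </xsl:template>
</xsl:stylesheet>"""

# Compile the stylesheet once; the resulting object can be reused for many inputs.
transform = etree.XSLT(etree.XML(XSLT_SOURCE))


def xml_to_dict(xml_text):
    """Hypothetical wrapper: run the stylesheet and parse its text output as JSON."""
    tree = etree.parse(io.StringIO(xml_text))
    return json.loads(str(transform(tree)))


converted = xml_to_dict(XML_INPUT)
print(json.dumps(converted, indent=2))
```

Compiling the stylesheet outside the function keeps repeated conversions cheap, since only the input document has to be parsed on each call.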

Upvotes: 4
