Reputation: 163
Basically, I want to transform an XML document to JSON using Python 3 and the lxml library. The important thing is that I want to preserve all text, tails, tags and the order of the XML. Below is an example of what my program should be able to do:
What I have
<root>
  <tag>
    Some tag-text<subtag>Some subtag-text</subtag> Some tail-text
  </tag>
</root>
What I want (python dict/json)
{
  "root": {
    "tag": [
      {"text": "Some tag-text"},
      {"subtag": {"text": "Some subtag-text"}},
      {"text": "Some tail-text"}
    ]
  }
}
This is just a very simplified example. The files I need to transform are way bigger and have more nestings.
Also, I can't use the xmltodict library for this, only lxml.
I'm almost 99% sure there is some elegant way to do this recursively, but so far I haven't been able to write a solution that works the way I want it to.
Thanks a lot for the help
EDIT: Why is this question not a duplicate of "Converting XML to JSON using Python?"
I understand there is no such thing as a one-to-one mapping from XML to JSON. I'm specifically asking for a way that preserves the text order, as in the example above.
Also, using xmltodict doesn't achieve that goal. For example, transforming the XML from the example above with xmltodict results in the following structure:
root:
  tag:
    text: 'Some tag-text Some tail-text'
    subtag: 'Some subtag-text'
You can see that the tail part "Some tail-text" was concatenated with "Some tag-text", so its position after the subtag is lost.
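For reference, here is a minimal sketch of how lxml itself keeps these pieces apart: the text before the first child lives in the parent's `.text`, and the text after a child's closing tag lives in that child's `.tail`. The compact one-line XML string and the variable names are just for illustration, and the fallback import is an assumption so the snippet also runs with the standard library's ElementTree, which exposes the same attributes.

```python
try:
    from lxml import etree  # the library this question is about
except ImportError:
    # Assumption: the stdlib ElementTree has the same .text/.tail attributes
    import xml.etree.ElementTree as etree

xml = "<root><tag>Some tag-text<subtag>Some subtag-text</subtag> Some tail-text</tag></root>"
root = etree.fromstring(xml)
tag = root[0]       # the <tag> element
subtag = tag[0]     # the <subtag> element

# Text before the first child, the child's own text, and the text after the child
pieces = [tag.text, subtag.text, subtag.tail]
print(pieces)
# ['Some tag-text', 'Some subtag-text', ' Some tail-text']
```

So the order is recoverable as long as each child's tail is emitted right after the child itself.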
thanks
Upvotes: 2
Views: 7683
Reputation: 163
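Here's an alternative to @Daniel Haley's solution:
def recu(root):
    # Collect the element's text, children and tails in document order
    my = []
    if root.text:
        my.append({"text": root.text})
    for elem in root:
        my.append(recu(elem))
        # The tail is the text that follows the child's closing tag
        if elem.tail:
            my.append({"text": elem.tail})
    # Unwrap a single entry so a leaf becomes {"tag": {"text": ...}}
    my = my[0] if len(my) == 1 else my
    return {root.tag: my}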
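One caveat: on the indented example from the question, the raw `.text` and `.tail` values also contain the newlines between tags, so whitespace-only entries would show up in the result. Below is a sketch of a variant that strips whitespace and skips empty pieces. The names `clean` and `to_dict` are made up for this example, and the fallback import is an assumption so it also runs with the standard library's ElementTree.

```python
import json

try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib ElementTree exposes the same API used here
    import xml.etree.ElementTree as etree


def clean(s):
    """Return s without surrounding whitespace, or '' if s is None."""
    return (s or "").strip()


def to_dict(elem):
    """Convert elem to {tag: content}, keeping text, children and tails in order."""
    parts = []
    if clean(elem.text):
        parts.append({"text": clean(elem.text)})
    for child in elem:
        parts.append(to_dict(child))
        # Emit the tail right after its child to preserve document order
        if clean(child.tail):
            parts.append({"text": clean(child.tail)})
    # A single entry is unwrapped, as in the function above
    return {elem.tag: parts[0] if len(parts) == 1 else parts}


xml = """<root>
<tag>
Some tag-text<subtag>Some subtag-text</subtag> Some tail-text
</tag>
</root>"""

result = to_dict(etree.fromstring(xml))
print(json.dumps(result, indent=2))
```

For the example input this should give the structure asked for in the question. Note that stripping also removes leading and trailing spaces inside real text, which may or may not be what you want.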
Upvotes: 1
Reputation: 52858
I think if you need to preserve document order (what you referred to as "text-order"), XSLT is a good option. XSLT can output plain text, which can then be loaded as JSON. Luckily, lxml supports XSLT 1.0.
Example...
XML Input (input.xml)
<root>
  <tag>
    Some tag-text<subtag>Some subtag-text</subtag> Some tail-text
  </tag>
</root>
XSLT 1.0 (xml2json.xsl)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="*">
    <xsl:if test="position() != 1">, </xsl:if>
    <xsl:value-of select="concat('{&quot;',
      local-name(),
      '&quot;: ')"/>
    <xsl:choose>
      <xsl:when test="count(node()) &gt; 1">
        <xsl:text>[</xsl:text>
        <xsl:apply-templates/>
        <xsl:text>]</xsl:text>
      </xsl:when>
      <xsl:otherwise>
        <xsl:apply-templates/>
      </xsl:otherwise>
    </xsl:choose>
    <xsl:text>}</xsl:text>
  </xsl:template>
  <xsl:template match="text()">
    <xsl:if test="position() != 1">, </xsl:if>
    <xsl:value-of select="concat('{&quot;text&quot;: &quot;',
      normalize-space(),
      '&quot;}')"/>
  </xsl:template>
</xsl:stylesheet>
Python
import json
from lxml import etree
tree = etree.parse("input.xml")
xslt_root = etree.parse("xml2json.xsl")
transform = etree.XSLT(xslt_root)
result = transform(tree)
json_load = json.loads(str(result))
json_dump = json.dumps(json_load, indent=2)
print(json_dump)
For informational purposes, the output of the XSLT (result) is:
{"root": {"tag": [{"text": "Some tag-text"}, {"subtag": {"text": "Some subtag-text"}}, {"text": "Some tail-text"}]}}
The printed output from Python (after loads()/dumps()) is:
{
  "root": {
    "tag": [
      {
        "text": "Some tag-text"
      },
      {
        "subtag": {
          "text": "Some subtag-text"
        }
      },
      {
        "text": "Some tail-text"
      }
    ]
  }
}
Upvotes: 4