Reputation: 339
Lets assume we have an arbitrary XML document like below
<?xml version="1.0" encoding="UTF-8"?>
<programs xmlns="http://something.org/schema/s/program">
<program xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://something.org/schema/s/program http://something.org/schema/s/program.xsd">
<orgUnitId>Organization 1</orgUnitId>
<requiredLevel>academic bachelor</requiredLevel>
<requiredLevel>academic master</requiredLevel>
<programDescriptionText xml:lang="nl">Here is some text; blablabla</programDescriptionText>
<searchword xml:lang="nl">Scrum master</searchword>
</program>
<program xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://something.org/schema/s/program http://something.org/schema/s/program.xsd">
<requiredLevel>bachelor</requiredLevel>
<requiredLevel>academic master</requiredLevel>
<requiredLevel>academic bachelor</requiredLevel>
<orgUnitId>Organization 2</orgUnitId>
<programDescriptionText xml:lang="nl">Text from another organization about some stuff.</programDescriptionText>
<searchword xml:lang="nl">Excutives</searchword>
</program>
<program xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<orgUnitId>Organization 3</orgUnitId>
<programDescriptionText xml:lang="nl">Also another huge text description from another organization.</programDescriptionText>
<searchword xml:lang="nl">Negotiating</searchword>
<searchword xml:lang="nl">Effective leadership</searchword>
<searchword xml:lang="nl">negotiating techniques</searchword>
<searchword xml:lang="nl">leadership</searchword>
<searchword xml:lang="nl">strategic planning</searchword>
</program>
</programs>
Currently I'm looping
over the elements I need by using their absolute paths, since I'm not able to use any of the get
or find
methods in ElementTree. As such, my code looks like below:
import pandas as pd
import xml.etree.ElementTree as ET
import numpy as np
import itertools
tree = ET.parse('data.xml')
root = tree.getroot()
root.tag
dfcols=['organization','description','level','keyword']
organization=[]
description=[]
level=[]
keyword=[]
for node in root:
for child in
node.findall('.//{http://something.org/schema/s/program}orgUnitId'):
organization.append(child.text)
for child in node.findall('.//{http://something.org/schema/s/program}programDescriptionText'):
description.append(child.text)
for child in node.findall('.//{http://something.org/schema/s/program}requiredLevel'):
level.append(child.text)
for child in node.findall('.//{http://something.org/schema/s/program}searchword'):
keyword.append(child.text)
The goal, of course, is to create one dataframe. However, since each node in the XML file contains one or multiple elements, such as requiredLevel
or searchword
I'm currently losing data when I'm casting it to a dataframe by either:
df=pd.DataFrame(list(itertools.zip_longest(organization,
description,level,searchword,
fillvalue=np.nan)),columns=dfcols)
or using pd.Series
as given here or another solution which I don't seem to get it fit from here
My best bet is not to use Lists at all, since they don't seem to index the data correctly. That is, I lose data from the 2nd to Xth child node. But right now I'm stuck, and don't see any other options.
What my end result should look like is this:
organization description level keyword
Organization 1 .... academic bachelor, Scrum master
academic master
Organization 2 .... bachelor, Executives
academic master,
academic bachelor
Organization 3 .... Negotiating,
Effective leadership,
negotiating techniques,
....
Upvotes: 1
Views: 433
Reputation: 107642
Consider building a list of dictionaries with comma-collapsed text values. Then pass list into the pandas.DataFrame
constructor:
dicts = []
for node in root:
orgs = ", ".join([org.text for org in node.findall('.//{http://something.org/schema/s/program}orgUnitId')])
desc = ", ".join([desc.text for desc in node.findall('.//{http://something.org/schema/s/program}programDescriptionText')])
lvls = ", ".join([lvl.text for lvl in node.findall('.//{http://something.org/schema/s/program}requiredLevel')])
wrds = ", ".join([wrd.text for wrd in node.findall('.//{http://something.org/schema/s/program}searchword')])
dicts.append({'organization': orgs, 'description': desc, 'level': lvls, 'keyword': wrds})
final_df = pd.DataFrame(dicts, columns=['organization','description','level','keyword'])
Output
print(final_df)
# organization description level keyword
# 0 Organization 1 Here is some text; blablabla academic bachelor, academic master Scrum master
# 1 Organization 2 Text from another organization about some stuff. bachelor, academic master, academic bachelor Excutives
# 2 Organization 3 Also another huge text description from anothe... Negotiating, Effective leadership, negotiating...
Upvotes: 2
Reputation: 1286
A lightweight xml_to_dict
converter can be found here. It can be improved by this to handle namespaces.
def xml_to_dict(xml='', remove_namespace=True):
"""Converts an XML string into a dict
Args:
xml: The XML as string
remove_namespace: True (default) if namespaces are to be removed
Returns:
The XML string as dict
Examples:
>>> xml_to_dict('<text><para>hello world</para></text>')
{'text': {'para': 'hello world'}}
"""
def _xml_remove_namespace(buf):
# Reference: https://stackoverflow.com/a/25920989/1498199
it = ElementTree.iterparse(buf)
for _, el in it:
if '}' in el.tag:
el.tag = el.tag.split('}', 1)[1]
return it.root
def _xml_to_dict(t):
# Reference: https://stackoverflow.com/a/10077069/1498199
from collections import defaultdict
d = {t.tag: {} if t.attrib else None}
children = list(t)
if children:
dd = defaultdict(list)
for dc in map(_xml_to_dict, children):
for k, v in dc.items():
dd[k].append(v)
d = {t.tag: {k: v[0] if len(v) == 1 else v for k, v in dd.items()}}
if t.attrib:
d[t.tag].update(('@' + k, v) for k, v in t.attrib.items())
if t.text:
text = t.text.strip()
if children or t.attrib:
if text:
d[t.tag]['#text'] = text
else:
d[t.tag] = text
return d
buffer = io.StringIO(xml.strip())
if remove_namespace:
root = _xml_remove_namespace(buffer)
else:
root = ElementTree.parse(buffer).getroot()
return _xml_to_dict(root)
So let s
be the string which holds your xml. We can convert it to a dict via
d = xml_to_dict(s, remove_namespace=True)
Now the solution is straight forward:
rows = []
for program in d['programs']['program']:
cols = []
cols.append(program['orgUnitId'])
cols.append(program['programDescriptionText']['#text'])
try:
cols.append(','.join(program['requiredLevel']))
except KeyError:
cols.append('')
try:
searchwords = program['searchword']['#text']
except TypeError:
searchwords = []
for searchword in program['searchword']:
searchwords.append(searchword['#text'])
searchwords = ','.join(searchwords)
cols.append(searchwords)
rows.append(cols)
df = pd.DataFrame(rows, columns=['organization', 'description', 'level', 'keyword'])
Upvotes: 1