Reputation: 45
I'm currently in the middle of converting a complex XML file to csv or pandas df. I have zero experience with xml data format and all the code suggestions I found online are just not working for me. Can anyone kindly help me with this?
There are lots of elements in the data that I do not need so I won't include those here.
For privacy reasons I won't be uploading the original data here but I'll be sharing what the structure looks like.
<RefData>
<Attributes>
<Id>1011</Id>
<FullName>xxxx</FullName>
<ShortName>xx</ShortName>
<Country>UK</Country>
<Currency>GBP</Currency>
</Attributes>
<PolicyID>000</PolicyID>
<TradeDetails>
<UniqueTradeId>000</UniqueTradeId>
<Booking>UK</Booking>
<Date>12/2/2019</Date>
</TradeDetails>
</RefData>
<RefData>
<Attributes>
<Id>1012</Id>
<FullName>xxx2</FullName>
<ShortName>x2</ShortName>
<Country>UK</Country>
<Currency>GBP</Currency>
</Attributes>
<PolicyID>002</PolicyID>
<TradeDetails>
<UniqueTradeId>0022</UniqueTradeId>
<Booking>UK</Booking>
<Date>12/3/2019</Date>
</TradeDetails>
</RefData>
I would be needing everything in the tag.
Ideally I want the headers and output to look like this:
I would sincerely appreciate any help I can get on this. Thanks a mil.
Upvotes: 2
Views: 1281
Reputation: 24930
Another way to do it, using lxml and xpath:
from lxml import etree
dat = """[your FIXED xml]"""
doc = etree.fromstring(dat)
columns = []
rows = []
to_delete = ["TradeDetails",'Attributes']
body = doc.xpath('.//RefData')
for el in body[0].xpath('.//*'):
columns.append(el.tag)
for b in body:
items = b.xpath('.//*')
row = []
for item in items:
if item.tag not in to_delete:
row.append(item.text)
rows.append(row)
for col in to_delete:
if col in columns:
columns.remove(col)
pd.DataFrame(rows,columns=columns)
Output is the dataframe indicated in your question.
Upvotes: 1
Reputation: 30971
One correction concerning your input XML file: It has to contain a single main element (of any name) and within it your RefData elements.
So the input file actually contains:
<Main>
<RefData>
...
</RefData>
<RefData>
...
</RefData>
</Main>
To process the input XML file, you can use lxml package, so to import it start from:
from lxml import etree as et
Then I noticed that you actually don't need the whole parsed XML tree, so the usually applied scheme is to:
So my code looks like below:
rows = []
for _, elem in et.iterparse('RefData.xml', tag='RefData'):
rows.append({'id': elem.findtext('Attributes/Id'),
'fullname': elem.findtext('Attributes/FullName'),
'shortname': elem.findtext('Attributes/ShortName'),
'country': elem.findtext('Attributes/Country'),
'currency': elem.findtext('Attributes/Currency'),
'Policy ID': elem.findtext('PolicyID'),
'UniqueTradeId': elem.findtext('TradeDetails/UniqueTradeId'),
'Booking': elem.findtext('TradeDetails/Booking'),
'Date': elem.findtext('TradeDetails/Date')
})
elem.clear()
elem.getparent().remove(elem)
df = pd.DataFrame(rows)
To fully comprehend details, search the Web for description of lxml and each method used.
For your sample data the result is:
id fullname shortname country currency Policy ID UniqueTradeId Booking Date
0 1011 xxxx xx UK GBP 000 000 UK 12/2/2019
1 1012 xxx2 x2 UK GBP 002 0022 UK 12/3/2019
Probably the last step to perform is to save the above DataFrame in a CSV file, but I suppose you know how to do it.
Upvotes: 4