raven
raven

Reputation: 17

Is there a simple way to convert xml format to csv using python?

I have the below xml that i would want to convert to csv ( delimiter like '|' or octal ) using python. I tried converting the xml into dict and then to csv . I am looking if there are any easy or efficient methods to do so.

Problems with the below code :

  1. There is a tag xyz , which isn't needed in csv , so how to escape or ignore that? All i want is the data from tag abc.
  2. Would want the nested tag as prefix to underlying keys
  3. Expected output is a line with keys as header and the values below it.

XML sample :

    <?xml version="1.0" encoding="utf-8"?>
    <xyz date="2019-07-01T09:00:29">
      <abc>
        <id>23</id>
        <uniqueid>23_0</uniqueid>
        <Name></Name>
        <Rate>
          <mrp>6.40000</mrp>
          <discount>10.00%</discount>
          <discountmonths>2</discountmonths>
        </Rate>
        <fee>
          <type>off</type>
          <minimumfee>£1,500.75</minimumfee>
          <maxfee>£10K</maxfee>
        </fee>
      </abc>
      <abc>
        <id>35</id>
        <uniqueid>35_0</uniqueid>
        <Name></Name>
        <Rate>
          <mrp>7.90000</mrp>
          <discount>5.00%</discount>
          <discountmonths>5</discountmonths>
        </Rate>
        <fee>
          <type>offer</type>
          <minimumfee>£1k</minimumfee>
          <maxfee>£22,000</maxfee>
        </fee>
      </abc>
    </xyz>

Code : ( don't mind the immaturity in the code, just a beginner , mostly a copy paste from multiple sites.. trying to understand if this works and then thought i will know where to start or modify the existing )

    import xml.etree.ElementTree as ET
    import xmltodict
    import csv
    tree = ET.parse('myxml_1.xml')
    xml_data = tree.getroot()
    xmlstr = ET.tostring(xml_data, encoding='utf-8', method='xml')
    data_dict = dict(xmltodict.parse(xmlstr))

    with open('test1.csv','w') as f:
        w = csv.writer(f)
        w.writerow(data_dict.keys())
        w.writerow(data_dict.values())

output expected :

    id|uniqueid|Name|rate_mrp|rate_discount|rate_discountmonths|fee_type|fee_minimumfee|fee_maxfee
    23|23_0||6.40000|10.00%|2|off|£1,500.75|£10K
    35|35_0||7.90000|5.00%|5|offer|£1k|£22,000

Upvotes: 0

Views: 477

Answers (1)

DeepSpace
DeepSpace

Reputation: 81594

I'd do this in a very explicit way rather than trying to hack xmltodict to fit your needs.

The only downside I see with this approach is a bit of repetition with the hardcoded headers and tags names.

Also, I don't know how regular you input XML is going to be. If it's possible that some of the tags will not be present then you will need to add some error handling (because node.find will return None, then .text will cause an AttributeError).

rows = []
for abc_node in tree.findall('abc'):
    rate_node = abc_node.find('Rate')
    fee_node = abc_node.find('fee')
    row = {'id': abc_node.find('id').text,
           'uniqueid': abc_node.find('uniqueid').text,
           'Name': abc_node.find('Name').text,
           'rate_mrp': rate_node.find('mrp').text,
           'rate_discount': rate_node.find('discount').text,
           'rate_discountmonths': rate_node.find('discountmonths').text,
           'fee_type': fee_node.find('type').text,
           'fee_minimumfee': fee_node.find('minimumfee').text,
           'fee_maxfee': fee_node.find('maxfee').text}
    rows.append(row)

with open('test.csv', 'w', encoding='utf8') as f:
    headers = ['id', 'uniqueid', 'Name', 'rate_mrp', 'rate_discount', 'rate_discountmonths',
               'fee_type', 'fee_minimumfee', 'fee_maxfee']
    dict_writer = csv.DictWriter(f, fieldnames=headers, lineterminator='\n')
    dict_writer.writeheader()
    dict_writer.writerows(rows)

Output

id,uniqueid,Name,rate_mrp,rate_discount,rate_discountmonths,fee_type,fee_minimumfee,fee_maxfee
23,23_0,,6.40000,10.00%,2,off,"£1,500.75",£10K
35,35_0,,7.90000,5.00%,5,offer,£1k,"£22,000" 

If you want | as delimiter just add delimiter='|' to csv.DictWriter(f, fieldnames=headers, lineterminator='\n')

then the output is

id|uniqueid|Name|rate_mrp|rate_discount|rate_discountmonths|fee_type|fee_minimumfee|fee_maxfee
23|23_0||6.40000|10.00%|2|off|£1,500.75|£10K
35|35_0||7.90000|5.00%|5|offer|£1k|£22,000

Upvotes: 1

Related Questions