kirti desai
kirti desai

Reputation: 13

Converting multiple CSV files into single XML

I am trying to convert multiple csv files for(two for now) into xml using Element tree but I am not getting exact output. Please guide me with more efficient approach. PS:I'm a beginner here.

import csv
import xml.etree.ElementTree as ET
#from bs4 import BeautifulSoup

root = ET.Element('Policy')

with open("policy.csv","r") as p, open("Att.csv","r") as a, open("rider.csv","r") as r:
  csv_p = csv.reader(p)
  header_p = next(csv_p)
  csv_a = csv.reader(a)
  header_a = next(csv_a) 
  csv_r = csv.reader(r)
  header_r = next(csv_r)
  for row in csv_p:
    pid = row[0]
    print("\n",pid)
    for col in range(len(header_p)):
      ET.SubElement(root, header_p[col]).text = str(row[col])
      for childrow in csv_a:
        if(pid == childrow[0]):
          print("Match found")
          child = ET.SubElement(root,"child")
          for col_a in range(len(header_a)):
            ET.SubElement(child, header_a[col_a]).text = str(childrow[col_a])
            for tailrow in csv_r:
              if(childrow[1] == tailrow[0]):
                print("tail found",tailrow[0])
                tail = ET.SubElement(child,"tail")
                for col_r in range(len(header_r)):
                  ET.SubElement(tail, header_r[col_r]).text = str(tailrow[col_r])  
          r.seek(0)
    a.seek(0)

tree = ET.tostring(root, encoding="UTF-8")
#print(BeautifulSoup(tree, "xml").prettify())

with open("Output.xml", "wb") as f:
    f.write(tree)

with open('Output.xml', 'r') as f:
    print("\n\n",f.read())

The output looks like below but as you can see some tags being repeated becuase they are redundant in files i'm reading :

Policy.csv:

Pid,Name,Date 
101,Life In,3Jan2017
102,Mobile,8Aug2018 

Att.csv:

PId,AId,Name  
101,9001,Pune
101,9002,Mumbai  
102,9003,Delhi

rider.csv:

AId,RID,Name
9001,10001,Ramesh 
9001,10002,Suresh 
9002,10003,Rahul 
9002,10004,Kirti

Output:

<Policy>
    <Pid>101</Pid>
        <child>
            <PId>101</PId>
                <tail><AId>9001</AId>
                        <RID>10001</RID>
                        <Name>Ramesh</Name>
                </tail>
                <tail>
                    <AId>9001</AId>
                    <RID>10002</RID>
                    <Name>Suresh</Name>
                </tail>
                <AId>9001</AId>
                <Name>Pune</Name>
        </child>
        <child>
            <PId>101</PId>
                <tail><AId>9002</AId>
                    <RID>10003</RID>
                    <Name>Rahul</Name>
                </tail>
                <tail><AId>9002</AId>
                    <RID>10004</RID>
                    <Name>Kirti</Name>
                </tail>
                    <AId>9002</AId>
                    <Name>Mumbai</Name>
        </child>
        <Name>Life In</Name>
        <Date>3Jan2017</Date>
</Policy>

Instance of Desired Output:

<Policy>
    <Pid>101</Pid>
    <child>
      <AId>9001</AId>
        <tail>
          <RID>10001</RID>
          <Name>Ramesh</Name>
        </tail>
        <tail>                    
          <RID>10002</RID>
          <Name>Suresh</Name>
        </tail>          
      <Name>Pune</Name>
    </child>
    <Name>Life In</Name>
  <Date>3Jan2017</Date>
</Policy>

Upvotes: 0

Views: 702

Answers (1)

Daniel Haley
Daniel Haley

Reputation: 52858

If you are able to use lxml, here's an example of what I was talking about in the comments.

Hopefully I have the logic correct:

  • A policy is based on a row in Policy.csv. It's uniquely identified by Pid.
  • A child in policy is based on a row in Att.csv that has a matching PId.
  • A tail in child is based on a row in rider.csv that has a matching AId.

The first thing I would do is convert the csv into a temporary XML format.

Since the header values of your csv files would be valid element names, I'd go ahead and create elements based on those values.

If your csv files might have header values that aren't valid element names, you could use a generic element name and store the header value in an attribute. (I can change my example if needed.)

Then I'd transform the temporary XML and handle all of the grouping there. Since lxml only supports XSLT 1.0, we'd have to use Muenchian Grouping.

Example...

Python

import csv
from os import path
from lxml import etree


def csv2xml(file):
    result = etree.Element(path.splitext(file)[0])
    with open(file) as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            row_elem = etree.SubElement(result, "row")
            for entry in row:
                entry_elem = etree.SubElement(row_elem, entry.strip().lower())
                entry_elem.text = row.get(entry).strip()
    return result


csv_files = ["policy.csv", "att.csv", "rider.csv"]

temp_xml = etree.Element("policies")

for csv_file in csv_files:
    xml = csv2xml(csv_file)
    temp_xml.append(xml)

xslt = etree.parse("transform.xsl")

xml_output = etree.ElementTree(temp_xml).xslt(xslt)

print(etree.tostring(xml_output, pretty_print=True).decode())

XSLT (transform.xsl)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>

  <xsl:key name="policy" match="policy/row" use="pid"/>
  <xsl:key name="att" match="att/row" use="pid"/>
  <xsl:key name="rider" match="rider/row" use="aid"/>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="/*">
    <xsl:copy>
      <xsl:apply-templates select="policy"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="policy">
      <xsl:for-each select="row[count(.|key('policy', pid)[1])=1]">
        <policy>
          <xsl:apply-templates select="pid"/>
          <xsl:apply-templates select="key('att', pid)"/>
          <xsl:apply-templates select="name|date"/>
        </policy>
      </xsl:for-each>
  </xsl:template>

  <xsl:template match="att/row">
    <child>
      <xsl:apply-templates select="aid"/>
      <xsl:apply-templates select="key('rider', aid)"/>
      <xsl:apply-templates select="name"/>
    </child>
  </xsl:template>

  <xsl:template match="rider/row">
    <tail>
      <xsl:apply-templates select="rid|name"/>
    </tail>
  </xsl:template>

</xsl:stylesheet>

The Python will print this output:

<policies>
  <policy>
    <pid>101</pid>
    <child>
      <aid>9001</aid>
      <tail>
        <rid>10001</rid>
        <name>Ramesh</name>
      </tail>
      <tail>
        <rid>10002</rid>
        <name>Suresh</name>
      </tail>
      <name>Pune</name>
    </child>
    <child>
      <aid>9002</aid>
      <tail>
        <rid>10003</rid>
        <name>Rahul</name>
      </tail>
      <tail>
        <rid>10004</rid>
        <name>Kirti</name>
      </tail>
      <name>Mumbai</name>
    </child>
    <name>Life In</name>
    <date>3Jan2017</date>
  </policy>
  <policy>
    <pid>102</pid>
    <child>
      <aid>9003</aid>
      <name>Delhi</name>
    </child>
    <name>Mobile</name>
    <date>8Aug2018</date>
  </policy>
</policies>

Hopefully this helps.

Upvotes: 1

Related Questions