iceman225
iceman225

Reputation: 109

Awk parse xml to csv

i have a xml that i want to parse into csv, like i started to work with awk, i would like to continue with it but i know it is possible to do it with over language like perl also i found xmlstarlet but i don't have permission to install on server so i'am open on over solutions. So my iinput xml is

<?xml version="1.0"?>
<root>
  <record>
   <id_client>50C</id_client>  
  <data>
          <mail>[email protected]</mail>
          <adress>10  </adress>
          <num_tel>001</num_tel>
          <key>C</key>
      <contact>
        <name>toto</name>
        <birth>01/30/009</birth>
        <city>London</city>
      </contact>
  </data> 
  <data>
          <mail>[email protected]</mail>
          <adress>20</adress>
          <num_tel>02200</num_tel>
          <key>D1</key>
      <contact>
        <name>tata</name>
        <birth>02/08/2004</birth>
        <city>Bruges</city>
      </contact>
  </data> 
</record>
   <record>
   <id_client>70D</id_client>  
  <data>
          <mail>[email protected]</mail>
          <adress>7Bcd</adress>
          <num_tel>5555</num_tel>
          <key>D2</key>
      <contact>
        <name>titi</name>
        <birth>05/07/2014</birth>
        <city>Paris</city>
      </contact>
  </data>
  <data>
          <mail>[email protected]</mail>
          <adress>888</adress>
          <num_tel>881.0</num_tel>
          <key>D3</key>
      <contact>
        <name>awk</name>
        <birth>05/08/1999</birth>
        <city>Lisbone</city>
      </contact>
  </data>

I would like to output in an over file this csv with hearders

id_client;mail;num_tel;key 
50C;[email protected];001;C
50C,[email protected];02200;D1
70D;[email protected];5555;D2 
70D;[email protected];881.0;D3

Upvotes: 1

Views: 10537

Answers (5)

Ram&#243;n Gil Moreno
Ram&#243;n Gil Moreno

Reputation: 819

This answer is given in order to illustrate the text-based procedure to extract the info from the specific .xml formatting shown in the question description (the same .xml can be formatted differently -e.g. no line feeds- making the process described here unsuitable).

If possible, use a XML-specific tool as xmllint.

Text-based one liner:

cat input.xml | grep -e \<mail\> -e \<adress\> -e \<num_tel\> -e \<key\> | sed 's/<[^>]*>//g' | sed 's/^\s*//g; s/\s*$//g' | paste -d ";" - - - -

Explanation:

  1. Read input file (cat input.xml)
  2. Get the appropriate tags lines (with grep)
  3. Remove XML tags with, leaving only the tag contents (with sed)
  4. Trim spaces (with sed again; two expressions in a single sed command: one for the leading spaces and one for the traling spaces)
  5. Paste every 4 lines as columns (with paste)

Upvotes: 6

JJoao
JJoao

Reputation: 5337

#!/usr/bin/perl
use XML::DT;

my %handler=(
  -default  => sub{ $c},                # $c - element contents
  -type     => { data => "MAP" },       # data suns became (tag => $c)

  id_client => sub{ father(id=>$c);},
  data      => sub{ print father("id"),";$c->{mail};$c->{num_tel};$c->{key}\n"},
);
dt(shift, %handler);

Upvotes: 0

glenn jackman
glenn jackman

Reputation: 246744

You're going to run into lots of problems parsing XML line-by-line: XML is not a line-oriented data format.

Use an XML-specific tool. Here's how simple it can be:

xmlstarlet sel -t \
  -m / -o "id_client;mail;num_tel;key" -n -b \
  -m /root/record/data -v ../id_client -o ";" -v mail -o ";" -v num_tel -o ";" -v key -n \
file.xml
id_client;mail;num_tel;key
50C;[email protected];001;C
50C;[email protected];02200;D1
70D;[email protected];5555;D2
70D;[email protected];881.0;D3

Upvotes: 5

Wintermute
Wintermute

Reputation: 44023

With Python, which has an XML parser in its standard library and a decent chance of being preinstalled on the server to which you have to deploy:

#!/usr/bin/python

import xml.etree.ElementTree as ET
import sys

tree = ET.parse(sys.argv[1])
root = tree.getroot()

print "id_client;mail;num_tel;key"

# Rudimentary error handling: If a field is not there,
# print (nil) in its stead.    
def xml_read(node, key):
    p = node.find(key)
    if p is None:
        return "(nil)"
    return p.text

for r in root.iter("record"):
    for d in r.iter("data"):
        print xml_read(r, "id_client") + ";" + xml_read(d, "mail") + ";" + xml_read(d, "num_tel") + ";" + xml_read(d, "key")

Alternatively, if you have access to an XSLT processor (although I dare not hope for this), you could use the following stylesheet:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/root">id_client;mail;num_tel;key
<xsl:for-each select="record">
  <xsl:for-each select="data"><xsl:value-of select="../id_client"/>;<xsl:value-of select="mail"/>;<xsl:value-of select="num_tel"/>;<xsl:value-of select="key"/><xsl:text>&#xa;</xsl:text></xsl:for-each>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet> 

Use

xsltproc filename.xsl filename.xml

or

xalan -xsl filename.xsl -in filename.xml

where filename.xsl is the file that contains the above XSLT. If you have a different XSLT processor, it will work just as well; consult its manpage to see how it wants to be invoked.

Upvotes: 2

ShellFish
ShellFish

Reputation: 4551

You could try this:

awk 'BEGIN{ RS="record"; FS="[<>]" } { print $10 "," $14 "," $18 }' file

Which is not the most portable way to do it. Better would be:

awk -F'[<>]' '$2 == "mail" || $2 == "adress" { printf "%s\, ", $3 }; $2 == "num_tel" { print $3 }' a

That way you can add other lines without a problem, as long as you don't change the keys.

Upvotes: 1

Related Questions