Pravin Satav

Reputation: 702

Extracting values from an XML file into a field-delimited format using a Unix script/command

Here is a sample file whose values we need to convert into a delimited file:

test.xml

<?xml version="1.0" encoding="UTF-8" ?>
 <testjar>

 <testable>
 <trigger>Trigger1</trigger>
 <message>2012-06-14T00:03.54</message>
 <sales-info>
 <san-a>no</san-a>
 <san-b>no</san-b>
 <san-c>no</san-c>
 </sales-info>
 </testable>


  <testable>
  <trigger>Trigger2</trigger>
  <message>2012-06-15T00:03.54</message>
  <sales-info>
  <san-a>yes</san-a>
  <san-b>yes</san-b>
  <san-c>no</san-c>
  </sales-info>
 </testable>

 </testjar>

Each record should start on a new line. The result (sample.txt) should look something like this:

Trigger1|2012-06-14T00:03.54|no|no|no  
Trigger2|2012-06-15T00:03.54|yes|yes|no

Note: xmlstarlet is not installed on my server. Is it possible to do this without xmlstarlet?

Upvotes: 0

Views: 5218

Answers (3)

Shawn Chin

Reputation: 86944

If you have xmlstarlet installed, you can try:

me@home$ xmlstarlet sel -t -m "//testable" -v trigger -o "|" -v message -o "|" -m sales-info -v san-a -o "|" -v san-b -o "|" -v san-c -n test.xml
Trigger1|2012-06-14T00:03.54|no|no|no
Trigger2|2012-06-15T00:03.54|yes|yes|no

Breakdown of the command:

xmlstarlet sel -t 
    -m "//testable"       # match <testable>
      -v trigger -o "|"     # print out value of <trigger> followed by |
      -v message -o "|"     # print out value of <message> followed by | 
      -m sales-info         # match <sales-info>
        -v san-a -o "|"       # print out value of <san-a> followed by |
        -v san-b -o "|"       # print out value of <san-b> followed by | 
        -v san-c              # print out value of <san-c>
    -n                   # print new line
    test.xml             # INPUT XML FILE

To handle tags that vary within <testable>, you can try the following, which returns the text of all leaf nodes:

me@home$ xmlstarlet sel -t -m "//testable" -m "descendant::*[not(*)]" -v 'text()' -i 'not(position()=last())' -o '|' -b -b -n test.xml
Trigger1|2012-06-14T00:03.54|no|no|no
Trigger2|2012-06-15T00:03.54|yes|yes|no

Breakdown of the command:

xmlstarlet sel -t 
    -m "//testable"                         # match <testable>
      -m "descendant::*[not(*)]"              # match all leaf nodes
        -v 'text()'                             # print text
        -i 'not(position()=last())' -o '|'      # print | if not last item
        -b -b                                   # close the -i block, then the inner -m
    -n                                      # print new line
    test.xml                                # INPUT XML FILE

If you do not have access to xmlstarlet, check which other XML tools are available on your server. Options include xsltproc (see mzjn's answer) and xpath.

If those tools are not available, I would suggest using a higher level language (Python, Perl) which gives you access to a proper XML library.
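As a rough sketch of the higher-level-language route, the following Python uses only the standard library's xml.etree.ElementTree, so nothing beyond Python itself should need installing. The names `FIELDS`, `SAMPLE` and `xml_to_delimited` are made up for illustration; the element paths are taken from the sample file in the question.

```python
import xml.etree.ElementTree as ET

# Element paths relative to each <testable>, taken from the question's sample.
FIELDS = ["trigger", "message",
          "sales-info/san-a", "sales-info/san-b", "sales-info/san-c"]

# Inline copy of the sample data so the sketch runs without test.xml.
SAMPLE = """<testjar>
 <testable>
  <trigger>Trigger1</trigger>
  <message>2012-06-14T00:03.54</message>
  <sales-info><san-a>no</san-a><san-b>no</san-b><san-c>no</san-c></sales-info>
 </testable>
 <testable>
  <trigger>Trigger2</trigger>
  <message>2012-06-15T00:03.54</message>
  <sales-info><san-a>yes</san-a><san-b>yes</san-b><san-c>no</san-c></sales-info>
 </testable>
</testjar>"""


def xml_to_delimited(xml_text, sep="|"):
    """Return one delimited line per <testable> element."""
    root = ET.fromstring(xml_text)
    lines = []
    for record in root.iter("testable"):
        # A missing element yields an empty field instead of an error.
        values = [(record.findtext(path) or "").strip() for path in FIELDS]
        lines.append(sep.join(values))
    return lines


if __name__ == "__main__":
    for line in xml_to_delimited(SAMPLE):
        print(line)
```

To read the real file instead of the inline sample, `ET.parse("test.xml").getroot()` could replace the `ET.fromstring` call. Because the parser works on the element tree rather than on lines, the layout of the file (indentation, several elements per line) should not matter.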

While it is possible to parse the file manually using regular expressions, such a solution is not ideal, especially with inconsistent input. For example, the following (assuming you have gawk and sed) takes your input and should produce the expected output:

me@home$ gawk 'match($0, />(.*)</, a){printf("%s|",a[1])} /<\/testable>/{print ""}' test.xml | sed 's/.$//'
Trigger1|2012-06-14T00:03.54|no|no|no
Trigger2|2012-06-15T00:03.54|yes|yes|no

However, this would fail miserably if the input format changes and is therefore not a solution I would generally recommend.
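To make the fragility concrete, here is a small Python sketch that mimics the per-line greedy match of the gawk command and compares it with a real parser. The single-line input `ONE_LINE` is a hypothetical variant of the sample (not something from the question), and both function names are invented for this illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical variant of the sample: the same kind of record on one line.
ONE_LINE = ("<testjar><testable><trigger>Trigger1</trigger>"
            "<message>2012-06-14T00:03.54</message></testable></testjar>")


def regex_values(text):
    """Mimic the gawk approach: one greedy '>(.*)<' capture per line."""
    values = []
    for line in text.splitlines():
        m = re.search(r">(.*)<", line)
        if m:
            values.append(m.group(1))
    return values


def parsed_values(text):
    """Collect the text of every leaf element using an XML parser."""
    root = ET.fromstring(text)
    return [el.text for el in root.iter() if len(el) == 0]


if __name__ == "__main__":
    print(regex_values(ONE_LINE))   # a single value that still contains markup
    print(parsed_values(ONE_LINE))  # the two leaf values, as intended
```

With everything on one line, the greedy pattern captures from the first `>` to the last `<`, so tags end up inside the "value", whereas the parser still returns the two leaf texts.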

Upvotes: 1

Costi Ciudatu

Reputation: 38255

Here's a solution that needs only standard shell tools (grep, sed and a bash loop):

grep -E '<trigger>|<message>|<san-.>' test.xml | sed -e 's/<[^>]*>//g' | while read -r line; do [ $((++i % 5)) -ne 0 ] && printf '%s|' "$line" || printf '%s\n' "$line"; done

However, it only works on a file formatted like your sample (each element on its own line, five values per record); it is not nearly as flexible or reliable as the other answers, which use proper XML parsing or transformation.

It can be enhanced to some extent though...

Upvotes: 1

mzjn

Reputation: 51032

Here is an XSLT stylesheet that does what you want (saved in test.xsl):

<?xml version='1.0'?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

<xsl:output method="text"/>
<xsl:strip-space elements="*"/>

 <xsl:template match="testable">
   <xsl:value-of select='trigger'/><xsl:text>|</xsl:text>
   <xsl:value-of select='message'/><xsl:text>|</xsl:text>
   <xsl:value-of select='sales-info/san-a'/><xsl:text>|</xsl:text>
   <xsl:value-of select='sales-info/san-b'/><xsl:text>|</xsl:text>
   <xsl:value-of select='sales-info/san-c'/><xsl:text>&#xA;</xsl:text>
 </xsl:template>

</xsl:stylesheet>

Command (here I am assuming that you have libxml2 and libxslt installed; xsltproc is a command line tool that uses these libraries):

xsltproc -o sample.txt test.xsl test.xml

Contents of sample.txt:

Trigger1|2012-06-14T00:03.54|no|no|no
Trigger2|2012-06-15T00:03.54|yes|yes|no

Upvotes: 1
