Deminem

Reputation: 672

Extract text from XML tags using sed - shell script

I have already written a script that takes an XML file as input and extracts the text of specific XML tags, and it works. However, it cannot handle multiline text or special characters. It is very important that the text format is kept intact, exactly as it is defined inside the tags.

Below is the XML input:

<nick>Deminem</nick>
<company>XYZ Solutions</company>
<description>
  /**
   * 
   *  «Lorem» ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy
   *  tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. 
   *  At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd 
   *  no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit 
   *  consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore
   *  magna aliquyam erat, sed diam voluptua.
   *
   **/
</description> 

The script below extracts the text of each tag and assigns it to a new valueArray. My command of sed is basic, but I am always willing to go the extra mile.

tagsArray=( nick company description )
noOfElements=${#tagsArray[@]}

for (( i=0;i<$noOfElements;i++)); do

OUT=`grep ${tagsArray[${i}]} filename.xml | tr -d '\t' | sed -e 's/^<.*>\([^<].*\)<.*>$/\1/' `

valueArray[${i}]=${OUT}
done 

Upvotes: 1

Views: 6649

Answers (2)

Sanjay

Reputation: 1

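#!/bin/sh
filePath=$1 # XML file path
tagName=$2  # Tag name to fetch values for
# Split records on the opening and closing tag (a regex RS needs gawk or mawk),
# then print only the records that do not themselves contain a tag.
awk '!/<.*>/' RS="<${tagName}>|</${tagName}>" "$filePath"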
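
The one-liner above depends on an awk that accepts a regular expression as RS (gawk and mawk do; POSIX leaves a multi-character RS unspecified). As a rough, more portable sketch of the same idea, the function below walks the file line by line and prints whatever lies between the first opening tag and its closing tag, keeping line breaks and indentation. The name extract_tag is made up for illustration, and the sketch assumes tags without attributes that are not nested inside themselves. It is still plain text matching, not an XML parser, so entities such as &amp; are not decoded.

```shell
#!/bin/sh
# extract_tag TAG FILE  (hypothetical helper, not part of any tool)
# Print the text between the first <TAG> and the following </TAG>,
# preserving line breaks and leading whitespace.
# Assumes: no attributes on the tag, no nesting of the same tag.
# Usage: value=$(extract_tag description filename.xml)
extract_tag() {
    awk -v tag="$1" '
        BEGIN { open_tag = "<" tag ">"; close_tag = "</" tag ">" }
        {
            line = $0
            if (!inside) {
                pos = index(line, open_tag)
                if (pos == 0) next                 # element has not started yet
                line = substr(line, pos + length(open_tag))
                inside = 1
                first = 1                          # this is the opening-tag line
            }
            pos = index(line, close_tag)
            if (pos > 0) {
                line = substr(line, 1, pos - 1)    # keep text before the closing tag
                if (line != "") print line
                exit                               # stop after the first element
            }
            # Skip the empty remainder when the opening tag sits on its own line.
            if (!(first && line == "")) print line
            first = 0
        }
    ' "$2"
}
```

In the question's loop this could be used as valueArray[i]=$(extract_tag "${tagsArray[i]}" filename.xml). Printing the value later with printf '%s\n' "${valueArray[i]}" (quoted) keeps the multiline layout intact.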

Upvotes: 0

Anders Lindahl

Reputation: 42870

Parsing XML with regexp leads to trouble eventually, just as you have experienced. Take the time to learn enough XSL (there are many tutorials) to transform the XML properly, using for example xsltproc.

Edit:

After trying out a few command-line XML utilities, I think xmlstarlet could be the tool for you. The following is untested and assumes that filename.xml is a proper XML file (i.e. has a single root element, called root in the XPath below).

tagsArray=( nick company description )
noOfElements=${#tagsArray[@]}

for (( i=0;i<$noOfElements;i++)); do
    valueArray[i]=$(xmlstarlet sel -t -v "/root/${tagsArray[i]}" filename.xml)
done

Upvotes: 3
