I. Iudice

Reputation: 55

Extract specific keywords from XML file with bash script

I have an XML file containing some entries characterized by specific keywords. I need to loop over the entries and, for each of them, extract two keywords to use as variables inside the loop.

Here is an example of list.xml:

<?xml version="1.0" encoding="UTF-8"?>
<responses type="C-FIND">
  <data-set xfer="1.2.840.10008.1.2.1" name="Little Endian Explicit">
    <element tag="0008,0005" vr="CS" vm="1" len="10" name="SpecificCharacterSet">ISO_IR 192</element>
    <element tag="0008,0052" vr="CS" vm="1" len="6" name="QueryRetrieveLevel">STUDY</element>
    <element tag="0008,0054" vr="AE" vm="1" len="8" name="RetrieveAETitle">PLATONE</element>
    <element tag="0010,0010" vr="PN" vm="1" len="16" name="PatientName">Anon^1600373003</element>
    <element tag="0020,000d" vr="UI" vm="1" len="42" name="StudyInstanceUID">1.3.76.13.99972.2.20181217085753.1484038.1</element>
  </data-set>
  <data-set xfer="1.2.840.10008.1.2.1" name="Little Endian Explicit">
    <element tag="0008,0005" vr="CS" vm="1" len="10" name="SpecificCharacterSet">ISO_IR 192</element>
    <element tag="0008,0052" vr="CS" vm="1" len="6" name="QueryRetrieveLevel">STUDY</element>
    <element tag="0008,0054" vr="AE" vm="1" len="8" name="RetrieveAETitle">PLATONE</element>
    <element tag="0010,0010" vr="PN" vm="1" len="16" name="PatientName">Anon^1599844862</element>
    <element tag="0020,000d" vr="UI" vm="1" len="42" name="StudyInstanceUID">1.3.76.13.99972.2.20180925142630.1456727.1</element>
  </data-set>
</responses>

I need to extract the keywords "PatientName" and "StudyInstanceUID". I tried to use something like this:

grep -A2 -i "PatientName" list.xml | while read -r string ; do
    PatientName="$(echo $string | grep -i "PatientName" | cut -d ">" -f 2 | cut -d "<" -f 1)"
    StudyInstanceUID="$(echo $string | grep -i "StudyInstanceUID" | cut -d ">" -f 2 | cut -d "<" -f 1)"
    echo "$PatientName"
    echo "$StudyInstanceUID"
done

The problem is that I obtain a lot of empty rows! What's the problem?

[EDIT] What I would like to obtain from this example is something like this:

Anon^1600373003
1.3.76.13.99972.2.20181217085753.1484038.1
Anon^1599844862
1.3.76.13.99972.2.20180925142630.1456727.1

Thanks so much.

Ivan

Upvotes: 2

Views: 1858

Answers (3)

Reino

Reputation: 3423

awk and sed are not designed to process XML. Please use a dedicated tool instead. I can recommend xidel.

Stdout:

$ xidel -s list.xml -e '
  //data-set/(
    element[@name="PatientName"],
    element[@name="StudyInstanceUID"]
  )
'
Anon^1600373003
1.3.76.13.99972.2.20181217085753.1484038.1
Anon^1599844862
1.3.76.13.99972.2.20180925142630.1456727.1

Variables:

$ xidel -s list.xml -e '
  //data-set/(
    eval(x"{concat("pn",position())}:=element[@name=""PatientName""]")[0],
    eval(x"{concat("si",position())}:=element[@name=""StudyInstanceUID""]")[0]
  )
'
pn1 := Anon^1600373003
si1 := 1.3.76.13.99972.2.20181217085753.1484038.1
pn2 := Anon^1599844862
si2 := 1.3.76.13.99972.2.20180925142630.1456727.1

These are internal variables that are just printed to stdout. Use --output-format=bash and Bash's built-in eval command to convert them to shell variables.

$ eval $(xidel -s list.xml -e '
  //data-set/(
    eval(x"{concat("pn",position())}:=element[@name=""PatientName""]")[0],
    eval(x"{concat("si",position())}:=element[@name=""StudyInstanceUID""]")[0]
  )
' --output-format=bash)

$ printf '%s\n' $pn1 $si1 $pn2 $si2
Anon^1600373003
1.3.76.13.99972.2.20181217085753.1484038.1
Anon^1599844862
1.3.76.13.99972.2.20180925142630.1456727.1

Upvotes: 0

markp-fuso

Reputation: 34334

As Raman alluded to in the comment, using an XML-aware tool to parse XML data is probably your best bet, especially if some of your XML may not be formatted as displayed in the question (eg, everything on one long line).

Assumptions:

  • you can confirm all of your data will be formatted like the samples in the question (ie, each element is on a separate line)
  • the search strings PatientName and StudyInstanceUID do not show up in larger strings (eg, LastPatientName or PreviousStudyInstanceUID)
  • the PatientName element is always listed before the StudyInstanceUID element

One awk solution which eliminates the need for all of the sub-process calls to echo, grep and cut:

awk -F'[<>]' '                                    # define input field separators as "<" and ">"
/PatientName/ || /StudyInstanceUID/ { print $3 }  # if we find one of our search strings then print field #3
' list.xml

The same as a one-liner, sans comments:

awk -F'[<>]' '/PatientName/ || /StudyInstanceUID/ { print $3 }' list.xml

The above generates:

Anon^1600373003
1.3.76.13.99972.2.20181217085753.1484038.1
Anon^1599844862
1.3.76.13.99972.2.20180925142630.1456727.1

As for capturing the output into variables (eg, within a while loop), we can make some small changes, eg:

awk -F'[<>]' '
/PatientName/      { pn=$3 }                      # store field #3 in variable "pn"
/StudyInstanceUID/ { printf "%s %s\n", pn, $3 }   # print data to stdout
' list.xml

This will generate:

Anon^1600373003 1.3.76.13.99972.2.20181217085753.1484038.1
Anon^1599844862 1.3.76.13.99972.2.20180925142630.1456727.1

Feeding this into a while loop:

while read -r PatientName StudyInstanceUID
do
    echo "+++++++++++++++++++"
    echo "PatientName:      ${PatientName}"
    echo "StudyInstanceUID: ${StudyInstanceUID}"
done < <(awk -F'[<>]' ' /PatientName/ { pn=$3 } /StudyInstanceUID/ { printf "%s %s\n", pn, $3 } ' list.xml)

And this generates:

+++++++++++++++++++
PatientName:      Anon^1600373003
StudyInstanceUID: 1.3.76.13.99972.2.20181217085753.1484038.1
+++++++++++++++++++
PatientName:      Anon^1599844862
StudyInstanceUID: 1.3.76.13.99972.2.20180925142630.1456727.1
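
One caveat with the loop above: read splits on whitespace, so a PatientName that contains a space would spill into StudyInstanceUID, and a bare /PatientName/ pattern also matches a name such as LastPatientName. A minimal sketch of one way to tighten both points, still assuming one element per line: match the whole name="..." attribute and separate the two values with a tab. The function name print_studies is hypothetical, and a plain pipe is used here so the sketch also runs in sh; variables set inside a piped loop are usually not visible after it, so keep the < <(...) form if you need them afterwards in bash.

```shell
# print_studies FILE: hypothetical helper, prints one labelled pair per data-set.
# Assumes each <element> sits on its own line, as in list.xml above.
print_studies() {
    tab=$(printf '\t')                       # a literal tab, used as the only field separator
    awk -F'[<>]' '
        /name="PatientName"/      { pn = $3 }                    # remember the patient name
        /name="StudyInstanceUID"/ { printf "%s\t%s\n", pn, $3 }  # emit name<TAB>uid
    ' "$1" |
    while IFS="$tab" read -r PatientName StudyInstanceUID
    do
        echo "PatientName:      ${PatientName}"
        echo "StudyInstanceUID: ${StudyInstanceUID}"
    done
}
```

Call it as print_studies list.xml; the body of the while loop is where both values are available together.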

Upvotes: 1

Lety

Reputation: 2603

Command:

grep -A2 -i "PatientName" list.xml

returns multiple lines:

    <element tag="0010,0010" vr="PN" vm="1" len="16" name="PatientName">Anon^1600373003</element>
    <element tag="0020,000d" vr="UI" vm="1" len="42" name="StudyInstanceUID">1.3.76.13.99972.2.20181217085753.1484038.1</element>
  </data-set>
--
    <element tag="0010,0010" vr="PN" vm="1" len="16" name="PatientName">Anon^1599844862</element>
    <element tag="0020,000d" vr="UI" vm="1" len="42" name="StudyInstanceUID">1.3.76.13.99972.2.20180925142630.1456727.1</element>
  </data-set>

so your while loop reads this output line by line. The result you get is correct, because on the line:

<element tag="0010,0010" vr="PN" vm="1" len="16" name="PatientName">Anon^1600373003</element>

StudyInstanceUID is not present and your variable will be empty.
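
To see that diagnosis in action, here is a minimal sketch of the loop from the question with one change: a value is printed only when it is non-empty. The function name print_nonempty is hypothetical. This only removes the blank rows; the two values still never exist in the same iteration, so it does not give you both variables at once.

```shell
# print_nonempty FILE: hypothetical variant of the loop from the question.
# Each input line can match at most one of the two names, so on every
# iteration at least one of the variables is empty; the guards skip those.
print_nonempty() {
    grep -A2 -i "PatientName" "$1" | while IFS= read -r string
    do
        PatientName=$(printf '%s\n' "$string" | grep -i "PatientName" | cut -d ">" -f 2 | cut -d "<" -f 1)
        StudyInstanceUID=$(printf '%s\n' "$string" | grep -i "StudyInstanceUID" | cut -d ">" -f 2 | cut -d "<" -f 1)
        if [ -n "$PatientName" ]; then echo "$PatientName"; fi
        if [ -n "$StudyInstanceUID" ]; then echo "$StudyInstanceUID"; fi
    done
}
```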

In order to get the desired result, try this:

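grep -A1 -i "PatientName" list.xml | while read -r string ; do
    # first line of each group: the PatientName element
    PatientName="$(echo "$string" | grep -i "PatientName" | cut -d ">" -f 2 | cut -d "<" -f 1)"
    # second line of each group: the StudyInstanceUID element
    read -r string
    StudyInstanceUID="$(echo "$string" | grep -i "StudyInstanceUID" | cut -d ">" -f 2 | cut -d "<" -f 1)"
    echo "$PatientName"
    echo "$StudyInstanceUID"
    # skip the "--" separator that grep prints between groups
    read -r string
done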

Using read string you will get the next line, but be careful: this only works if the lines are in that order.
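
If the order is not guaranteed, one possible alternative is to remember whichever element shows up and print the pair when the closing </data-set> line is reached. This is only a sketch under the same one-element-per-line assumption; extract_pairs is a hypothetical name, and it uses shell pattern matching and parameter expansion instead of grep and cut.

```shell
# extract_pairs FILE: hypothetical helper that does not depend on the order of
# the two elements inside a data-set, only on one element per line.
extract_pairs() {
    PatientName= StudyInstanceUID=
    while IFS= read -r line
    do
        case $line in
            *'name="PatientName"'*)
                value=${line#*>}                  # drop everything up to the first ">"
                PatientName=${value%%<*} ;;       # drop everything from the next "<"
            *'name="StudyInstanceUID"'*)
                value=${line#*>}
                StudyInstanceUID=${value%%<*} ;;
            *'</data-set>'*)
                # end of one entry: both variables are set at this point
                echo "$PatientName"
                echo "$StudyInstanceUID"
                PatientName= StudyInstanceUID= ;;
        esac
    done < "$1"
}
```

Call it as extract_pairs list.xml and replace the two echo lines with whatever you need to do with the pair.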

Upvotes: 0
