Jimm

Reputation: 8505

Search within xml file(s) in linux

I have a few very large (10 GB) XML files with the following structure. As you can see, each file contains a series of records. What I would like to do is search for records based on one or more properties. The problem is that a simple grep gives me only the lines containing the property: for example, grep might report line 100, line 300, and so on. What I need is the ability to extract the whole matching record element, not just the lines that matched. Are there any Unix utilities that can help?

<records>
 <record seq="1">
  <properties>
   <property name="AssetId">1234</property>
  </properties>
  <message>messsage1</message>
 </record>
 <record seq="2">
  <properties>
   <property name="VI-ID">4567</property>
  </properties>
  <message>message2</message>
 </record>
</records>

Upvotes: 0

Views: 4722

Answers (3)

Charles Duffy

Reputation: 295649

xmlstarlet allows you to run XPath from shell scripts; this is a perfect use case.

For instance:

xmlstarlet sel -t \
  -m '//record[properties/property[@name="AssetId"][text()="1234"]]' \
  -c . \
  <input.xml

will print the entire record having an AssetId property of 1234.

If you want to do multiple matches within one pass, this is supported too:

xmlstarlet sel \
  -t -m '//record[properties/property[@name="AssetId"][text()="1234"]]' \
     -c . -n -n \
  -t -m '//record[properties/property/@name="VI-ID"]' \
     -c . -n -n \
  <input.xml

...this version prints every record with an AssetId of 1234, as well as every record that has a VI-ID property with any value, and puts two newlines after each record emitted.
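The question also asks about matching on several properties at once. As a minimal sketch, assuming both conditions must hold for the same record, XPath's `and` operator can combine them inside a single predicate. The function name find_both and the VI-ID value 4567 are illustrative; they come from the question's sample, not from anything above.

```shell
# find_both is a hypothetical helper name, not part of xmlstarlet.
# It prints each record that has AssetId 1234 AND VI-ID 4567,
# followed by one newline.
# Usage: find_both input.xml
find_both() {
  xmlstarlet sel -t \
    -m '//record[properties/property[@name="AssetId"]="1234" and properties/property[@name="VI-ID"]="4567"]' \
    -c . -n \
    "$1"
}
```

Replacing `and` with `or` in the predicate gives an either/or match in a single template.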

Upvotes: 4

Tim Pote

Reputation: 28049

In case you only want to use basic Unix tools, here's a (stupid) little sed script that can extract an element that is either on one line or spans multiple lines:

sed -n '
/<open>[^<]*<\/open>/ {
  p
  b
}

/<open>/,/<\/open>/ {
  p
}' file.xml

Sample input:

<open>stuff</open>
<otherTag>
otherstuff
</otherTag>
<open>
morestuff
</open>
<otherTag>astlkj</otherTag>

Sample output:

<open>stuff</open>
<open>
morestuff
</open>

Not fit for production use: if a tag has multiple attributes, this method quickly becomes cumbersome and, if the XML is convoluted enough, impossible. But it ought to do for pulling out information here and there.
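Applied to the records in the question, the same range idea can be sketched in awk: buffer each record as it streams past and print it only if some line inside it matched. This assumes every <record ...> and </record> tag sits on its own line, as in the sample. The name extract_records is illustrative, and awk is swapped in for sed only because the buffering is easier to read.

```shell
# extract_records PATTERN FILE
# Hypothetical helper: print every <record>...</record> block in which at
# least one line matches PATTERN (an awk extended regular expression).
# Assumes the opening and closing record tags are not split across lines.
extract_records() {
  awk -v pat="$1" '
    /<record[ >]/ { buf = ""; inrec = 1; hit = 0 }    # a new record starts
    inrec         { buf = buf $0 "\n"                 # buffer the current line
                    if ($0 ~ pat) hit = 1 }           # remember any match
    /<\/record>/  { if (inrec && hit) printf "%s", buf
                    inrec = 0 }                       # record ends
  ' "$2"
}
```

For example, `extract_records 'name="AssetId">1234<' file.xml` would print the first record of the question's sample. Only one record is held in memory at a time, so the file size should not matter much, but the same caveats about convoluted XML apply.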

Upvotes: 0

Karl Bielefeldt

Reputation: 49118

Probably the simplest way is to use the -C option to grep, which prints the specified number of lines around each match. Yes, it won't stop exactly on a record boundary, but usually just ensuring the whole record is included is enough for my needs.
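As a rough illustration, here is how the context options behave on a small file laid out like the question's sample. The sample data and the line counts are assumptions: they only line up when every record has the same number of lines.

```shell
# Build a small sample file mirroring the question's layout (illustrative data).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<records>
 <record seq="1">
  <properties>
   <property name="AssetId">1234</property>
  </properties>
  <message>message1</message>
 </record>
 <record seq="2">
  <properties>
   <property name="VI-ID">4567</property>
  </properties>
  <message>message2</message>
 </record>
</records>
EOF

# -C 3 prints three lines of context on each side of every match, so here
# it also picks up the <records> line just before the record.
grep -C 3 'name="AssetId">1234<' "$sample"

# -B 2 -A 3 sets the leading and trailing context separately and happens to
# line up with the record boundaries in this fixed six-line layout.
grep -B 2 -A 3 'name="AssetId">1234<' "$sample"
```

When records vary in length, a larger -C value keeps the whole record in view at the cost of some extra neighbouring lines.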

Upvotes: 0
