Reputation: 8505
I have a few very large (10 GB) XML files with the following structure. As you can see, each file contains a series of records. I would like to search for records based on one or more properties. The problem is that a simple grep only gives me the lines containing the property, for example line 100, line 300, and so on. What I need is a way to extract the whole matching record element, not just the lines that matched. Are there any Unix utilities that can help?
<records>
<record seq="1">
<properties>
<property name="AssetId">1234</property>
</properties>
<message>message1</message>
</record>
<record seq="2">
<properties>
<property name="VI-ID">4567</property>
</properties>
<message>message2</message>
</record>
</records>
Upvotes: 0
Views: 4722
Reputation: 295649
xmlstarlet allows you to run XPath from shell scripts; this is a perfect use case.
For instance:
xmlstarlet sel -t \
-m '//record[properties/property[@name="AssetId"][text()="1234"]]' \
-c .
will print the entire record having an AssetId property of 1234.
If you want to do multiple matches within one pass, this is supported too:
xmlstarlet sel \
-t -m '//record[properties/property[@name="AssetId"][text()="1234"]]' \
-c . -n -n \
-t -m '//record[properties/property/@name="VI-ID"]' \
-c . -n -n \
<input.xml
This version prints every record with an AssetId of 1234 and every record that has a VI-ID property with any value, putting two newlines after each record emitted.
Upvotes: 4
Reputation: 28049
If you only want to use basic Unix tools, here's a (stupid) little sed script that can extract an element whether it sits on one line or spans several:
sed -n '
/<open>[^<]*<\/open>/ {
p
b
}
/<open>/,/<\/open>/ {
p
}' file.xml
Sample input:
<open>stuff</open>
<otherTag>
otherstuff
</otherTag>
<open>
morestuff
</open>
<otherTag>astlkj</otherTag>
Sample output:
<open>stuff</open>
<open>
morestuff
</open>
This is not fit for production use: if a tag has multiple attributes, this method quickly becomes cumbersome and, if the XML is convoluted enough, impossible. But it ought to do for pulling out information here and there.
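Applied to the record structure from the question, the same line-oriented idea can be sketched with awk instead of sed: buffer one record at a time and print it only if it matches. This is a rough sketch, not a real XML parser. The function name extract_records is made up for illustration, and it assumes every <record ...> and </record> tag sits on its own line, as in the sample.

```shell
#!/bin/sh
# extract_records is a hypothetical helper, not a standard utility.
# Assumption: every <record ...> and </record> tag sits on its own line.
# $1 is an extended regular expression, $2 is the file to search.
extract_records() {
    awk -v pat="$1" '
        /<record[ >]/ { buf = ""; inrec = 1 }   # a new record starts: reset the buffer
        inrec         { buf = buf $0 "\n" }     # collect every line of the current record
        /<\/record>/  {                         # record ends: print it only if it matched
            if (inrec && buf ~ pat) printf "%s", buf
            inrec = 0
        }
    ' "$2"
}

# Try it on a small copy of the structure from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<records>
<record seq="1">
<properties>
<property name="AssetId">1234</property>
</properties>
<message>message1</message>
</record>
<record seq="2">
<properties>
<property name="VI-ID">4567</property>
</properties>
<message>message2</message>
</record>
</records>
EOF

extract_records 'name="AssetId">1234<' "$sample"   # prints the whole seq="1" record
rm -f "$sample"
```

Only one record is held in memory at a time, so the file size should not matter much. The pattern is matched against the whole record text, so it can also hit the message body, and any backslashes in it go through awk's escape processing because it is passed with -v.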
Upvotes: 0
Reputation: 49118
Probably the simplest way is to use the -C option to grep. It will print the specified number of lines around each match. It won't stop exactly on a record boundary, but usually just ensuring the record is included is enough for my needs.
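As a rough illustration on a small copy of the question's sample (the temporary file and the context sizes are assumptions chosen to fit that sample), context options might be used like this. Note that -A, -B and -C are provided by GNU and BSD grep but are not required by POSIX.

```shell
#!/bin/sh
# Small copy of the structure from the question, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<records>
<record seq="1">
<properties>
<property name="AssetId">1234</property>
</properties>
<message>message1</message>
</record>
<record seq="2">
<properties>
<property name="VI-ID">4567</property>
</properties>
<message>message2</message>
</record>
</records>
EOF

# -C 3 prints three lines of context on each side of every matching line.
# Here that includes the <records> line, which is not part of the record.
grep -C 3 'name="AssetId">1234<' "$sample"

# -B and -A set the context before and after the match separately.
# With 2 lines before and 3 after, the output is exactly the seq="1" record
# in this sample, because the property sits on the third of its six lines.
grep -B 2 -A 3 'name="AssetId">1234<' "$sample"

rm -f "$sample"
```

The numbers only line up because every record in the sample has the same shape; with records of varying length, the context has to be generous enough to cover the longest one.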
Upvotes: 0