Reputation: 19
I have a large XML file that looks something like this:
<A>
<B id="XXX_City_Oslo">
<C>
....
</C>
</B>
<B id="XXX_City_Bergen">
<C>
....
</C>
</B>
<B id="XXX_City_Trondheim">
<C>
....
</C>
</B>
<B id="XXX_City_Stavanger">
<C>
....
</C>
</B>
<B id="1">
<C>
....
</C>
</B>
<B id="2">
<C>
....
</C>
</B>
</A>
I want to delete some of the sections, together with their content, whose id contains the string "City". The XML file is too big to list every section that should be deleted, so it is easier to define which cities should be kept. The only issue is that I also want to keep sections like "1" and "2", whose ids do not contain the string "City".
Let's say I want to keep Oslo and Stavanger, using this command:
awk '/<B.*>/ && !/id="XXX_City_Oslo"/ && !/id="XXX_City_Stavanger"/, /<\/B>/ {next} 1'
This deletes all the B sections except Oslo and Stavanger. The issue is that it also deletes the other B sections, the ones that do not contain the string "City".
Is there a simple way to delete only the cities that do not match the given input, without also deleting the sections that don't contain the string "City" at all? E.g. something like this (please note the /id="City"/):
awk '/<B.*>/ && **/id="*City*"/** && !/id="XXX_City_Oslo"/ && !/id="XXX_City_Stavanger"/, /<\/B>/ {next} 1'
Please note that this is being run in a Linux environment without many options for adding other scripting languages or libraries, so I want to stick with the same awk approach.
Thanks in advance for any contribution!
Upvotes: 0
Views: 433
Reputation: 2562
With AWK, /../ is a regexp pattern-matching expression, so you simply have to add a City filter on top of the others:
awk '/<B.*>/ && !/id="XXX_City_Oslo"/ && !/id="XXX_City_Stavanger"/ && /City/, /<\/B>/ {next} 1'
EDIT: As @EdMorton helpfully suggests in a comment, you can reduce it to:
awk '/<B.*City.*>/ && !/id="XXX_City_(Oslo|Stavanger)"/ , /<\/B>/ { next } 1'
And if you intend to use this in a production script, as @EdMorton points out, you should avoid hard-coding your tag identifiers.
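One way to avoid hard-coding the city names, as a minimal sketch only: pass the cities to keep as an alternation through -v and build the exclusion regexp in a BEGIN block. The variable names keep and keepRe and the sample data are made up for illustration, and the XXX_City_ prefix is assumed to be fixed as in the question.

```shell
# Hypothetical sample input, modelled on the structure shown in the question.
sample='<A>
<B id="XXX_City_Oslo">
<C>
</C>
</B>
<B id="XXX_City_Bergen">
<C>
</C>
</B>
<B id="XXX_City_Stavanger">
<C>
</C>
</B>
<B id="1">
<C>
</C>
</B>
<B id="2">
<C>
</C>
</B>
</A>'

# Cities to keep, written as a regexp alternation (illustrative variable name).
keep='Oslo|Stavanger'

result=$(printf '%s\n' "$sample" | awk -v keep="$keep" '
    # Build the "keep" regexp once from the -v variable.
    BEGIN { keepRe = "id=\"XXX_City_(" keep ")\"" }
    # Skip every City block whose opening tag does not match keepRe.
    /<B.*City.*>/ && $0 !~ keepRe, /<\/B>/ { next }
    # Print everything else, including the non-City blocks.
    1
')

printf '%s\n' "$result"
```

Because keep is used as a regexp, a city name containing regexp metacharacters would need escaping; the next answer sidesteps that by comparing ids as plain strings.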
Upvotes: 1
Reputation: 203209
$ cat tst.awk
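BEGIN {
    # tgts is a comma-separated list of cities to keep, passed with -v
    split(tgts,tmp,/,/)
    for (i in tmp) {
        goodCities["XXX_City_"tmp[i]]
    }
    FS = "\""    # split on double quotes so $2 is the value of the id attribute
    inGoodBlock = 1
}
/^[[:space:]]*<B[[:space:]]*id="/ {
    # drop a block only if it is a City block that is not in the keep list
    inGoodBlock = ( ($2 ~ /_City_/) && !($2 in goodCities) ? 0 : 1 )
}
inGoodBlock
/^[[:space:]]*<\/B>/ {
    # after a closing tag, go back to printing until the next opening tag
    inGoodBlock = 1
}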
$ awk -v tgts='Oslo,Stavanger' -f tst.awk file
<A>
<B id="XXX_City_Oslo">
<C>
....
</C>
</B>
<B id="XXX_City_Stavanger">
<C>
....
</C>
</B>
<B id="1">
<C>
....
</C>
</B>
<B id="2">
<C>
....
</C>
</B>
</A>
$ awk -v tgts='Trondheim' -f tst.awk file
<A>
<B id="XXX_City_Trondheim">
<C>
....
</C>
</B>
<B id="1">
<C>
....
</C>
</B>
<B id="2">
<C>
....
</C>
</B>
</A>
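The script above relies on two awk features that are easy to miss: setting FS to a double quote makes $2 the value of the id attribute, and the in operator tests that value against the array keys as a plain string, not as a regexp. The following is a small sketch that isolates just those two steps; the array name good and the two input lines are illustrative only.

```shell
# Two hypothetical opening tags, one to keep and one to drop.
out=$(printf '%s\n' '<B id="XXX_City_Oslo">' '<B id="XXX_City_Bergen">' |
awk -v tgts='Oslo,Stavanger' '
BEGIN {
    # Turn the comma-separated list into array keys.
    n = split(tgts, tmp, /,/)
    for (i = 1; i <= n; i++) good["XXX_City_" tmp[i]]
    # Split each line on double quotes: $1 is <B id=, $2 is the id value.
    FS = "\""
}
# Report the extracted id and whether it is one of the keys.
{ print $2, (($2 in good) ? "keep" : "drop") }
')

printf '%s\n' "$out"
# prints:
# XXX_City_Oslo keep
# XXX_City_Bergen drop
```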
Upvotes: 0