SHR

Reputation: 19

Remove sections of an XML file that do not match criteria, using bash

I have a large XML file that looks something like this:

<A>
    <B id="XXX_City_Oslo">
        <C>
        ....
        </C>
    </B>
    <B id="XXX_City_Bergen">
        <C>
        ....
        </C>
    </B>
    <B id="XXX_City_Trondheim">
        <C>
        ....
        </C>
    </B>
    <B id="XXX_City_Stavanger">
        <C>
        ....
        </C>
    </B>
    <B id="1">
        <C>
        ....
        </C>
    </B>
    <B id="2">
        <C>
        ....
        </C>
    </B>

</A>

I want to delete some of the sections, together with their content, whose id contains the string "City". The XML file is too big to list every section that should be deleted, so it is easier to define which cities should be kept. The only issue is sections like "1" and "2", which I also want to keep. These sections do not contain the string "City".

Let's say I want to keep Oslo and Stavanger, using this command:

awk '/<B.*>/ && !/id="XXX_City_Oslo"/ && !/id="XXX_City_Stavanger"/, /<\/B>/ {next} 1'

This deletes all the B sections except Oslo and Stavanger. The issue is that it also deletes the other B sections that do not contain the string "City". Is there a simple way to delete only the cities that do not match the given input, without deleting the sections that don't contain the string "City" at all? E.g. something like this (please note the /id="City"/):

awk '/<B.*>/ && **/id="*City*"/** && !/id="XXX_City_Oslo"/ && !/id="XXX_City_Stavanger"/, /<\/B>/ {next} 1'

Please note that this runs in a Linux environment without many options for adding other scripting languages or libraries, and I want to keep the same awk-based approach.

Thanks in advance for any contribution!

Upvotes: 0

Views: 433

Answers (2)

Zilog80

Reputation: 2562

In awk, /.../ is a regular-expression match against the current record (by default, the current line).

So you simply have to add a City filter on top of the others:

awk '/<B.*>/ && !/id="XXX_City_Oslo"/  && !/id="XXX_City_Stavanger"/ && /City/, /<\/B>/ {next} 1'

EDIT: As @EdMorton helpfully suggests in a comment, you can reduce it to:

awk '/<B.*City.*>/ && !/id="XXX_City_(Oslo|Stavanger)"/ , /<\/B>/ { next } 1'

And if you intend to use this in a production script, as @EdMorton points out, you should avoid hard-coding your tag identifiers.
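
One possible way to avoid the hard-coding, as a minimal sketch only: pass the list of cities to keep through an awk variable and build the regexp from it. The helper name filter_cities, the variable keep and the file sample.xml are my own assumptions for illustration. The XXX_City_ prefix is still written into the pattern, so adjust it to your real ids.

```shell
# Hypothetical sample input: a trimmed version of the file from the question.
cat > sample.xml <<'EOF'
<A>
    <B id="XXX_City_Oslo">
        <C>
        </C>
    </B>
    <B id="XXX_City_Bergen">
        <C>
        </C>
    </B>
    <B id="1">
        <C>
        </C>
    </B>
</A>
EOF

# filter_cities is a hypothetical helper, not part of the commands above.
#   $1: |-separated list of city names to keep
#   $2: input file
filter_cities() {
    awk -v keep="$1" '
        # Range start: a <B ...> line that mentions City but whose id is
        # not on the keep list. Range end: the closing </B> line.
        # Every line inside the range is skipped.
        /<B.*City.*>/ && $0 !~ ("id=\"XXX_City_(" keep ")\"") , /<\/B>/ { next }
        # Everything else is printed unchanged.
        1
    ' "$2"
}

filter_cities 'Oslo|Stavanger' sample.xml
```

With the sample above this should drop the Bergen block and leave the Oslo block and the B section with id "1" in place. Because keep is spliced into a regexp, city names containing regexp metacharacters would need escaping first.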

Upvotes: 1

Ed Morton

Reputation: 203209

$ cat tst.awk
BEGIN {
    split(tgts,tmp,/,/)
    for (i in tmp) {
        goodCities["XXX_City_"tmp[i]]
    }
    FS = "\""
    inGoodBlock = 1
}
/^[[:space:]]*<B[[:space:]]*id="/ {
    inGoodBlock = ( ($2 ~ /_City_/) && !($2 in goodCities) ? 0 : 1 )
}
inGoodBlock
/^[[:space:]]*<\/B>/ {
    inGoodBlock = 1
}

$ awk -v tgts='Oslo,Stavanger' -f tst.awk file
<A>
    <B id="XXX_City_Oslo">
        <C>
        ....
        </C>
    </B>
    <B id="XXX_City_Stavanger">
        <C>
        ....
        </C>
    </B>
    <B id="1">
        <C>
        ....
        </C>
    </B>
    <B id="2">
        <C>
        ....
        </C>
    </B>

</A>

$ awk -v tgts='Trondheim' -f tst.awk file
<A>
    <B id="XXX_City_Trondheim">
        <C>
        ....
        </C>
    </B>
    <B id="1">
        <C>
        ....
        </C>
    </B>
    <B id="2">
        <C>
        ....
        </C>
    </B>

</A>

Upvotes: 0
