SHR

Reputation: 19

Remove sections of a big XML file that do not match criteria, using bash

I have a big XML file (2GB) with a lot of parent and child sections, looking something like this:

<A>
    <B id="1">
        <C>
        ....
        </C>
    </B>
    <B id="2">
        <C>
        ....
        </C>
    </B>
    <B id="3">
        <C>
        ....
        </C>
    </B>
    <B id="4">
        <C>
        ....
        </C>
    </B>
</A>

Removing a given section (and its child sections) by ID is easy. For example, this deletes the section where B id = 1:

awk '/<B id="1">+$/,/<\/B>+$/{next}1' oldfile.xml > newfile.xml

But how can I do this the other way around, that is, delete all the B sections (and their child sections) that don't match the given IDs? Something like this pseudocode would be great (note the != instead of =):

awk '/<B id!="1" or id!="2">+$/,/<\/B>+$/{next}1' oldfile.xml > newfile.xml

This should delete B id=3, B id=4, and so on.

Please note that this is being run in a Linux environment without many options for adding other scripting languages or libraries.

Upvotes: 0

Views: 51

Answers (1)

Socowi

Reputation: 27215

The best way to work on your XML file would be with something like XPath. However, if you want to stick with your current approach and know that the file is always formatted as in your question, there is an easy way to adapt the command.

You can combine multiple checks using &&. To test for a non-match, use !/regex/. Both of these work even in ranges like /start/,/end/:

awk '/<B.*>/ && !/id="1"/ , /<\/B>/ {next} 1' oldfile.xml > newfile.xml
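To keep more than one ID, as in the question's example with IDs 1 and 2, the non-match test can use an alternation inside the regex. Here is a minimal sketch, assuming every `<B ...>` and `</B>` tag sits on its own line as in the sample; the file names `sample.xml` and `filtered.xml` are made up for the demonstration:

```shell
# Build a small sample shaped like the file in the question.
# Assumption: each <B ...> and </B> tag is on its own line.
cat > sample.xml <<'EOF'
<A>
    <B id="1">
        <C>
        one
        </C>
    </B>
    <B id="2">
        <C>
        two
        </C>
    </B>
    <B id="3">
        <C>
        three
        </C>
    </B>
    <B id="4">
        <C>
        four
        </C>
    </B>
</A>
EOF

# A range starts on a <B ...> line whose id is neither 1 nor 2 and ends
# on the next </B> line; every line in such a range is skipped.
# The trailing 1 prints all remaining lines unchanged.
awk '/<B .*>/ && !/id="(1|2)"/ , /<\/B>/ {next} 1' sample.xml > filtered.xml

cat filtered.xml
```

For the real file, replace `sample.xml` with `oldfile.xml` and write to `newfile.xml`. awk reads its input line by line, so the 2GB size should not be a memory problem, but the pattern depends entirely on the layout assumption above.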

Upvotes: 1
