Remove sections of a big XML file that do not match a criterion, using bash

Question

I have a big XML file (2 GB) with a lot of parent and child sections, looking something like this:

Removing a given section (and its child sections) by a given ID can easily be done like this: awk '/<B id="1">+$/,/<\/B>+$/{next}1' oldfile.xml > newfile.xml # This deletes the section where B ID = 1

But how can I do this the other way around, that is, delete all the B sections (and their child sections) that don't match the given ID? Something like this would be great (note the != instead of =): awk '/<B id!="1">+$/,/<\/B>+$/{next}1' oldfile.xml > newfile.xml # This should then delete B ID = 3, B ID = 4 and so on.

Please note that this runs in a Linux environment without many options for adding other scripting languages or libraries.

Socowi · Accepted Answer

The best way to work on your XML file would be something like XPath. However, if you want to stick to your current approach and know that the file is always formatted as in your question, then there is an easy way to adapt the command.

You can combine multiple checks using &&. To test for a non-match, use !/regex/. Both of these work even in ranges like /start/,/end/:

awk '/<B / && !/id="1"/ , /<\/B>/ {next} 1'
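To see how the negated range behaves, here is a minimal sketch run against a small made-up file. The sample contents, the lowercase id attribute, and the temporary directory are assumptions for illustration only. It also assumes that every opening and closing B tag sits on its own line and that B sections are never nested inside each other.

```shell
#!/bin/sh
# Hypothetical sample input; the real file layout is an assumption.
workdir=$(mktemp -d)
cat > "$workdir/oldfile.xml" <<'EOF'
<A>
  <B id="1">
    <C>keep me</C>
  </B>
  <B id="3">
    <C>drop me</C>
  </B>
  <B id="4">
    <C>drop me too</C>
  </B>
</A>
EOF

# Range start: a line that opens a B section AND does not contain id="1".
# Range end: the next line containing </B>.
# Lines inside the range are skipped with next; the trailing 1 prints the rest.
awk '/<B / && !/id="1"/ , /<\/B>/ {next} 1' \
    "$workdir/oldfile.xml" > "$workdir/newfile.xml"

cat "$workdir/newfile.xml"
```

With this input, only the A wrapper and the B section with id="1" should remain in newfile.xml. Because awk reads the file line by line, this should also cope with a 2 GB input without loading it into memory, but it stays a text-level trick and will misbehave if the formatting differs from the assumptions above.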

