Reputation: 5
I have a big XML file (2 GB) that contains a lot of useless data that needs to be filtered out. Below is the rough structure of the XML file:
(All the useless data has been replaced by "useless_information" to keep it clean and tidy.)
<hmdb>
<metabolite>
<useless_information></useless_information>
<useless_information></useless_information>
<useless_information></useless_information>
<useless_information></useless_information>
...
<normal_concentrations>
<useless_information></useless_information>
<useless_information></useless_information>
<useless_information></useless_information>
...
<concentration>
<useless_information></useless_information>
<useless_information></useless_information>
<useless_information></useless_information>
<useless_information></useless_information>
...
<concentration_value> 100 </concentration_value>
<subject_age> 21 </subject_age>
<subject_sex> male </subject_sex>
</concentration>
<concentration></concentration>
<concentration></concentration>
<concentration></concentration>
...
</normal_concentrations>
</metabolite>
<metabolite></metabolite>
<metabolite></metabolite>
<metabolite></metabolite>
<metabolite></metabolite>
...
</hmdb>
So basically I would like to keep the following tags and their values: concentration_value, subject_age and subject_sex. The rest is not important and can be filtered out. After filtering, the XML should look like this:
<hmdb>
<metabolite>
<concentration>
<concentration_value> 100 </concentration_value>
<subject_age> 21 </subject_age>
<subject_sex> male </subject_sex>
</concentration>
<concentration></concentration>
<concentration></concentration>
<concentration></concentration>
...
</metabolite>
<metabolite></metabolite>
<metabolite></metabolite>
<metabolite></metabolite>
<metabolite></metabolite>
...
</hmdb>
I need the data in this file to continue my study. The file is too big for my laptop to open, so I have to filter out the useless data to reduce its size before I can use it. However, I don't know how to write a Perl script. I would really appreciate your help, thank you so much :)
Upvotes: 0
Views: 101
Reputation: 705
Assuming the sample of your data is representative (that is, every useless_information opening tag has its matching closing tag on the same line) and assuming your input data is in a file called input-data.xml, the following one-line Perl program should work. I tested it with your sample. At the bash command line, type this (in Windows cmd.exe, put double quotes around the program instead of single quotes):
perl -ne 'print unless /useless_information/' input-data.xml > output-data.xml
This little one-line program drops every line that contains "useless_information" and assumes that the matching closing tag is always on the same line as its opening tag.
However, since I suspect that there may be several different useless tags you want to ignore, it might be more effective to filter for what you want instead of what you don't want:
perl -ne 'print if m{</?(?:hmdb|metabolite|concentration|concentration_value|subject_age|subject_sex)[\s>]}' input-data.xml > output-data.xml
The pattern matches whole tag names only, so the normal_concentrations wrapper is dropped (as in your desired output) while concentration itself is kept. Both commands assume that you have Perl installed and that the perl executable is in a directory listed in your PATH environment variable.
Now, if you find that a closing tag is sometimes not on the same line as its opening tag, then we will have to get a little fancier.
HTH!
Upvotes: 2