PCs

Reputation: 5

How to use Perl to filter an XML file by tags?

I have a large XML file (2 GB) that contains a lot of useless data that needs to be filtered out. Below is the rough structure of the file:

(All the useless data has been replaced by "useless_information" to keep it clean and tidy.)

<hmdb>
    <metabolite>
        <useless_information></useless_information>
        <useless_information></useless_information>
        <useless_information></useless_information>
        <useless_information></useless_information>
        ...
        <normal_concentrations>
            <useless_information></useless_information>
            <useless_information></useless_information>
            <useless_information></useless_information>
            ...
            <concentration>
                <useless_information></useless_information>
                <useless_information></useless_information>
                <useless_information></useless_information>
                <useless_information></useless_information>
                ...
                <concentration_value> 100 </concentration_value>
                <subject_age> 21 </subject_age>
                <subject_sex> male </subject_sex>
            </concentration>
            <concentration></concentration>
            <concentration></concentration>
            <concentration></concentration>
            ...
        </normal_concentrations>
    </metabolite>
    <metabolite></metabolite>
    <metabolite></metabolite>
    <metabolite></metabolite>
    <metabolite></metabolite>
    ...
</hmdb>

So, basically, I would like to keep the following tags and their values: concentration_value, subject_age and subject_sex. The rest is not important and can be filtered out. After filtering, the XML should look like this:

<hmdb>
    <metabolite>
        <concentration>
            <concentration_value> 100 </concentration_value>
            <subject_age> 21 </subject_age>
            <subject_sex> male </subject_sex>
        </concentration>
        <concentration></concentration>
        <concentration></concentration>
        <concentration></concentration>
        ...
    </metabolite>
    <metabolite></metabolite>
    <metabolite></metabolite>
    <metabolite></metabolite>
    <metabolite></metabolite>
    ...
</hmdb>

I need the data in this file to continue my study. The file is too big for my laptop to open, so I have to filter out the useless data to reduce its size before I can use it. I don't know how to write a Perl script, though, so I would really appreciate your help. Thank you so much :)

Upvotes: 0

Views: 101

Answers (2)

daxim

Reputation: 39158

file contains too many useless data that need to be filtered

http://p3rl.org/xml_grep

Upvotes: 0

Siegfried

Reputation: 705

Assuming the sample you posted is representative (that is, every useless_information element opens and closes on the same line) and your input data is in a file called input-data.xml, the following Perl one-liner should work. I tested it with your sample. At the bash command line, type the command below (in Windows cmd.exe, wrap the program in double quotes instead of single quotes):

perl -ne 'print unless /useless_information/' <input-data.xml >output-data.xml

This one-liner drops every line that contains "useless_information". It assumes that the closing tag of each such element is always on the same line as its opening tag.

However, since I suspect there are many different useless tags you want to ignore, it might be more effective to filter for what you want instead of what you don't want.

perl -ne 'print if /metabolite|normal_concentrations|concentration_value|subject_age|subject_sex|concentration/' <input-data.xml >output-data.xml

This also assumes that you have Perl installed and that the perl executable is on your PATH.

If it turns out that an element's closing tag is sometimes not on the same line as its opening tag, then we will have to get a little fancier.
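For what "a little fancier" might look like, here is a rough sketch that reads one metabolite record at a time instead of one line at a time, so elements may span several lines. It is still regex-based rather than a real XML parser, and it assumes that the concentration tags carry no attributes and are not nested inside each other. The names multiline.xml, filter.pl and multiline-filtered.xml are placeholders.

```shell
# Sample in which some elements span several lines (hypothetical layout).
cat > multiline.xml <<'EOF'
<hmdb>
<metabolite>
  <useless_information>
    spread over
    several lines
  </useless_information>
  <normal_concentrations>
    <concentration>
      <concentration_value>
        100
      </concentration_value>
      <subject_age> 21 </subject_age>
      <subject_sex> male </subject_sex>
    </concentration>
  </normal_concentrations>
</metabolite>
</hmdb>
EOF

cat > filter.pl <<'EOF'
use strict;
use warnings;

# Treat "</metabolite>" as the record separator, so each read returns one
# whole metabolite and memory use is bounded by the largest single record.
$/ = "</metabolite>";

print "<hmdb>\n";
while (my $record = <STDIN>) {
    next unless $record =~ /<metabolite\b/;   # skips the trailing </hmdb> chunk
    print "  <metabolite>\n";
    # The /s modifier lets . match newlines, so elements may span lines.
    while ($record =~ m{<concentration>(.*?)</concentration>}sg) {
        my $body = $1;
        print "    <concentration>\n";
        for my $tag (qw(concentration_value subject_age subject_sex)) {
            next unless $body =~ m{<$tag>\s*(.*?)\s*</$tag>}s;
            print "      <$tag>$1</$tag>\n";   # value with outer whitespace trimmed
        }
        print "    </concentration>\n";
    }
    print "  </metabolite>\n";
}
print "</hmdb>\n";
EOF

perl filter.pl < multiline.xml > multiline-filtered.xml
cat multiline-filtered.xml
```

Note that this version rewrites the output instead of copying lines through, so whitespace around the values is trimmed and the indentation is regenerated. If the real file has attributes on these tags, namespaces, or CDATA sections, a streaming XML parser would be the safer choice.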

HTH!

Upvotes: 2
