biji

Reputation: 59

How to remove duplicate nodes from an XML file using Perl

I am creating one XML file from multiple files, and I need to remove duplicate nodes from the output XML. I have a script like this to generate the new XML file:

 #!/usr/bin/perl
 use warnings;
 use strict;
 use XML::LibXML;
 use Carp;
 use File::Find;
 use File::Spec::Functions qw( canonpath );
 use XML::LibXML::Reader;
 use Digest::MD5 'md5';

 if ( @ARGV == 0 ) {
     push @ARGV, "c:/main/sav";
     warn "Using default path $ARGV[0]\n  Usage: $0  path ...\n";
 }

 open( my $allxml, '>', "combined.xml" )
     or die "can't open output xml file for writing: $!\n";
 print $allxml '<?xml version="1.0" encoding="UTF-8"?>',
  "\n<Datainfo xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\">\n";
 my %extract_md5;
 find(
      sub {
          return unless ( /(_str\.xml)$/ and -f );
          extract_information();
          return;
      },
      @ARGV
     );

 print $allxml "</Datainfo>\n";

 sub extract_information {
     my $path = $_;
     if ( my $reader = XML::LibXML::Reader->new( location => $path )) {
         while ( $reader->nextElement( 'Data' )) {
             my $elem = $reader->readOuterXml();
             my $md5 = md5( $elem );
             print $allxml $elem unless ( $extract_md5{$md5}++ );
         }

     }
     return;
 }

But the above script prints an XML file like this:

combined.xml:

<?xml version="1.0" encoding="UTF-8"?>
<Datainfo xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <data>
        <test>22</test>
        <info>sensor value</info>
        <sensor>
            <sensor value="23" temp="25"/>
        </sensor>
    </data>
    <data>
        <test>23</test>
        <info>sensor value</info>
        <sensor>
            <sensor value="24" temp="27"/>
        </sensor>
    </data>
    <data>
        <test>22</test>
        <info>sensor value</info>
        <sensor>
            <sensor value="22" temp="26"/>
        </sensor>
    </data>
</Datainfo>

In the above XML file, the data element with test value 22 appears twice. I need to use <test> as the key to search on: if the same test number is found again, I need to delete that entire node, whatever other information it contains. I tried doing this with MD5, but that only removes nodes that are exact duplicates across all the XML files. Now I need to check one specific element and delete the entire node if a duplicate occurs. Please help me with this problem.

The output should look like this:

combined.xml:

<?xml version="1.0" encoding="UTF-8"?>
<Datainfo xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <data>
        <test>22</test>
        <info>sensor value</info>
        <sensor>
            <sensor value="23" temp="25"/>
        </sensor>
    </data>
    <data>
        <test>23</test>
        <info>sensor value</info>
        <sensor>
            <sensor value="24" temp="27"/>
        </sensor>
    </data>
</Datainfo>

Upvotes: 0

Views: 1512

Answers (2)

David W.

Reputation: 107060

I normally use XML::Simple for things like this.

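XML::Simple reads your XML file into a nested hash/array structure. Because hash keys are unique, folding the <data> elements into a hash keyed on <test> (for example via the KeyAttr option) would collapse the duplicates you're finding, depending on how you configure it.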

Upvotes: 1

DVK

Reputation: 129481

You will have to do the duplicate checking by specifically checking the contents of <test>, instead of the MD5 of the entire node.

E.g. instead of my $md5 = md5( $elem ); and storing $md5 key in the hash, you need to extract the contents of <test> tag and store that.

I would prefer not to provide more details, since you seem to be simply spamming SO as well as PerlMonks with requests to do your work for you, and copying/pasting somewhat complicated code without bothering to understand how it works.

http://www.perlmonks.org/?node_id=939272
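
For readers who land here later, the idea above (key the "seen" hash on the contents of <test> rather than on an MD5 of the whole node) could be sketched roughly as follows. This is only a minimal sketch under some assumptions: the helper name unique_data_elements is hypothetical, the element is taken to be lowercase <data> as in the sample output, and the first occurrence of each <test> value is the one kept. It compares the <test> text exactly, so values that differ only in whitespace would count as different.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::Reader;

# Hypothetical helper: returns the serialized <data> elements whose
# <test> value has not been seen before. The %$seen hash is passed in
# by reference so it can be shared across several input documents.
sub unique_data_elements {
    my ( $xml_string, $seen ) = @_;
    my @kept;

    my $reader = XML::LibXML::Reader->new( string => $xml_string )
        or die "cannot create XML reader\n";

    # nextElement returns 1 when a matching element is found.
    while ( $reader->nextElement('data') == 1 ) {
        # Deep-copy the current element into a DOM node so that
        # XPath can be evaluated against it.
        my $node = $reader->copyCurrentNode(1);

        # Text content of the <test> child (assumed to be a direct child).
        my $test = $node->findvalue('test');

        # Keep only the first <data> seen for each <test> value.
        push @kept, $node->toString unless $seen->{$test}++;
    }
    return @kept;
}
```

In the script from the question, the same change would amount to using the <info>a</info></data>'
  . '<data><test>23</test><info>b</info></data>'
  . '<data><test>22</test><info>c</info></data></Datainfo>',
    \%seen
);
die "expected 2 kept elements, got " . scalar(@kept) unless @kept == 2;
die "first kept element wrong"  unless $kept[0] =~ m{<info>a</info>};
die "second kept element wrong" unless $kept[1] =~ m{<test>23</test>};
my @more = unique_data_elements(
    '<Datainfo><data><test>23</test></data><data><test>24</test></data></Datainfo>',
    \%seen
);
die "expected 1 kept element from second document" unless @more == 1;
die "wrong element kept from second document" unless $more[0] =~ m{<test>24</test>};
</test>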

Upvotes: 0
