ChuckMac

Reputation: 432

Sorting & Merging XML Documents with Perl / XML::Twig

I have many XML files in a directory that need to be sorted and merged into one file. The files are formatted as follows:

File1.xml:

<?xml version="1.0" encoding="utf-8"?>
<doctypea>
  <header someattr="1">
    <docnumber>111</docnumber>
  </header>
</doctypea>

File2.xml:

<?xml version="1.0" encoding="utf-8"?>
<doctypea>
  <header someattr="1">
    <docnumber>112</docnumber>
  </header>
</doctypea>

File3.xml:

<?xml version="1.0" encoding="utf-8"?>
<doctypeb>
  <header someattr="1">
    <docnumber>111</docnumber>
  </header>
</doctypeb>

File4.xml:

<?xml version="1.0" encoding="utf-8"?>
<doctypeb>
  <header someattr="1">
    <docnumber>112</docnumber>
  </header>
</doctypeb>

All the files in this directory need to be sorted on the following criteria:

  1. docnumber
  2. doctype (a or b)

Then they need to be merged, so the output file should look like:

<?xml version="1.0" encoding="utf-8"?>
<doctypea>
  <header someattr="1">
    <docnumber>111</docnumber>
  </header>
</doctypea>
<doctypeb>
  <header someattr="1">
    <docnumber>111</docnumber>
  </header>
</doctypeb>
<doctypea>
  <header someattr="1">
    <docnumber>112</docnumber>
  </header>
</doctypea>
<doctypeb>
  <header someattr="1">
    <docnumber>112</docnumber>
  </header>
</doctypeb>

To accomplish this I am trying to use XML::Twig in Perl. I have the following code so far:

use XML::Twig;

my $xmldir = "/xmlfiles";
my $parser = XML::Twig->new(pretty_print => 'indented');

opendir(DIR, "$xmldir");
my @FILES= readdir(DIR);
closedir(DIR);

foreach (@FILES) {
        if ($_ ne "." && $_ ne "..") {
                print "reading file: $xmldir/$_\n";
                $parser->parsefile("$xmldir/$_");
        }
}

At this point I cannot seem to figure out the correct syntax to get the elements I want from the parser.

1. How do I get the value of the root element ("doctypea" or "doctypeb")?

2. I assume I need the answer to (1) in order to navigate down to the docnumber field?

My plan then is to build some kind of hash keyed on something like doctype%number in order to sort, but I am not sure of the easiest way to merge the files with that.

I'd appreciate any advice!

Upvotes: 1

Views: 1337

Answers (2)

Jeff Burdges

Reputation: 4261

As daxim noticed, your files aren't valid XML, but you could process them using regular expressions. If the files aren't too big, you could slurp the files into individual strings which you sort based upon their contents.

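use strict;
use warnings;
use File::Slurp qw( read_dir read_file );

my $xmldir = ".";
my %files = map {
        my $xml = read_file("$xmldir/$_");     # read_dir returns names, not contents
        $xml =~ s/^.*$//m;                     # blank out the first line (the XML declaration)
        my ($type) = $xml =~ /<doctype([ab])>/;
        my ($num)  = $xml =~ m{<docnumber>(\d+)</docnumber>};
        # numeric sort key: docnumber first, then doctype (a=0, b=1)
        (10 * $num + ord($type) - ord('a')) => $xml
    } grep { /\.xml$/ } read_dir($xmldir);
print join("", map { $files{$_} } sort { $a <=> $b } keys %files);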

I have not debugged this code. Note that print join("", values %files); would not keep the sort order, because Perl hashes return their values in no particular order.
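
Packing both criteria into one numeric key breaks down if two files share the same doctype and docnumber (the later one overwrites the earlier hash entry) or if there are more than ten doctypes. A minimal sketch of an alternative, assuming the doctype letter and docnumber have already been extracted into a list of records (the @docs data here is made up for illustration), is to sort with a two-level comparator:

```perl
use strict;
use warnings;

# Hypothetical records standing in for values extracted from the files,
# deliberately listed out of order.
my @docs = (
    { type => 'b', number => 112 },
    { type => 'a', number => 111 },
    { type => 'b', number => 111 },
    { type => 'a', number => 112 },
);

# Compare docnumber numerically first; fall back to doctype only on a tie.
my @sorted = sort {
    $a->{number} <=> $b->{number}
        or $a->{type} cmp $b->{type}
} @docs;

print "$_->{number} doctype$_->{type}\n" for @sorted;
```

This prints the 111 documents (a, then b) followed by the 112 documents (a, then b), which is the order asked for in the question.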

Upvotes: 1

bvr

Reputation: 9697

Below is a small example that should get you started. It shows how to get data from an XML file similar to yours (I fixed the tags to match and quoted the someattr value to get valid XML). You can use a similar approach to gather the data you need and produce the output.

use XML::Twig;

XML::Twig->new(twig_handlers => {
    '/*'        => sub { print $_->gi;           },     # doctypea
    'docnumber' => sub { print $_->trimmed_text; },     # 111
})->parse(\*DATA);    # use parsefile('xxx.xml') to parse a file

__DATA__
<?xml version="1.0" encoding="utf-8"?>
<doctypea>
  <header someattr="1">
    <docnumber>111</docnumber>
  </header>
</doctypea>
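
Building on that, here is a rough sketch of the whole sort-and-merge step. It is not tested against your real files: the @sources strings are stand-ins for the contents of File1.xml to File4.xml, and the record layout (type, number, xml) is just something I made up for the example. It reads the root tag with tag (the same method as gi), finds docnumber with first_descendant, and keeps each root element as a string via sprint.

```perl
use strict;
use warnings;
use XML::Twig;

# Stand-ins for the contents of the four files, deliberately out of order.
my @sources = map {
    my ($type, $number) = @$_;
    qq{<?xml version="1.0" encoding="utf-8"?>\n}
  . qq{<doctype$type><header someattr="1"><docnumber>$number</docnumber></header></doctype$type>}
} ( ['b', 112], ['a', 111], ['b', 111], ['a', 112] );

my @docs;
for my $source (@sources) {
    my $twig = XML::Twig->new(pretty_print => 'indented');
    $twig->parse($source);    # for real files: $twig->parsefile("$xmldir/$name")
    my $root = $twig->root;

    # Serialize the root element only (no XML declaration) and
    # normalize it to exactly one trailing newline.
    (my $xml = $root->sprint) =~ s/\s*\z/\n/;

    push @docs, {
        type   => $root->tag,                                  # e.g. "doctypea"
        number => $root->first_descendant('docnumber')->text,  # e.g. "111"
        xml    => $xml,
    };
}

# docnumber first (numeric), then doctype (string).
my @sorted = sort {
    $a->{number} <=> $b->{number} or $a->{type} cmp $b->{type}
} @docs;

# One declaration at the top, then the documents in sorted order.
print qq{<?xml version="1.0" encoding="utf-8"?>\n};
print $_->{xml} for @sorted;
```

Keep in mind that a file with several top-level elements is not well-formed XML on its own. If another XML parser has to read the merged file, you would need to wrap the documents in a single root element.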

Upvotes: 5
