Reputation: 2662
I have an original.xml
file structured like so:
<root>
<parent attr1="val1" attr2="val2" ... >
<child key1="val3" key2="val4" ... />
<child key1="val5" key2="val6" ... />
...
</parent>
...
<parent attr1="val7" attr2="val8" ... />
...
</root>
For each parent
node of this file I need to create a new file, name it according to the value of attr1
attribute and insert this data:
<newroot>
<newparent attr1="val1" attr2="val2" ... />
</newroot>
For that, I'm using this perl one-line command:
perl -p -i -e 'open(F, ">", "new/".($1).".xml") if /<parent attr1="(.*)" attr2="(.*)" ... /; print {F} "<newroot><newparent attr1=\"".($1)."\" attr2=\"".($2)."\" /></newroot>";' "original.xml"
This works good enough for all the original parent
s with no child
ren. But for each parent
that has child
ren (let's say the first parent
in my example), it multiplies the output for as many rows as the parent
has, like so:
<newroot>
<newparent attr1="val1" attr2="val2" ... />
</newroot>
<newroot>
<newparent attr1="" attr2="" ... />
</newroot>
<newroot>
<newparent attr1="" attr2="" ... />
</newroot>
...
<newroot>
<newparent attr1="" attr2="" ... />
</newroot>
I don't quite understand why this happens. How do I make my perl command output only one newroot
element with the needed data?
Upvotes: 0
Views: 112
Reputation: 167516
I would suggest to use XSLT to solve that, for instance LibXSLT has support for exsl:document
(see http://exslt.org/exsl/elements/document/index.html) and that way you can write an XSLT stylesheet doing
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:exsl="http://exslt.org/common"
extension-element-prefixes="exsl"
exclude-result-prefixes="exsl"
version="1.0">
<xsl:template match="/">
<xsl:apply-templates select="root/parent[@attr1]" mode="new"/>
</xsl:template>
<xsl:template match="parent" mode="new">
<xsl:message>Writing file <xsl:value-of select="@attr1"/></xsl:message>
<exsl:document href="{@attr1}.xml" method="xml" indent="yes">
<newroot>
<xsl:apply-templates select="."/>
</newroot>
</exsl:document>
</xsl:template>
<xsl:template match="parent">
<newparent>
<xsl:copy-of select="@*"/>
</newparent>
</xsl:template>
</xsl:stylesheet>
which splits the XML input document, creating one result file for each root/parent
element with an attr1
attribute, creating a new root in those files named newroot
, transforming the parent
element to newparent
and copying its attributes.
You can run a stylesheet in Perl using
use XML::LibXSLT;
use XML::LibXML;
my $xslt = XML::LibXSLT->new();
my $source = XML::LibXML->load_xml(location => 'original.xml');
my $style_doc = XML::LibXML->load_xml(location => 'sheet1.xsl');
my $stylesheet = $xslt->parse_stylesheet($style_doc);
my $results = $stylesheet->transform($source);
print $stylesheet->output_as_bytes($results);
As you have indicated in your comment that you have run into memory and/or performance problems, here is an alternative approach making use of LibXML::Reader
which is a forwards reading pull parser and does not load the complete XML into an in memory tree structure:
use strict;
use warnings;
use XML::LibXML::Reader;
use XML::LibXML;
my $reader = XML::LibXML::Reader->new(location => "input.xml")
or die "cannot read file.xml\n";
while ($reader->read) {
processNode($reader);
}
sub processNode {
my $reader = shift;
if ($reader->nodeType == XML_READER_TYPE_ELEMENT && $reader->name eq "parent")
{
my $clone = $reader->copyCurrentNode(0);
$clone->setName('newparent');
my $doc = XML::LibXML::Document->new( );
$doc->setDocumentElement($doc->createElement('newroot'));
$doc->documentElement()->appendChild($clone);
my $filePrefix = $clone->getAttribute('attr1');
my $fileName = "$filePrefix-result.xml";
print "Writing file $fileName.\n";
$doc->toFile($fileName, 1);
}
}
It should avoid the memory problems, I hope.
Upvotes: 1
Reputation: 53478
Don't use a regex. Use a parser. I quite like XML::Twig
. (XML::LibXML
is pretty good too).
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig->new->parsefile('your_source.xml');
foreach my $parent_elt ( $twig->findnodes('//parent') ) {
#remove children if that's what you want?
$_->delete for $parent_elt->children();
my $newname = $parent_elt->att('attr1');
print "Opening:", $newname, "\n";
my $new_doc = XML::Twig->new->set_root( XML::Twig::Elt->new('newroot') );
$parent_elt->cut;
$parent_elt->paste( $new_doc->root );
$new_doc -> set_pretty_print ('indented_a');
open( my $output, '>', "$newname.xml" ) or die $!;
print {$output} $new_doc->sprint;
close($output);
}
Given your sample data, this does approximately what you want.
OK, so the thing you didn't mention:
Martin thanks a lot. This works great. But there's one thing I forgot to mention. My original.xml has over one million parent tags.
Is quite important - XML is a tag-matching process, which means it can't be sure it's finished and the XML is valid until it reaches the end. That normally means parsing the whole document to verify that the tags are matched.
One of the drawbacks of XML is that it's memory footprint is often around 10x the file size.
However, XML::Twig
has another useful feature - twig_handlers
and purge
.
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
sub extract_parent_elt {
my ( $twig, $parent_elt ) = @_;
#remove children if that's what you want?
$_->delete for $parent_elt->children();
#pick out the attr for our file name
my $newname = $parent_elt->att('attr1');
print "Opening:", $newname, ".xml\n";
#create a new document - insert a 'newroot' as it's root element.
my $new_doc = XML::Twig->new->set_root( XML::Twig::Elt->new('newroot') );
#cut and paste this element into the new doc.
$parent_elt->cut;
$parent_elt->paste( $new_doc->root );
#note - because we're purging and not saving the 'old' doc, a
#cut doesn't modify the original.
#open output:
open( my $output, '>', "$newname.xml" ) or die $!;
#note - pretty print has some limitations.
#specifically - there are some XML things that it breaks.
#your code doesn't _appear_ to have these.
$new_doc -> set_pretty_print('indented_a');
print {$output} $new_doc->sprint;
close($output);
#discard everything so far.
$twig->purge;
}
my $twig = XML::Twig->new(
twig_handlers => { 'parent' => \&extract_parent_elt } );
$twig -> parsefile('original.xml');
The handler fires on finding a matched 'close' element, and gets handed that chunk of XML. purge
tells twig to discard anything that's been processed thus far (e.g. any that have had 'closed' tags).
Upvotes: 1