Reputation: 275
I could do this easily in Java or C#, but doing it in shell scripting involves a lot of learning, so any help is appreciated.
I have a huge XML document whose root has many child nodes such as url (let's say 100K nodes). I need to split input.xml into subfiles of 10K nodes each, so that I get 10 files, each containing 10K nodes with the parent tag (urlset) intact.
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc> https://www.mywebsite.com/shopping </loc>
<changefreq> Weekly </changefreq>
<priority> 0.8 </priority>
<lastmod> 2016-09-22 </lastmod>
</url>
<url>
<loc> https://www.mywebsite.com/shopping </loc>
<changefreq> Weekly </changefreq>
<priority> 0.8 </priority>
<lastmod> 2016-09-22 </lastmod>
</url>
<url>
<loc> https://www.mywebsite.com/shopping </loc>
<changefreq> Weekly </changefreq>
<priority> 0.8 </priority>
<lastmod> 2016-09-22 </lastmod>
</url>
<url>
<loc> https://www.mywebsite.com/shopping </loc>
<changefreq> Weekly </changefreq>
<priority> 0.8 </priority>
<lastmod> 2016-09-22 </lastmod>
</url>
<url>
<loc> https://www.mywebsite.com/shopping </loc>
<changefreq> Weekly </changefreq>
<priority> 0.8 </priority>
<lastmod> 2016-09-22 </lastmod>
</url>
<url>
<loc> https://www.mywebsite.com/shopping </loc>
<changefreq> Weekly </changefreq>
<priority> 0.8 </priority>
<lastmod> 2016-09-22 </lastmod>
</url>
</urlset>
Upvotes: 1
Views: 669
Reputation: 53498
Short answer is yes, this is totally doable.
XML::Twig supports "cut" and "paste" operations, as well as incremental parsing (for lower memory footprint).
So you'd do something like:
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

my $elt_count    = 0;
my $elts_per_doc = 2;    #set to 10_000 for the 10K-per-file case
my $count_of_xml = 0;

#new document. Manually set xmlns - could copy this from 'original'
#instead though.
sub make_doc {
    my $doc = XML::Twig->new;
    $doc->set_root(
        XML::Twig::Elt->new(
            'urlset', { xmlns => "http://www.sitemaps.org/schemas/sitemap/0.9" }
        )
    );
    $doc->set_pretty_print('indented_a');
    return $doc;
}

#write a document to the next numbered output file.
sub write_doc {
    my ($doc) = @_;
    my $filename = "new_xml_" . $count_of_xml++ . ".xml";
    open( my $output, '>', $filename ) or die "Cannot open $filename: $!";
    print {$output} $doc->sprint;
    close($output) or die "Cannot close $filename: $!";
}

my $new_doc = make_doc();

#handle each 'url' element.
sub handle_url {
    my ( $twig, $elt ) = @_;

    #once the current doc is full, we output it,
    #then create a new one.
    if ( $elt_count >= $elts_per_doc ) {
        write_doc($new_doc);
        $new_doc   = make_doc();
        $elt_count = 0;
    }

    #cut this element, paste it into new doc.
    #note - this doesn't alter the original on disk - only the 'in memory'
    #copy. 'last_child' keeps the elements in their original order
    #(the default position is 'first_child', which would reverse them).
    $elt->cut;
    $elt->paste( last_child => $new_doc->root );
    $elt_count++;

    #purge clears what has been parsed so far from memory, which keeps
    #the footprint low on large inputs.
    $twig->purge;
}

#set a handler, start the parse.
XML::Twig->new( twig_handlers => { 'url' => \&handle_url } )
    ->parsefile('your_file.xml');

#write out whatever is left in the last (possibly partial) document;
#without this the final batch of 'url' elements is never saved.
write_doc($new_doc) if $elt_count > 0;
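The comment in the script mentions that the xmlns could be copied from the original document rather than set manually. A minimal sketch of one way to do that, assuming a start_tag_handlers entry on 'urlset' (it runs when the opening tag is parsed, before any 'url' child is handled). The variable $xmlns, the helper make_doc_with_ns and the inline sample are illustrative names, not part of the script above:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

# Illustrative inline sample; the real script would parse the input file.
my $sample = <<'XML';
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>https://www.mywebsite.com/shopping</loc></url>
</urlset>
XML

my $xmlns;    # filled in when the opening <urlset> tag is seen

# Hypothetical helper: build an empty output document that reuses
# the namespace captured from the source.
sub make_doc_with_ns {
    my ($ns) = @_;
    my $doc = XML::Twig->new;
    $doc->set_root( XML::Twig::Elt->new( 'urlset', { xmlns => $ns } ) );
    return $doc;
}

XML::Twig->new(
    start_tag_handlers => {
        # Called with the twig and the element once the start tag is read.
        urlset => sub {
            my ( $twig, $elt ) = @_;
            $xmlns = $elt->att('xmlns');
        },
    },
)->parse($sample);

my $doc = make_doc_with_ns($xmlns);
print $doc->sprint, "\n";
```

In the splitting script, the same handler could sit next to twig_handlers in the XML::Twig->new call, and the document-building code would take $xmlns instead of the hard-coded URL.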
Upvotes: 2