Reputation: 668
I have to process a huge XML file (>10 GB) to convert it to CSV. I am using XML::Twig.
The file contains data for around 2.6 million customers, each of which has around 100 to 150 fields (depending on the customer's profile).
I store all the values of one subscriber in the hash %customer, and when processing is done I output the values of the hash to a text file in CSV format.
The issue is performance: it takes around 6 to 8 hours to process the file. How can that be reduced?
my $t = XML::Twig->new(
    twig_handlers => {
        'objects/simple'   => \&simpleProcess,
        'objects/detailed' => \&detailedProcess,
    },
    twig_roots => { objects => 1 },
);
sub simpleProcess {
    my ($t, $simple) = @_;
    %customer = ();    # reset the hash
    $customer{id}  = $simple->first_child_text('id');
    $customer{Key} = $simple->first_child_text('Key');
}
The detailed tag includes several fields, some of them nested, so I call a separate function for each type of field.
sub detailedProcess {
    my ($t, $detailed1) = @_;
    $detailed = $detailed1;    # package global, shared with the profile subs below
    if ($detailed->has_children('profile11')) { profile11(); }
    if ($detailed->has_children('profile12')) { profile12(); }
    if ($detailed->has_children('profile13')) { profile13(); }
}
sub profile11 {
    foreach $comcb ($detailed->children('profile11')) {
        $customer{COMCBcontrol} = $comcb->first_child_text('ValueID');
    }
}
The same goes for the other functions (profile12, profile13, and so on); I am not showing them, to keep things simple. Here is a sample customer record:
<objecProfile>
    <simple>
        <id>12345</id>
        <Key>N894FE</Key>
    </simple>
    <detailed>
        <ntype>single</ntype>
        <SubscriberType>genericSubscriber</SubscriberType>
        <odbssm>0</odbssm>
        <osb1>true</osb1>
        <natcrw>true</natcrw>
        <sr>2</sr>
        <Profile11>
            <ValueID>098765</ValueID>
        </Profile11>
        <Profile21>
            <ValueID>098765</ValueID>
        </Profile21>
        <Profile22>
            <ValueID>098765</ValueID>
        </Profile22>
        <Profile61>
            <ValueID>098765</ValueID>
        </Profile61>
    </detailed>
</objectProfile>
Now the question is: I use foreach for every child, even though almost every time the child occurs only once in a customer profile. Could that be causing the delay, or are there other suggestions to improve the performance? Threading, etc.? (I googled and found that threading doesn't help much.)
Upvotes: 2
Views: 427
Reputation: 126742
I suggest using XML::LibXML::Reader. It is very efficient because it doesn't build an XML tree in memory unless you ask it to, and it is based on the excellent libxml2 library.
You will have to get used to a different API from XML::Twig's, but IMO it is still fairly simple.
This code does exactly what your own code does, and my timings suggested that 10 million records like the one you show will be processed in 30 minutes.
It works by repeatedly scanning for the next <object> element (I wasn't sure whether this should be <objecProfile>, as your question is inconsistent), copying the node and its descendants to an XML::LibXML::Element object $copy so that the subtree can be accessed, and pulling the required information out into %customer.
use strict;
use warnings;

use XML::LibXML::Reader;

my $filename = 'objects.xml';

my $reader = XML::LibXML::Reader->new(location => $filename)
    or die qq(cannot read "$filename": $!);

while ($reader->nextElement('object')) {

    my %customer;

    my $copy = $reader->copyCurrentNode(1);

    my ($simple) = $copy->findnodes('simple');
    $customer{id}  = $simple->findvalue('id');
    $customer{Key} = $simple->findvalue('Key');

    my ($detailed) = $copy->findnodes('detailed');
    $customer{COMCBcontrol} = $detailed->findvalue('(Profile11 | Profile12 | Profile13)/ValueID');

    # Do something with %customer
}
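Since the end goal is CSV output, here is a minimal sketch of what could replace the "Do something with %customer" comment, using Text::CSV with a made-up column list; the module choice, the column names beyond the three above, and the file name customers.csv are my assumptions, not part of the original code:

use Text::CSV;

my @columns = qw/ id Key COMCBcontrol /;    # extend with the real field names

my $csv = Text::CSV->new({ binary => 1, eol => "\n" })
    or die 'cannot create Text::CSV: ' . Text::CSV->error_diag;

open my $out, '>', 'customers.csv'
    or die qq(cannot write "customers.csv": $!);

$csv->print($out, \@columns);    # header row

# then, inside the while loop, in place of the comment:
$csv->print($out, [ @customer{@columns} ]);    # one CSV row per customer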
Upvotes: 2
Reputation: 7357
First, use Devel::DProf or Devel::NYTProf to figure out what is slowing your code down. But I suspect the main work happens inside the XML parser itself, so in my opinion this alone will not increase the speed greatly.
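For example, a NYTProf run is just two commands (assuming your script is called convert.pl, a name made up here for illustration):

perl -d:NYTProf convert.pl    # run under the profiler; writes nytprof.out
nytprofhtml                   # reads ./nytprof.out, generates an HTML report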
As another variant, I suggest splitting (not parsing) this XML into pieces (taking care that each piece remains well-formed XML) and running one fork per CPU to process each piece independently, with each fork producing a file of aggregate values that you then combine, as in the sketch below.
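A minimal sketch of that plan, assuming the big file has already been split into well-formed chunk files; Parallel::ForkManager, the chunk file names, and process_chunk are my own illustration, not part of the original answer:

use strict;
use warnings;

use Parallel::ForkManager;

my @chunks = glob 'chunk*.xml';             # the pre-split, well-formed pieces
my $pm     = Parallel::ForkManager->new(4); # roughly one worker per CPU

for my $chunk (@chunks) {
    $pm->start and next;    # parent: spawn a child and move on
    process_chunk($chunk);  # child: parse its piece, write "$chunk.csv"
    $pm->finish;
}
$pm->wait_all_children;     # afterwards, concatenate the per-chunk CSV files

sub process_chunk {
    my ($file) = @_;
    # run the existing XML parsing code against $file here,
    # writing its rows to "$file.csv"
}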
Or you can transform this XML into something that is parseable without an XML parser. For example: it seems you need the id, Key, and ValueID fields, so you could remove the newlines in the input file and produce another file with one objectProfile per line, then feed each line to a parser. This would allow multithreaded processing of one file, so you could use all your CPUs. The string </objectProfile> can probably serve as the record separator; you would need to study the format of your XML to decide.
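A rough sketch of that record-separator idea; this is exactly the regex shortcut defended in the P.S. below, and it assumes every record really ends with </objectProfile> and that the fields never contain nested markup:

use strict;
use warnings;

local $/ = '</objectProfile>';    # read one record per loop iteration

open my $in, '<', 'objects.xml'
    or die qq(cannot read "objects.xml": $!);

while (my $record = <$in>) {
    my ($id)      = $record =~ m{<id>(.*?)</id>};
    my ($key)     = $record =~ m{<Key>(.*?)</Key>};
    my @value_ids = $record =~ m{<ValueID>(.*?)</ValueID>}g;

    next unless defined $id;    # skip any trailing junk after the last record

    print join(',', $id, $key, @value_ids), "\n";
}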
P.S. Someone will want to downvote me with "parsing XML by yourself is bad" or links to that effect. But sometimes, when you have a heavy load or very big input data, you have a choice: do it in the "lawful" style, or do it in the given time with the given precision. The users/customers do not care how you do it; they want results.
Upvotes: 1