Reputation: 336
I am processing a large XML file (about 1 GB) with XML::Twig using twig_handlers. The file is divided into Entry elements, and each Entry tag contains all of its sub-tags.
I want a mechanism that checks whether an Entry was already processed in an earlier run. The idea is to save the MD5 digest of each Entry, and on the next run skip any Entry whose digest is unchanged. Currently I do this check inside the Entry handler, which does not help much because the twig for the Entry has already been built by the time I check the digest. Is it possible to check the digest of each Entry before the twig is built?
Here is a synopsis of my code:
XML::Twig->new(
    twig_handlers => {
        'Entry' => sub {
            if (not exists_digest($_->outer_xml)) {
                # do something
            }
        },
    }
)->parsefile('myfile.xml');
Upvotes: 2
Views: 138
Reputation: 16171
I don't see any simple way to do this. The only way to get the text of the entry is to build the twig. Is there any id on the elements that you could use? If the id is constant between runs then you don't have to re-compute the MD5. But in any case the entire file is going to be parsed. You can't jump around the file without parsing each element.
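If the entries do carry a stable id, a rough sketch of the idea could look like the following. It assumes an `id` attribute on every Entry (not shown in the question) and uses a hypothetical `%done` hash standing in for ids saved by an earlier run. The attributes are already available in `start_tag_handlers`, and calling `ignore` there should tell XML::Twig not to build the twig for that element. The file is still parsed from start to finish.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical: ids recorded by an earlier run (load these from disk in practice).
my %done = (e1 => 1);
my @processed;

my $twig = XML::Twig->new(
    # Fires on the opening tag, when only the attributes are available.
    start_tag_handlers => {
        'Entry' => sub {
            my ($t, $elt) = @_;
            my $id = $elt->att('id');    # assumes every Entry has an id attribute
            $t->ignore if defined $id && $done{$id};
        },
    },
    # Expected to fire only for entries that were not ignored above.
    twig_handlers => {
        'Entry' => sub {
            my ($t, $elt) = @_;
            push @processed, $elt->att('id');
            $t->purge;                   # free memory used by finished entries
        },
    },
);

# Small inline document for illustration; use parsefile('myfile.xml') for the real file.
$twig->parse('<Entries><Entry id="e1">old</Entry><Entry id="e2">new</Entry></Entries>');
```

This only helps when the id changes whenever the content changes. If the content can change under the same id, the digest still has to be computed from the built twig.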
Upvotes: 1
Reputation: 3483
I'm not sure if this is an available option with XML::Twig (it could be, I just don't know) but you can do this on your own using Digest::MD5 and a hash. Use a hash to keep a record of what MD5 values you've already seen:
use XML::Twig;
use Digest::MD5 qw(md5);
my %exists_digest;
XML::Twig->new(
    twig_handlers => {
        'Entry' => sub {
            my $md5 = md5($_->outer_xml);
            if (!exists $exists_digest{$md5}) {
                $exists_digest{$md5} = 1;
                # do something
            }
        },
    }
)->parsefile('myfile.xml');
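A hash in memory is lost when the script exits, so to skip entries on a later run the digests have to be saved somewhere. Here is a minimal sketch of one way to do that with the core Storable module. The file name `entry_digests.sto` and the helper names `already_processed` and `save_digests` are made up for illustration. The text is encoded to UTF-8 bytes first because Digest::MD5 croaks on strings containing wide characters.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);
use Storable qw(retrieve nstore);

# Hypothetical file name for the saved digests; any writable path works.
my $digest_file = 'entry_digests.sto';

# Load digests recorded by a previous run, or start with an empty set.
my $seen = -e $digest_file ? retrieve($digest_file) : {};

# Return 1 if this entry text was seen before; otherwise record it and return 0.
sub already_processed {
    my ($xml) = @_;
    my $digest = md5_hex(encode_utf8($xml));
    return 1 if exists $seen->{$digest};
    $seen->{$digest} = 1;
    return 0;
}

# Write the digest set back to disk for the next run.
sub save_digests {
    nstore($seen, $digest_file);
}
```

Inside the Entry handler you would call `already_processed($_->outer_xml)` and skip the work when it returns true, then call `save_digests()` once after `parsefile` returns. Each twig is still built before the check, so this saves the processing time, not the parsing time.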
Upvotes: 1