Reputation: 2769
My question is as follows:
I have to read a big XML file, 50 MB; and anonymise some tags/fields that relate to private issues, like name surname address, email, phone number, etc...
I know exactly which tags in XML are to be anonymised.
s|<a>alpha</a>|MD5ed(alpha)|e;
s|<h>beta</h>|MD5ed(beta)|e;
where alpha
and beta
refer to any characters within, which will also be hashed, using probably an algorithm like MD5.
I will only convert the tag value, not the tags themselves.
I hope, I am clear enough about my problem. How do I achieve this?
Upvotes: 3
Views: 2352
Reputation: 16161
Using regexps is indeed dangerous, unless you know exactly the format of the file, it's easy to parse with regexps, and you are sure that it will not change in the future.
Otherwise you could indeed use XML::Twig,as below. An alternative would be to use XML::LibXML, although the file might be a bit big to load it entirely in memory (then again, maybe not, memory is cheap these days) so you might have to use the pull mode, which I don't know much about.
Compact XML::Twig code:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
use Digest::MD5 'md5_base64';
my @tags_to_anonymize= qw( name surname address email phone);
# the handler for each element ($_) sets its content with the md5 and then flushes
my %handlers= map { $_ => sub { $_->set_text( md5_base64( $_->text))->flush } } @tags_to_anonymize;
XML::Twig->new( twig_roots => \%handlers, twig_print_outside_roots => 1)
->parsefile( "my_big_file.xml")
->flush;
Upvotes: 4
Reputation: 62099
As Welbog said, don't try to parse XML with a regex. You'll regret it eventually.
Probably the easiest way to do this is using XML::Twig. It can process XML in chunks, which lets you handle very large files.
Another possibility would be using SAX, especially with XML::SAX::Machines. I've never really used that myself, but it's a stream-oriented system, so it should be able to handle large files. The downside is that you'll probably have to write more code to collect the text inside each tag that you care about (where XML::Twig will collect that text for you).
Upvotes: 3
Reputation: 391854
You have to do something like the following in Python.
import xml.etree.ElementTree as xml # or lxml or whatever
import hashlib
theDoc= xml.parse( "sample.xml" )
for alphaTag in theDoc.findall( "xpath/to/tag" ):
print alphaTag, alphaTag.text
alphaTag.text = hashlib.md5(alphaTag.text).hexdigest()
xml.dump(theDoc)
Upvotes: 6
Reputation: 60418
Bottom line: don't parse XML using regex.
Use your language's DOM parsing libraries instead, and if you know the elements you need to anonymize, grab them using XPath and hash their contents by setting their innerText/innerHTML properties (or whatever your language calls them).
Upvotes: 4