Reputation: 21
I have to clean a Twitter corpus in XML, which I've parsed with `XML::LibXML`.
<?xml version="1.0" encoding="UTF-8"?>
<tweets>
<tweet>
<tweetid>768213876278165504</tweetid>
<user>OnceBukowski</user>
<content>@caca, #holadictadura, RT no me daaaaaa la gana</content>
</tweet>
<tweet>
use XML::LibXML;

my $filename = 'original.xml';
my $dom = XML::LibXML->load_xml( location => $filename );
foreach my $tweet ( $dom->findnodes( '//tweet' ) ) {
    my ( $content ) = $tweet->findvalue( './content' );
    #say $content;
    #~ $content =~ s///g;
    $content =~ s/@//g;
    $content =~ s/#/tío/g;
    $content =~ s/ k /que/g;
    $content =~ s/ ke /que/g;
    $content =~ s/pls/por favor/g;
    #say $content;
}
I don't understand why, when I print the document with
print $dom->toString;
the changes I made to $content do not appear in the output.
I read that you can replace the content of a node with appendText, but this is not working for me.
Upvotes: 0
Views: 342
Reputation: 3013
You seem to expect $content to be an alias for the actual DOM node, but it is not: findvalue returns a plain string, which you need to put back into the DOM tree yourself. Here's one way to do that; it assumes that <content> contains only text and no other child nodes:
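foreach my $tweet ($dom->findnodes('//tweet')) {
    my @content = $tweet->findnodes('./content');
    die "<tweet> didn't have exactly one <content>: $tweet"
        unless @content==1;
    # textContent returns a copy of the text, not a reference into the DOM
    my $text = $content[0]->textContent;
    $text =~ s/@//g;
    $text =~ s/#/tío/g;
    $text =~ s/ ke? / que /g;   # keep the surrounding spaces so neighbouring words are not joined
    $text =~ s/pls/por favor/g;
    # replace the old children with a single text node holding the cleaned string
    $content[0]->removeChildNodes();
    $content[0]->appendText($text);
}
print $dom->toString;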
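If <content> can contain child elements as well as text, replacing all of its children would discard that markup. A minimal sketch of an alternative, assuming the same kind of substitutions: edit each text node below <content> in place with setData. The helper name clean_text and the sample input with a <b> child are made up for illustration and are not from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use XML::LibXML;

# Hypothetical helper (not from the original post):
# applies the clean-up substitutions to one string and returns the result.
sub clean_text {
    my ($text) = @_;
    $text =~ s/@//g;
    $text =~ s/#/tío/g;
    $text =~ s/pls/por favor/g;
    return $text;
}

# Illustrative input; the <b> child element is an assumption for this demo.
my $xml = '<tweets><tweet><content>@caca <b>pls</b> #hola</content></tweet></tweets>';
my $dom = XML::LibXML->load_xml( string => $xml );

# Visit every text node below <content>, so child elements stay where they are.
for my $node ( $dom->findnodes('//tweet/content//text()') ) {
    $node->setData( clean_text( $node->data ) );
}

print $dom->toString;
```

Because each text node is cleaned separately, a pattern that would span an element boundary (for example a word split across <b> and its parent) will not match; whether that matters depends on the corpus.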
Upvotes: 0
Reputation: 241858
You can, for example, get the content element and set the data of its text() child to the new string:
#!/usr/bin/perl
use warnings;
use strict;
use utf8;
use feature qw{ say };
use XML::LibXML;
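my $dom = 'XML::LibXML'->load_xml(IO => *DATA);
for my $tweet ($dom->findnodes('//tweet')) {
    my ($content) = $tweet->findnodes('./content');
    # findvalue returns a copy of the element's text as a plain string
    my $string = $content->findvalue('.');
    $string =~ s/@//g;
    $string =~ s/#/tío/g;
    # keep the surrounding spaces so neighbouring words are not joined
    $string =~ s/ k / que /g;
    $string =~ s/ ke / que /g;
    $string =~ s/pls/por favor/g;
    # write the cleaned string back into the first text node of <content>
    $content->findnodes('text()')->[0]->setData($string);
}
say $dom->toString;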
__DATA__
<?xml version="1.0" encoding="UTF-8"?>
<tweets>
<tweet>
<tweetid>768213876278165504</tweetid>
<user>OnceBukowski</user>
<content>@caca, #holadictadura, RT no me daaaaaa la gana</content>
</tweet>
</tweets>
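Note that the ->[0]->setData call relies on <content> having a text child. For an empty <content/> the node list is empty, and calling setData on the missing node dies. A minimal sketch of a guard, assuming a hypothetical helper named set_text (not part of XML::LibXML):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: set an element's text even when it has no text child yet.
sub set_text {
    my ($element, $string) = @_;
    my ($text_node) = $element->findnodes('text()');
    if (defined $text_node) {
        $text_node->setData($string);     # reuse the existing text node
    }
    else {
        $element->appendText($string);    # empty element: add the text
    }
    return $element;
}

# Illustrative input with an empty <content/> element.
my $dom = XML::LibXML->load_xml(
    string => '<tweets><tweet><content/></tweet></tweets>' );
my ($content) = $dom->findnodes('//content');
set_text( $content, 'sin texto' );
print $dom->toString;
```

In the loop above, the last line of the body could then become set_text($content, $string).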
Upvotes: 3