Jorge

Reputation: 21

Change text content of XML using XML::LibXML with a regex pattern

I have to clean a Twitter corpus stored as XML. I've parsed it with `XML::LibXML`.

original.xml

<?xml version="1.0" encoding="UTF-8"?>
<tweets>
  <tweet>
    <tweetid>768213876278165504</tweetid>
    <user>OnceBukowski</user>
    <content>@caca, #holadictadura, RT no me daaaaaa la gana</content>
  </tweet>
</tweets>

main.pl

my $filename = 'original.xml';

my $dom = XML::LibXML->load_xml( location => $filename );

foreach my $tweet ( $dom->findnodes( '//tweet' ) ) {

    my ( $content ) = $tweet->findvalue( './content' );

    #say $content;

    #~ $content =~ s///g;
    $content =~ s/@//g;
    $content =~ s/#/tío/g;
    $content =~ s/ k /que/g;
    $content =~ s/ ke /que/g;
    $content =~ s/pls/por favor/g;

    #say $content;
}

I don't understand why, when I print the document with:

   print $dom->toString;

the changes I made to $content do not appear in the output.

I read that you can replace a node's content with appendText, but that is not working for me.

Upvotes: 0

Views: 342

Answers (2)

haukex

Reputation: 3013

You seem to expect $content to be an alias to the actual DOM node(s), but it is not: it's just a plain string, which you need to put back into the DOM tree. Here's one way to do that; it assumes that <content> contains only text and no other child nodes:

foreach my $tweet ($dom->findnodes('//tweet')) {
    my @content = $tweet->findnodes('./content');
    die "<tweet> didn't have exactly one <content>: $tweet"
        unless @content==1;
    my $text = $content[0]->textContent;
    $text =~ s/@//g;
    $text =~ s/#/tío/g;
    $text =~ s/ ke? / que /g;  # keep the surrounding spaces so words do not fuse
    $text =~ s/pls/por favor/g;
    $content[0]->removeChildNodes();
    $content[0]->appendText($text);
}
print $dom->toString;
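
To make the copy-versus-node distinction concrete, here is a minimal sketch. The one-tweet document and the variable names are made up for illustration. Editing the string returned by findvalue leaves the tree untouched until the string is written back:

```perl
use strict;
use warnings;
use XML::LibXML;

# Tiny stand-in document, invented for this demonstration.
my $dom = XML::LibXML->load_xml(
    string => '<tweets><tweet><content>pls RT</content></tweet></tweets>'
);
my ($node) = $dom->findnodes('//content');

# findvalue returns an ordinary Perl string, detached from the DOM.
my $copy = $node->findvalue('.');
$copy =~ s/pls/por favor/g;

# The node still holds the original text at this point.
my $before = $node->textContent;

# Write the modified string back into the tree.
$node->removeChildNodes();
$node->appendText($copy);
my $after = $node->textContent;

print "$before\n";   # pls RT
print "$after\n";    # por favor RT
```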

Upvotes: 0

choroba

Reputation: 241858

You can, for example, get the content element and set the data of its text() child to the new string:

#!/usr/bin/perl
use warnings;
use strict;
use utf8;
use feature qw{ say };

use XML::LibXML;

my $dom = 'XML::LibXML'->load_xml(IO => *DATA);

for my $tweet ($dom->findnodes('//tweet')) {
    my ($content) = $tweet->findnodes('./content');

    my $string = $content->findvalue('.');
    $string =~ s/@//g;
    $string =~ s/#/tío/g;
    $string =~ s/ k / que /g;   # keep the surrounding spaces so words do not fuse
    $string =~ s/ ke / que /g;
    $string =~ s/pls/por favor/g;

    $content->findnodes('text()')->[0]->setData($string);
}

say $dom->toString;

__DATA__
<?xml version="1.0" encoding="UTF-8"?>
<tweets>
  <tweet>
    <tweetid>768213876278165504</tweetid>
    <user>OnceBukowski</user>
    <content>@caca, #holadictadura, RT no me daaaaaa la gana</content>
  </tweet>
</tweets>
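
Note that ->[0]->setData is fragile when <content> is empty: with no text node, ->[0] is undef and the method call dies. One possible variation is to rewrite every text node under <content> in place. This is only a sketch and assumes each substitution fits inside a single text node. clean_text is a hypothetical helper, and the sample document is invented for the demonstration:

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: apply a couple of the substitutions to one string.
sub clean_text {
    my ($text) = @_;
    $text =~ s/@//g;
    $text =~ s/pls/por favor/g;
    return $text;
}

# Invented sample: one tweet with nested markup, one with empty content.
my $dom = XML::LibXML->load_xml(string => <<'XML');
<tweets>
  <tweet><content>@caca pls <b>RT</b> pls</content></tweet>
  <tweet><content/></tweet>
</tweets>
XML

# Visit each text node below <content>; an empty <content/> simply
# contributes no nodes, so nothing is dereferenced on undef.
for my $text_node ($dom->findnodes('//tweet/content//text()')) {
    $text_node->setData(clean_text($text_node->data));
}

print $dom->toString;
```

A pattern that would have to match across an element boundary (text split by the <b> element, for instance) will not match with this approach.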

Upvotes: 3
