Scott

Reputation: 37

Process quoted string within XML

Perl version: perl, v5.10.1 (*) built for x86_64-linux-thread-multi

I am a relative newbie to Perl. I have tried looking at the various XML processing modules for Perl: XML::Simple, XML::Parser, XML::LibXML, XML::DOM, XML::Twig, XML::XPath, etc.

I am trying to process some XML that has quotes in the value portion. Specifically, I want to extract the title from the XML below. I've been stumbling over this for a bit now and would appreciate some help.

$VAR1 = {
   'issue' => {
       'priority' => {
             'fid' => '11',
             'content' => '3 - Best Effort'
           },
       'transNum' => {
             'fid' => '2',
             'content' => '170'
           },
       'dueDate' => {
             'fid' => '17',
             'content' => '1327944695'
           },
       'status' => {
             'fid' => '18',
             'content' => 'Open - Unassigned'
           },
       'createdBy' => {
             'fid' => '15',
             'content' => '32'
           },
       'title' => {
             'fid' => '20',
             'content' => 'Testing on spider - issue with "quotation marks"'
           },
       'description' => {
             'fid' => '22',
             'content' => 'Noticed issue with title having quotes in title'
           },
       'issueNum' => {
             'fid' => '1',
             'content' => '33'
           }
   }
};

Using XML::LibXML and the following code (note: the above is a printout of the contents of the $issueXML variable):

my $parser = XML::LibXML->new();
my $doc = $parser->parse_string($issueXML);
print $doc->toString;

This prints out:

<?xml version="1.0" encoding="utf-8"?>
<issues>
 <issue>
   <issueNum fid="1">33</issueNum>
   <transNum fid="2">170</transNum>
   <createdBy fid="15">32</createdBy>
   <status fid="18">Open - Unassigned</status>
   <title fid="20">Testing on spider - issue with "quotation marks"</title>
   <priority fid="11">3 - Best Effort</priority>
   <description fid="22">Noticed issue with submission of Documentation issue #40 on accurev with quotes in title. </description>
  <dueDate fid="17">1327944695</dueDate>
 </issue>
</issues>

I specifically want to extract the value of the title element. When I was processing with XML::Parser, I kept ending up with just the final quote mark. I would like the string to keep its original format when displayed:
Testing on spider - issue with "quotation marks"

I am a bit overwhelmed at the moment by the various XML processing functions. I have tried for a while now to figure this out, and I am seriously spinning my wheels.

TIA, Appreciate any help,

Regards, Scott

Upvotes: 2

Views: 280

Answers (4)

zgpmax

Reputation: 2847

Your best way of pulling bits out of XML is with an XPath query.

In this case you are looking for the element 'title', inside an element 'issue', inside an element 'issues'.

So your XPath query is simply '//issues/issue/title'.

In two lines of code, you can use XML::LibXML::XPathContext to perform the XPath query for you. The find call returns a node list, which stringifies to the text content of the element you are looking for.

This code snippet demonstrates a simple way of doing an XPath query. The important part is the two lines following the comment "Relevant bit here".

For more information, see the documentation for XML::LibXML::XPathContext.

#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

my $xml = XML::LibXML->load_xml(string => q{<?xml version="1.0" encoding="utf-8"?>
<issues>
 <issue>
   <issueNum fid="1">33</issueNum>
   <transNum fid="2">170</transNum>
   <createdBy fid="15">32</createdBy>
   <status fid="18">Open - Unassigned</status>
   <title fid="20">Testing on spider - issue with "quotation marks"</title>
   <priority fid="11">3 - Best Effort</priority>
   <description fid="22">Noticed issue with submission of Documentation issue #40 on accurev with quotes in title. </description>
  <dueDate fid="17">1327944695</dueDate>
 </issue>
</issues>
});

# Relevant bit here
my $xc = XML::LibXML::XPathContext->new($xml);
my $title = $xc->find('//issues/issue/title');
print "$title\n";

# prints:
# Testing on spider - issue with "quotation marks"
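
If all you need is the text, a slightly shorter variant is possible. This is only a sketch, and it swaps in the findvalue method (available on XML::LibXML documents and nodes) in place of the XPathContext find call above; the XML is a trimmed copy of the question's document, and the extra lookup of the fid attribute is added purely for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Trimmed copy of the document from the question.
my $doc = XML::LibXML->load_xml(string => q{<?xml version="1.0" encoding="utf-8"?>
<issues>
 <issue>
   <issueNum fid="1">33</issueNum>
   <title fid="20">Testing on spider - issue with "quotation marks"</title>
 </issue>
</issues>
});

# findvalue evaluates the XPath expression and returns a plain string.
my $title = $doc->findvalue('/issues/issue/title');

# The same call works for an attribute selected with @name.
my $fid = $doc->findvalue('/issues/issue/title/@fid');

print "$fid: $title\n";

# prints:
# 20: Testing on spider - issue with "quotation marks"
```

Because findvalue hands back an ordinary Perl string, the quotation marks need no special handling afterwards.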

Upvotes: 0

Leonardo Herrera

Reputation: 8406

Another go with XML::LibXML. You should have no problems with quotation marks inside text nodes.

#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
use Data::Dumper;

my $xml = XML::LibXML->load_xml(string => q{<?xml version="1.0" encoding="utf-8"?>
<issues>
 <issue>
   <issueNum fid="1">33</issueNum>
   <transNum fid="2">170</transNum>
   <createdBy fid="15">32</createdBy>
   <status fid="18">Open - Unassigned</status>
   <title fid="20">Testing on spider - issue with "quotation marks"</title>
   <priority fid="11">3 - Best Effort</priority>
   <description fid="22">Noticed issue with submission of Documentation issue #40 on accurev with quotes in title. </description>
  <dueDate fid="17">1327944695</dueDate>
 </issue>
</issues>
});

my $title = $xml->find('/issues/issue/title');
# get_node is 1-based, so get_node(1) is the first match
print $title->get_node(1)->textContent, "\n";

Upvotes: 2

choroba

Reputation: 241858

I usually use XML::XSH2 for XML manipulation. Your problem simplifies to:

open FILE.xml ;
for //title echo (.) ;

Upvotes: 0

mirod

Reputation: 16161

I am not sure what problem you run into with the quotation marks. They're just a character like any other, except in attribute values where you may have to use an entity if the quote is already used as the value delimiter. Are you sure the "problem" is not just with the way Data::Dumper displays the data structure generated by XML::Simple?
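
To check whether the display is the culprit, a small experiment along these lines may help. It is a sketch only: the hash is a hand-written stand-in for the title part of the structure in the question, not real XML::Simple output, and it uses nothing beyond the core Data::Dumper module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical stand-in for the structure shown in the question,
# trimmed to the title element only.
my $data = {
    issue => {
        title => {
            fid     => '20',
            content => 'Testing on spider - issue with "quotation marks"',
        },
    },
};

# Data::Dumper wraps each value in single quotes for display only;
# those outer quotes are not part of the stored string.
$Data::Dumper::Sortkeys = 1;
print Dumper($data);

# The stored string still contains both double quotes.
my $title = $data->{issue}{title}{content};
print "$title\n";

# last line printed:
# Testing on spider - issue with "quotation marks"
```

If the last line comes out intact with your real data too, the quotes were never lost and only the dump format was misleading.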

In any case, stay away from XML::Parser, which is too low-level; use XML::LibXML or XML::Twig instead. XML::Simple seems to generate a lot of questions, especially from people not familiar with Perl, so I am not sure it's the right tool to use.

Here is a solution with XML::Twig, but there are many other ways to do this, depending on exactly what you want to do with the titles.

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

my $issueXML=q{<?xml version="1.0" encoding="utf-8"?>
<issues>
 <issue>
   <issueNum fid="1">33</issueNum>
   <transNum fid="2">170</transNum>
   <createdBy fid="15">32</createdBy>
   <status fid="18">Open - Unassigned</status>
   <title fid="20">Testing on spider - issue with "quotation marks"</title>
   <priority fid="11">3 - Best Effort</priority>
   <description fid="22">Noticed issue with submission of Documentation issue #40 on accurev with quotes in title. </description>
  <dueDate fid="17">1327944695</dueDate>
 </issue>
</issues>
};

my $t= XML::Twig->new( twig_handlers => { title => sub { print $_->text, "\n"; } })
                ->parse( $issueXML);

Upvotes: 2
