LibXML - Looping through nodes until

Question

I'm trying to parse the below XML using Perl's XML::LibXML library.


  
 
  
    2.1 Study purpose 
    This is study purpose content
    content 1
    content 2
    content 3 
    content 4
    3. Some Header
    obj content 4
    obj content 2

For the header "Study purpose", I'm trying to display all of its related siblings. So my expected output is:

2.1 Study purpose 
This is study purpose content
content 1
content 2
content 3 
content 4

My Perl code is below. I can display the first node.

Given the value of the first node, "Study purpose", is there a way I can loop and print all the nodes until I hit a node whose text contains a digit followed by a '.'?

My Perl implementation:

my $purpose_str = 'Purpose and rationale|Study purpose|Study rationale';
$parser = XML::LibXML->new;
#print "Parser for file $file is: $parser\n";
$dom = $parser->parse_file($file);

$root = $dom->getDocumentElement;
$dom->setDocumentElement($root);

for my $purpose_search('/TaggedPDF-doc/Part/Sect/H4')
{
    $purpose_nodeset = $dom->find($purpose_search);
    foreach my $purp_node ($purpose_nodeset -> get_nodelist)
    {
        if ($purp_node =~ m/$purpose_str/i)
        {
            #Get the corresponding child nodes
            @childnodes = $purp_node->nonBlankChildNodes();

            $first_kid = shift @childnodes;
            $second_kid = $first_kid->nextNonBlankSibling();
            #$third_kid = $second_kid->nextNonBlankSibling();

            $first_kid -> string_value;
            $second_kid -> string_value;
            #$third_kid -> string_value;
        }

        print "Study Purpose is: $first_kid\n.$second_kid\n";
    }
}

choroba · Accepted Answer

If you want siblings, do not look at child nodes; the following-sibling XPath axis selects them. To match against a node's text content, use textContent.

#!/usr/bin/perl
use warnings;
use strict;
use XML::LibXML;

my $file        = 'input.xml';
my $purpose_str = 'Purpose and rationale|Study purpose|Study rationale';
my $dom         = XML::LibXML->load_xml(location => $file);

for my $purpose_search('/TaggedPDF-doc/Part/Sect/H4')
{
    my $purpose_nodeset = $dom->find($purpose_search);
    for my $purp_node ($purpose_nodeset -> get_nodelist)
    {
        if ($purp_node->textContent =~ m/$purpose_str/i)
        {
            my @siblings = $purp_node->find('following-sibling::*')
                           ->get_nodelist;

            for my $i (0 .. $#siblings)
            {
                if ($siblings[$i]->textContent =~ /^[0-9]+\./)
                {
                    splice @siblings, $i;
                    last;
                }
            }

            print $_->textContent, "\n" for @siblings;
        }

    }
}
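
The same idea can also be written as a loop that stops at the first numbered heading, which avoids the splice and also keeps the matched header so the result lines up with the expected output in the question. This is only a sketch under assumptions: the element names of the content nodes did not survive in the question, so the inline sample below uses a hypothetical <P> for them, and the variables $xml, $purpose_re and @section are names made up for illustration.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use XML::LibXML;

# Hypothetical sample document. Only the path /TaggedPDF-doc/Part/Sect/H4
# comes from the question; the <P> elements are an assumption.
my $xml = <<'XML';
<TaggedPDF-doc>
  <Part>
    <Sect>
      <H4>2.1 Study purpose </H4>
      <P>This is study purpose content</P>
      <P>content 1</P>
      <P>content 2</P>
      <P>content 3 </P>
      <P>content 4</P>
      <H4>3. Some Header</H4>
      <P>obj content 4</P>
      <P>obj content 2</P>
    </Sect>
  </Part>
</TaggedPDF-doc>
XML

my $purpose_re = qr/Purpose and rationale|Study purpose|Study rationale/i;
my $dom        = XML::LibXML->load_xml(string => $xml);

my @section;
for my $heading ($dom->findnodes('/TaggedPDF-doc/Part/Sect/H4'))
{
    next unless $heading->textContent =~ $purpose_re;

    # Keep the header itself, then walk its following siblings in
    # document order until one starts with "digits followed by a dot".
    push @section, $heading->textContent;
    for my $sibling ($heading->findnodes('following-sibling::*'))
    {
        my $text = $sibling->textContent;
        last if $text =~ /^\s*[0-9]+\./;
        push @section, $text;
    }
    last;    # only the first matching header is handled in this sketch
}

print "$_\n" for @section;
```

To run it against a real file, replacing string => $xml with location => $file should be enough, provided the stop condition (text starting with a number and a dot) really does identify the next header in that file.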
