ohm

Reputation: 73

Perl LibXML and multiple namespaces

I have a problem I could sure use some help with. First, be gentle: I am new to both Perl and LibXML. I have been parsing a document and placing elements into an array that is then written to a spreadsheet column. During testing it was discovered that some nodes have more than one child node of the same name. I need to combine the text from each of these child nodes into one element of the array. The format of the XML is:

<Group id="V-3021"
  xmlns="http://checklists.nist.gov/xccdf/1.1"
  xmlns:dc="http://purl.org/dc/elements/1.1">
    <title>blah blah blah</title>
    <description>blah blah blah</description>
    <Rule id="SV-41507r1_rule" severity="medium" weight="10.0">
        <version>blah blah blah</version>
        <title>blah blah blah</title>
        <description>blah blah blah</description>
        <reference>
            <dc:title>blah blah blah</dc:title>
            <dc:publisher>blah blah blahO</dc:publisher>
            <dc:type>blah blah blah</dc:type>
            <dc:subject>blah blah blah</dc:subject>
            <dc:identifier>blah blah blah</dc:identifier>
        </reference>
        <fixtext fixref="F-3046r3_fix">blah blah blah</fixtext>
        <check system="C-39986r2_chk">
            <check-content-ref name="M" href="VMS_XCCDF_Benchmark_Network - Firewall -   Cisco.xml"/>
            <check-content>This is the text I want</check-content>
        </check>
    </Rule>
</Group>

But occasionally it is like this:

<Group id="V-3021"
  xmlns="http://checklists.nist.gov/xccdf/1.1"
  xmlns:dc="http://purl.org/dc/elements/1.1">
    <title>blah blah blah</title>
    <description>blah blah blah</description>
    <Rule id="SV-41507r1_rule" severity="medium" weight="10.0">
        <version>blah blah blah</version>
        <title>blah blah blah</title>
        <description>blah blah blah</description>
        <reference>
            <dc:title>blah blah blah</dc:title>
            <dc:publisher>blah blah blahO</dc:publisher>
            <dc:type>blah blah blah</dc:type>
            <dc:subject>blah blah blah</dc:subject>
            <dc:identifier>blah blah blah</dc:identifier>
        </reference>
        <fixtext fixref="F-3046r3_fix">blah blah blah</fixtext>
        <check system="C-39986r2_chk">
            <check-content-ref name="M" href="VMS_XCCDF_Benchmark_Network - Firewall - Cisco.xml"/>
            <check-content>This is the text I want</check-content>
            <check-content>This is more text that I want to grab and add to the end of the above text
            </check-content>
        </check>
    </Rule>
</Group>

I can pull all the text from "check-content", but if there is more than one it throws off the row of data in the spreadsheet. I need to be able to say something like: if there are two or more, join the data and push it into the array; if not, just push the data into the array.

Now here is the rub. I am trying to pull everything below "Rule", then parse each section (<Rule> to </Rule>) and pull the "check-content" from each of those sections of XML. By doing this I should be able to join the two "check-content" sections together before pushing the data into an array. The problem is that a namespace prefix (dc:) is used under the "reference" node. I have tried registering this namespace with no luck. I actually don't care about that section of data at all, but when I try to pull this section (<Rule> to </Rule>) I get an error message that states:

":1: namespace error : Namespace prefix dc on title is not defined s>ECAT-1, ECAT-2, ECSC-1"

my $parser = XML::LibXML->new() or die $!;
my $doc1 = $parser->parse_file($filename1);
my $xc1 = XML::LibXML::XPathContext->new($doc1->documentElement() );
$xc1->registerNs(x => 'http://checklists.nist.gov/xccdf/1.1');
$xc1->registerNs(dc => 'http://purl.org/dc/elements/1.1');


for $Check ( $xc1->findnodes('//x:Rule') ) { 

    my $doc2 = $parser->parse_string($Check); # Associate the NS with $Check
    my $xc2 = XML::LibXML::XPathContext->new($doc2->documentElement());
    $xc2->registerNs(x => 'http://checklists.nist.gov/xccdf/1.1');


    foreach $Check_Content ( $xc2->findvalue('check-content') ) { 

         push (@Check_Content1, $Check_Content);

         }


    $result_string = $Check_Content1[0] . $Check_Content1[1];
    push (@Check_Content, $result_string);
    }

Upvotes: 1

Views: 608

Answers (1)

ikegami

Reputation: 385496

At line 10 of your code, you ask XML::LibXML to parse $Check, which means you're asking XML::LibXML to parse the following:

<Rule id="SV-41507r1_rule" severity="medium" weight="10.0">
    <version>blah blah blah</version>
    <title>blah blah blah</title>
    <description>blah blah blah</description>
    <reference>
        <dc:title>blah blah blah</dc:title>
        <dc:publisher>blah blah blahO</dc:publisher>
        <dc:type>blah blah blah</dc:type>
        <dc:subject>blah blah blah</dc:subject>
        <dc:identifier>blah blah blah</dc:identifier>
    </reference>
    <fixtext fixref="F-3046r3_fix">blah blah blah</fixtext>
    <check system="C-39986r2_chk">
        <check-content-ref name="M" href="VMS_XCCDF_Benchmark_Network - Firewall - Cisco.xml"/>
        <check-content>This is the text I want</check-content>
        <check-content>This is more text that I want to grab and add to the end of the above text
        </check-content>
    </check>
</Rule>

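That's not a well-formed XML document, since it doesn't define the dc prefix. The xmlns:dc declaration lives on the enclosing Group element, which is not part of the string being parsed.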
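To see the effect in isolation, here is a minimal sketch. The two fragment strings are made up for illustration (they are not from your file): one uses the dc prefix without declaring it, the other declares it on its own root element. With the XML::LibXML versions I'm aware of, I'd expect the first parse to fail with a namespace error much like yours and the second to succeed, but the script simply reports whatever happens.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical fragment: a <reference> element serialized on its own,
# without the xmlns:dc declaration that its ancestor <Group> carried.
my $undeclared = '<reference><dc:title>blah</dc:title></reference>';

# The same fragment with the prefix declared on its own root element.
my $declared =
    '<reference xmlns:dc="http://purl.org/dc/elements/1.1">'
  . '<dc:title>blah</dc:title></reference>';

my $parser = XML::LibXML->new();

for my $xml ( $undeclared, $declared ) {
    # parse_string throws on a parse error, so trap it with eval.
    my $doc = eval { $parser->parse_string($xml) };
    if ($doc) {
        print "parsed OK:    $xml\n";
    }
    else {
        print "parse failed: $@\n";
    }
}
```

The point is only that a serialized node does not carry along declarations made on elements outside it, so re-parsing it is fragile.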

All of this is in an attempt to construct a second, needless XPathContext (XPC). This can be solved by chopping lots of code out.

my $parser = XML::LibXML->new();
my $doc = $parser->parse_file($filename);
my $xpc = XML::LibXML::XPathContext->new( $doc->documentElement() );
$xpc->registerNs(x  => 'http://checklists.nist.gov/xccdf/1.1');
$xpc->registerNs(dc => 'http://purl.org/dc/elements/1.1');

my $check_content;
for my $rule_node ( $xpc->findnodes('//x:Rule') ) { 
   for my $check_content_node (
         $xpc->findnodes('x:check/x:check-content', $rule_node) ) { 
      $check_content .= $check_content_node->textContent();
   }
}

Note the second arg to $xpc->findnodes.
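As a rough illustration of what that second argument does, the sketch below evaluates a relative XPath once per Rule and counts the matches. The inline document, the Benchmark wrapper element, and the r1/r2 ids are assumptions made up for the example; only the namespace URI comes from your file.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up sample input: two Rule elements in the XCCDF default namespace.
my $xml = <<'XML';
<Benchmark xmlns="http://checklists.nist.gov/xccdf/1.1">
  <Rule id="r1"><check><check-content>one</check-content></check></Rule>
  <Rule id="r2"><check>
    <check-content>two</check-content>
    <check-content>three</check-content>
  </check></Rule>
</Benchmark>
XML

my $doc = XML::LibXML->load_xml( string => $xml );
my $xpc = XML::LibXML::XPathContext->new( $doc->documentElement() );
$xpc->registerNs( x => 'http://checklists.nist.gov/xccdf/1.1' );

my %count;
for my $rule_node ( $xpc->findnodes('//x:Rule') ) {
    # The relative path is evaluated against $rule_node, so only that
    # Rule's own check-content elements are returned.
    my @nodes = $xpc->findnodes( 'x:check/x:check-content', $rule_node );
    $count{ $rule_node->getAttribute('id') } = scalar @nodes;
}

print "$_: $count{$_}\n" for sort keys %count;
```

Because the same $xpc is reused, the x prefix registered once is available for every Rule, and nothing has to be re-parsed.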

It didn't make much sense to use an array, so I didn't. You can always put $check_content into an array if that makes sense.
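If you do want one array element per Rule (one spreadsheet cell each), something along these lines might work. It is only a sketch: the inline document and the Benchmark wrapper are invented for the example, and trimming whitespace and joining with a single space are my assumptions about how you want the pieces combined.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up sample input modelled on the XML in the question.
my $xml = <<'XML';
<Benchmark xmlns="http://checklists.nist.gov/xccdf/1.1">
  <Rule id="r1"><check>
    <check-content>This is the text I want</check-content>
  </check></Rule>
  <Rule id="r2"><check>
    <check-content>This is the text I want</check-content>
    <check-content>This is more text
    </check-content>
  </check></Rule>
</Benchmark>
XML

my $doc = XML::LibXML->load_xml( string => $xml );
my $xpc = XML::LibXML::XPathContext->new( $doc->documentElement() );
$xpc->registerNs( x => 'http://checklists.nist.gov/xccdf/1.1' );

my @check_content;    # exactly one element per Rule
for my $rule_node ( $xpc->findnodes('//x:Rule') ) {
    my @parts = map { $_->textContent() }
        $xpc->findnodes( 'x:check/x:check-content', $rule_node );

    # Assumption: strip leading/trailing whitespace from each piece.
    s/^\s+|\s+$//g for @parts;

    # A Rule with no check-content still pushes an empty string,
    # so the rows stay aligned with the other columns.
    push @check_content, join( ' ', @parts );
}

print "$_\n" for @check_content;
```

Since the join happens inside the outer loop, there is no need to test whether a Rule has one or several check-content children.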

Of course, the following might also be an option for you:

my $check_content;
for my $check_content_node ( $xpc->findnodes('//x:Rule/x:check/x:check-content') ) { 
   $check_content .= $check_content_node->textContent();
}

Upvotes: 1
