JerseyDevel

Reputation: 1413

Perl XML::LibXML Getting info from specific nodes

I realize that there are many similar questions, but I am still unable to find the specific answer that I am looking for.

I am using Perl with the XML::LibXML library to read information from an XML file. The XML file has many nodes and many child nodes (and child child nodes, etc). I am trying to pull the information out of the XML file 'per node' but am really getting into the weeds trying to figure out how to do that.

Here is just an example of what I am trying to do:

#!/usr/bin/perl -w

use XML::LibXML;

open(my $xml_fh, '<', 'test.xml') or die "Cannot open test.xml: $!";
my $dom = XML::LibXML->load_xml(IO => $xml_fh);
close($xml_fh);

foreach $chapter ($dom->findnodes('/file/chapter')) {
        my $chapterNumber = $chapter->findvalue('@number');
        print "Chapter #$chapterNumber\n";

         #I tried $dom->findnodes('/file/chapter/section') <-- spelling out the xPath with same results..
        foreach $section ($dom->findnodes('//section')) {
                my $sectionNumber = $section->findvalue('@number');
                print " Section #$sectionNumber\n";

                foreach $subsection ($dom->findnodes('//subsection')) {
                        my $subsectionNumber = $subsection->findvalue('@number');
                        print "  SubSection $subsectionNumber\n";
                }
        }
}

This specific XML file is set up like this:

<file>
 <chapter number="1">
  <section number="abc123">
   There is some data here I'd like to get to
   <subsection number="abc123.(s)(4)">
    Some additional data here
    <subsection number="deeperSubSec">
     There might even be deeper subsections
     </subsection>
   </subsection>
  </section>
 </chapter>
 <chapter number="208">
  <section number="dgfj23">
   There is some data here I'd like to get to also
   <subsection number="dgfj23.(s)(4)">
    Some additional data here also
    <subsection number="deeperSubSec44">
     There might even be deeper subsections also
     </subsection>
   </subsection>
  </section>
 </chapter>
<chapter number="998">
  <section number="xxxid">
   There is even more data here I'd like to get to also
   <subsection number="xxxid.(s)(4)">
    Some additional data also here too
    <subsection number="deeperSubSec999">
     There might even be deeper subsections also again
     </subsection>
   </subsection>
  </section>
 </chapter>
</file>

Unfortunately, what I wind up with is just a list of repeating data. I am sure that this is because of my nested for loops, but I am really not grasping the fundamentals of how to operate on this kind of data. Hopefully someone has some resources or insight they could provide.

Here is my current output:

Chapter #1
 Section #abc123
  SubSection abc123.(s)(4)
  SubSection deeperSubSec
  SubSection dgfj23.(s)(4)
  SubSection deeperSubSec44
  SubSection xxxid.(s)(4)
  SubSection deeperSubSec999
 Section #dgfj23
  SubSection abc123.(s)(4)
  SubSection deeperSubSec
  SubSection dgfj23.(s)(4)
  SubSection deeperSubSec44
  SubSection xxxid.(s)(4)
  SubSection deeperSubSec999
 Section #xxxid
  SubSection abc123.(s)(4)
  SubSection deeperSubSec
  SubSection dgfj23.(s)(4)
  SubSection deeperSubSec44
  SubSection xxxid.(s)(4)
  SubSection deeperSubSec999
Chapter #208
 Section #abc123
  SubSection abc123.(s)(4)
  SubSection deeperSubSec
  SubSection dgfj23.(s)(4)
  SubSection deeperSubSec44
  SubSection xxxid.(s)(4)
  SubSection deeperSubSec999
 Section #dgfj23
  SubSection abc123.(s)(4)
  SubSection deeperSubSec
  SubSection dgfj23.(s)(4)
  SubSection deeperSubSec44
  SubSection xxxid.(s)(4)
  SubSection deeperSubSec999
 Section #xxxid
  SubSection abc123.(s)(4)
  SubSection deeperSubSec
  SubSection dgfj23.(s)(4)
  SubSection deeperSubSec44
  SubSection xxxid.(s)(4)
  SubSection deeperSubSec999
Chapter #998
 Section #abc123
  SubSection abc123.(s)(4)
  SubSection deeperSubSec
  SubSection dgfj23.(s)(4)
  SubSection deeperSubSec44
  SubSection xxxid.(s)(4)
  SubSection deeperSubSec999
 Section #dgfj23
  SubSection abc123.(s)(4)
  SubSection deeperSubSec
  SubSection dgfj23.(s)(4)
  SubSection deeperSubSec44
  SubSection xxxid.(s)(4)
  SubSection deeperSubSec999
 Section #xxxid
  SubSection abc123.(s)(4)
  SubSection deeperSubSec
  SubSection dgfj23.(s)(4)
  SubSection deeperSubSec44
  SubSection xxxid.(s)(4)
  SubSection deeperSubSec999

So for each chapter I am reading ALL sections, and for each of those I am reading ALL subsections, over and over again.

What I want to do is read, for each chapter, the associated sections, then for each of those sections, the associated subsections and any applicable sub-subsections therein.

Like this:

Chapter #1
 Section #abc123
  Subsection #abc123.(s)(4)
   Sub-Subsection #deeperSubSec
Chapter #208
 Section #dgfj23
  Subsection #dgfj23.(s)(4)
   Sub-Subsection #deeperSubSec44

etc...

Eventually, once I understand how the basic operation works, I'll also need to get at the data contained within each chapter, section, subsection, etc. But I think I need to walk before I run, so I'll start with just the values of the attributes.

Thank you for your help.

Upvotes: 1

Views: 469

Answers (1)

JerseyDevel

Reputation: 1413

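So I think I figured it out. I was calling findnodes on the $dom object the entire time, and $dom holds the entire XML tree, so every inner loop searched the whole document again. (The '//section' form would do that in any case: a path that starts with / or // is absolute and is evaluated from the document root, whatever node it is called on.) What I needed to do was call findnodes on the node from the enclosing loop, with a relative path, like this: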

#!/usr/bin/perl

use strict;
use warnings;
use XML::LibXML;

open(my $xml_fh, '<', 'test.xml') or die "Cannot open test.xml: $!";
my $dom = XML::LibXML->load_xml(IO => $xml_fh);
close($xml_fh);

# Each inner loop searches only below the node from the enclosing loop.
for my $chapter ($dom->findnodes('/file/chapter')) {
        print "Chapter #" . $chapter->findvalue('@number') . "\n";
        for my $section ($chapter->findnodes('section')) {
                print " Section #" . $section->findvalue('@number') . "\n";
                for my $subsection ($section->findnodes('subsection')) {
                        print "  Subsection #" . $subsection->findvalue('@number') . "\n";
                }
        }
}

which results in output more like I was hoping for:

Chapter #1
 Section #abc123
  Subsection #abc123.(s)(4)
Chapter #208
 Section #dgfj23
  Subsection #dgfj23.(s)(4)
Chapter #998
 Section #xxxid
  Subsection #xxxid.(s)(4)
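
Note that this output stops at the first level of subsections, because a relative path such as 'subsection' only matches direct children of the node it is called on. The sketch below compares three path forms called on the first chapter. It uses a trimmed inline copy of the sample XML instead of test.xml (an assumption, so that it runs on its own), and count_matches is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Trimmed inline copy of the sample document (assumption: same shape as test.xml).
my $xml = <<'XML';
<file>
 <chapter number="1">
  <section number="abc123">
   <subsection number="abc123.(s)(4)">
    <subsection number="deeperSubSec"/>
   </subsection>
  </section>
 </chapter>
 <chapter number="208">
  <section number="dgfj23">
   <subsection number="dgfj23.(s)(4)">
    <subsection number="deeperSubSec44"/>
   </subsection>
  </section>
 </chapter>
</file>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my ($chapter) = $dom->findnodes('/file/chapter[1]');

# Hypothetical helper: number of nodes an XPath expression matches from $node.
sub count_matches {
    my ($node, $xpath) = @_;
    my @nodes = $node->findnodes($xpath);
    return scalar @nodes;
}

# '//subsection' is absolute: it starts at the document root, so the chapter is ignored.
my $absolute    = count_matches($chapter, '//subsection');
# 'section/subsection' is relative: subsections directly under this chapter's sections.
my $children    = count_matches($chapter, 'section/subsection');
# './/subsection' is relative: subsections at any depth below this chapter.
my $descendants = count_matches($chapter, './/subsection');

print "absolute=$absolute children=$children descendants=$descendants\n";
```

With this inline document it prints absolute=4 children=1 descendants=2, so './/subsection' is the form to use when the nested subsections of one chapter are wanted as a flat list.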

Here is a tidier version of the nested loops. Storing each result in its own array makes it clearer that every findnodes call is made on the node from the enclosing loop, not on the whole document:

#!/usr/bin/perl

use strict;
use warnings;
use XML::LibXML;

open(my $xml_fh, '<', 'test.xml') or die "Cannot open test.xml: $!";
my $dom = XML::LibXML->load_xml(IO => $xml_fh);
close($xml_fh);

my @chapters = $dom->findnodes('/file/chapter');

for my $chapter (@chapters) {
        my $chapterNo = $chapter->findvalue('@number');
        print "Chapter #$chapterNo\n";

        # Sections that are direct children of this chapter only
        my @sections = $chapter->findnodes('section');
        for my $section (@sections) {
                my $sectionNo = $section->findvalue('@number');
                print " Section #$sectionNo\n";

                # Subsections that are direct children of this section only
                my @subsections = $section->findnodes('subsection');
                for my $subsection (@subsections) {
                        my $subsectionNo = $subsection->findvalue('@number');
                        print "  Subsection #$subsectionNo\n";
                }
        }
}
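
The loops above still stop at the first level of subsections, while the question also asked for the deeper ones and, eventually, the text inside each node. One possible way to get both is a small recursive function, sketched here on an inline copy of one chapter from the sample (an assumption, so that it runs without test.xml). The name walk_subsections is hypothetical, and reading only the first text child via text()[1] is my assumption about which text is wanted.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Inline copy of one chapter from the sample (assumption: same shape as test.xml).
my $xml = <<'XML';
<file>
 <chapter number="1">
  <section number="abc123">
   There is some data here I'd like to get to
   <subsection number="abc123.(s)(4)">
    Some additional data here
    <subsection number="deeperSubSec">
     There might even be deeper subsections
    </subsection>
   </subsection>
  </section>
 </chapter>
</file>
XML

# Hypothetical helper: returns one indented line for $node, followed by the
# lines for every subsection nested inside it, however deep.
sub walk_subsections {
    my ($node, $depth) = @_;
    my $number = $node->getAttribute('number');
    # text()[1] is the node's own first text child, not its descendants' text.
    my $text  = $node->findvalue('normalize-space(text()[1])');
    my @lines = (('  ' x $depth) . "Subsection #$number: $text");
    # Recurse into direct children only; each call handles its own level.
    push @lines, walk_subsections($_, $depth + 1) for $node->findnodes('subsection');
    return @lines;
}

my $dom = XML::LibXML->load_xml(string => $xml);
my @lines;
for my $section ($dom->findnodes('/file/chapter/section')) {
    push @lines, 'Section #' . $section->getAttribute('number');
    push @lines, walk_subsections($_, 1) for $section->findnodes('subsection');
}
print "$_\n" for @lines;
```

Each call only looks at the direct 'subsection' children of the node it was given, so the indentation follows the real nesting and no node is visited twice.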

Upvotes: 3
