maddy

Reputation: 83

Error when trying to split XML file using XML::LibXML module

I have been trying to split XML data using the XML::LibXML module, but it throws this error:

Can't call method "findnodes" without a package or object reference

My input

<xml>
  <bhap id="1">
    <label>cylind - I</label>
    <title>premier</title>
    <rect id="S1">
      <title>Short</title>
      <label>1.</label>
      <p><text>welcome</text></p>
    </rect>
    <rect id="S2">
      <title>Definite</title>
      <label>2.</label>
      <p><text>welcome1</text></p>
    </rect>
  </bhap>
  <bhap id="2">
    <label>cylind – II</label>
    <title>AUTHORITIES AND ITS EMPLOYEES</title>
    <rect id="S3">
      <title>nauty.&#x2014;</title>
      <label>3.</label>
      <p><text>welcome3</text></p>
    </rect>
    <rect id=S4">
      <title>Term</title>
      <label>4.</label>
      <p><text>welcome4</text></p>
    </rect>
  </bhap>
</xml>

output needed

file 1

<xml>
  <bhap id="1">
    <label>cylind - I</label>
    <title>premier</title>
    <rect id="S1">
      <title>Short</title>
      <label>1.</label>
      <p><text>welcome</text></p>
    </rect>
  </bhap>
</xml>

file 2

<xml>
  <bhap id="1">
    <label>cylind - I</label>
    <title>premier</title>
    <rect id="S2">
      <title>Definite</title>
      <label>2.</label>
      <p><text>welcome1</text></p>
    </rect>
  </bhap>
</xml>

file 3

<xml>
  <bhap id="2">
    <label>cylind – II</label>
    <title>AUTHORITIES AND ITS EMPLOYEES</title>
    <rect id="S3">
      <title>nauty.&#x2014;</title>
      <label>3.</label>
      <p><text>welcome3</text></p>
    </rect>
  </bhap>
</xml>

file 4

<xml>
  <bhap id="2">
    <label>cylind – II</label>
    <title>AUTHORITIES AND ITS EMPLOYEES</title>
    <rect id=S4">
      <title>Term</title>
      <label>4.</label>
      <p><text>welcome4</text></p>
    </rect>
  </bhap>
</xml>

my code

use XML::LibXML;

my $file   = shift || die "usage $0 <xmlfile>";
my $parser = XML::LibXML->new();
my $doc    = $parser->parse_file($file);

my @nodes = $doc->findnodes('//bhap');
foreach my $node1 (@nodes) {

    my $bhap = $node1->toString(), "\n";

    if ( $bhap =~ m/(<bhap.+?>.+?<\/title>)(.+?)(<\/bhap>)/is ) {

        my $bhap1 = $1;
        my $bhap2 = $2;
        my $bhap3 = $3;

        my $nodes1 = $bhap->findnodes('//rect');
        foreach my $node (@$nodes1) {

            my $rect = $node->toString();

            if ( $rect =~ m/(<rect\s*id="(.+?)">.+?<\/rect>)/is ) {

                my $var1 = $1;
                my $var2 = $2;

                print "file" $var2;
                print "<xml>" print $bhap1;
                print $var1;
                print $bhap3;
                print "</xml>";
            }
        }
    }
}

Upvotes: 1

Views: 493

Answers (1)

Sobrique

Reputation: 53478

OK, so you start out well, but then ... fall into the 'regular expression' trap. XML is not a good thing to parse with regular expressions, because it's just too complicated - to do it well you need to handle/validate tag nesting, line feeds and all sorts of other things that basically make your regular expression a brittle piece of code. So please don't.

But most importantly of all - ALWAYS use strict and warnings prior to posting queries. These are your first port of call for troubleshooting.

If you did you would see things like:

print "file" $var2;

That's not going to work at all - there is no operator between "file" and $var2, so it is a syntax error. Several other lines in your code have similar problems, so fixing those would be the starting point.
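On the error itself: as far as I can tell it comes from calling findnodes on $bhap, which holds the text returned by toString() rather than a node. Here is a minimal sketch of the difference. The helper name demo_string_vs_node and the inline sample are made up for illustration, and I've deliberately not quoted the exact error text:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: tries findnodes on the serialised text and on the node.
# Returns the error from the first attempt and the rect count from the second.
sub demo_string_vs_node {
    my ($xml_text) = @_;
    my $doc = XML::LibXML->load_xml( string => $xml_text );
    my ($bhap) = $doc->findnodes('//bhap');

    # toString() serialises the node; the result is plain text, not an object.
    my $as_text = $bhap->toString();
    my $error   = '';
    eval { $as_text->findnodes('./rect'); 1 } or $error = $@;

    # Calling findnodes on the node itself works; './rect' is relative to it.
    my @rects = $bhap->findnodes('./rect');

    return ( $error, scalar @rects );
}

my ( $error, $count ) = demo_string_vs_node(
    '<xml><bhap id="1"><rect id="S1"/><rect id="S2"/></bhap></xml>'
);
print "method call on the string failed: $error";
print "rect nodes found via the node: $count\n";
```

So keep the node around for as long as you need to query it, and only call toString() when you are ready to print.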

Also - your XML isn't valid - the id attribute on your 'S4' rect is missing its opening quote mark.

Anyway, assuming that's just a typo, I'd start with XML::Twig (because I understand it better than LibXML rather than any specific reason) and do something like this:

#!/usr/bin/perl

use strict;
use warnings;
use XML::Twig;

my %children_of;

#as we process, extract all the 'rect' elements - along with a reference to their context.
sub process_rect {
    my ( $twig, $rect ) = @_;
    push( @{ $children_of{ $rect->parent } }, $rect->cut );
}


my $twig = XML::Twig->new(
    'pretty_print'  => 'indented',
    'twig_handlers' => { 'rect' => \&process_rect },
);

$twig->parse( \*DATA );

#run through all the 'bhap' elements. 
foreach my $bhap ( $twig->root->children('bhap') ) {
    #find the rect elements under this bhap. 
    foreach my $rect ( @{ $children_of{$bhap} } ) {
        #create a new XML document - copy the 'root' name from your original document. 
        my $xml    = XML::Twig::Elt->new( $twig->root->name );
        #duplicate this 'bhap' element by copying it, rather than cutting it,
        #so we can paste it more than once (e.g. per 'rect')
        my $subset = $bhap->copy;
        #insert the 'bhap' into our new xml. 
        $subset->paste( last_child => $xml );
        #insert our cut rect beneath this bhap. 
        $rect->paste( last_child => $subset );

        #print the resulting XML. 
        print "--\n";
        $xml->print;
    }
}

__DATA__
<xml>

<bhap id="1">
                <label>cylind - I</label>
                <title>premier</title>
                <rect id="S1">
                    <title>Short</title>
                    <label>1.</label>
                    <p><text>welcome</text></p>
                </rect>
                <rect id="S2">
                    <title>Definite</title>
                    <label>2.</label>
                    <p><text>welcome1</text></p>
                </rect>
        </bhap>
            <bhap id="2">
                <label>cylind - II</label>
                <title>AUTHORITIES AND ITS EMPLOYEES</title>

                <rect id="S3">
                    <title>nauty.&#x2014;</title>
                    <label>3.</label>
                    <p><text>welcome3</text></p>
                </rect>

                <rect id="S4">
                    <title>Term</title>
                    <label>4.</label>
                    <p><text>welcome4</text></p>
                </rect></bhap>

</xml>

The twig handler runs during parsing and 'snips out' each rect node, remembering which bhap it came from. Then we cycle through the bhap nodes, copying each one once per rect and pasting that rect beneath the copy.

This gives output of:

--
<xml>
  <bhap id="1">
    <label>cylind - I</label>
    <title>premier</title>
    <rect id="S1">
      <title>Short</title>
      <label>1.</label>
      <p>
        <text>welcome</text>
      </p>
    </rect>
  </bhap>
</xml>
--
<xml>
  <bhap id="1">
    <label>cylind - I</label>
    <title>premier</title>
    <rect id="S2">
      <title>Definite</title>
      <label>2.</label>
      <p>
        <text>welcome1</text>
      </p>
    </rect>
  </bhap>
</xml>
--
<xml>
  <bhap id="2">
    <label>cylind - II</label>
    <title>AUTHORITIES AND ITS EMPLOYEES</title>
    <rect id="S3">
      <title>nauty.—</title>
      <label>3.</label>
      <p>
        <text>welcome3</text>
      </p>
    </rect>
  </bhap>
</xml>
--
<xml>
  <bhap id="2">
    <label>cylind - II</label>
    <title>AUTHORITIES AND ITS EMPLOYEES</title>
    <rect id="S4">
      <title>Term</title>
      <label>4.</label>
      <p>
        <text>welcome4</text>
      </p>
    </rect>
  </bhap>
</xml>

That looks at least fairly close to what you're trying to produce. I've skipped over reading the input from a file and writing each result to its own file, because reconstructing the XML is the harder part.
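If you do want one file per rect, here is a rough sketch of that missing step. It wraps the same cut/copy/paste logic in a helper and writes each result with sprint. The helper name split_to_files, the choice of naming files after the rect's id attribute, and the temporary output directory are all my assumptions, not something from your question:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
use File::Temp qw(tempdir);

# Hypothetical helper: writes one file per <rect> into $out_dir, named after
# the rect's id attribute (an assumption), and returns the file names.
sub split_to_files {
    my ( $xml_text, $out_dir ) = @_;

    my %children_of;
    my $twig = XML::Twig->new(
        pretty_print  => 'indented',
        twig_handlers => {
            # Cut each rect out and remember which parent it belonged to.
            rect => sub {
                my ( $t, $rect ) = @_;
                push @{ $children_of{ $rect->parent } }, $rect->cut;
            },
        },
    );
    $twig->parse($xml_text);

    my @written;
    foreach my $bhap ( $twig->root->children('bhap') ) {
        foreach my $rect ( @{ $children_of{$bhap} || [] } ) {
            # Same reconstruction as above: new root, copied bhap, one rect.
            my $xml    = XML::Twig::Elt->new( $twig->root->name );
            my $subset = $bhap->copy;
            $subset->paste( last_child => $xml );
            $rect->paste( last_child => $subset );

            my $name = "$out_dir/" . $rect->att('id') . '.xml';
            open my $fh, '>:encoding(UTF-8)', $name
                or die "Cannot write $name: $!";
            print {$fh} $xml->sprint;
            close $fh or die "Cannot close $name: $!";
            push @written, $name;
        }
    }
    return @written;
}

my $sample = '<xml><bhap id="1"><title>premier</title>'
    . '<rect id="S1"><label>1.</label></rect>'
    . '<rect id="S2"><label>2.</label></rect></bhap></xml>';

my $dir = tempdir( CLEANUP => 1 );
print "$_\n" for split_to_files( $sample, $dir );
```

For a real run you would read the text from your input file and pass a directory of your choosing instead of the temporary one.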

I would also suggest looking at xml_split, which ships with XML::Twig, as that might do exactly what you want anyway.
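If you would rather stay with XML::LibXML, the same idea should work there too, as long as you operate on nodes rather than on strings. This is only a sketch under my own assumptions: split_by_rect is a made-up helper, it returns the pieces as strings instead of writing files, and I use no_blanks so that toString(1) can re-indent the result:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: returns one serialised document per <rect>, each
# holding a copy of the parent <bhap> with only that <rect> inside it.
sub split_by_rect {
    my ($xml_text) = @_;

    # no_blanks drops whitespace-only text nodes so toString(1) can re-indent.
    my $doc = XML::LibXML->load_xml( string => $xml_text, no_blanks => 1 );
    my $root_name = $doc->documentElement->nodeName;
    my @pieces;

    foreach my $bhap ( $doc->findnodes('//bhap') ) {
        foreach my $rect ( $bhap->findnodes('./rect') ) {
            my $out  = XML::LibXML::Document->new( '1.0', 'UTF-8' );
            my $root = $out->createElement($root_name);
            $out->setDocumentElement($root);

            # importNode makes a deep copy owned by the new document,
            # leaving the original tree untouched.
            my $copy = $out->importNode($bhap);
            $root->appendChild($copy);

            # Remove every rect from the copy, then add back the one we want.
            $_->unbindNode for $copy->findnodes('./rect');
            $copy->appendChild( $out->importNode($rect) );

            push @pieces, $out->toString(1);
        }
    }
    return @pieces;
}

my $sample = <<'XML';
<xml>
  <bhap id="1">
    <title>premier</title>
    <rect id="S1"><label>1.</label></rect>
    <rect id="S2"><label>2.</label></rect>
  </bhap>
</xml>
XML

print "--\n$_" for split_by_rect($sample);
```

No regular expressions are involved: the bhap is copied as a tree, so its label and title come along without any pattern matching.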

Upvotes: 2
