ssr1012
ssr1012

Reputation: 2589

Find the Level of Sections in XML Structured Document - perl

Find the Level of Sections in XML Structured Document - perl Input:

<section>
   <para>...level 1</para>
   <para>...level 1</para>
   <para>...level 1</para>
   <section>
      <para>...level 2</para>
      <para>...level 2</para>
      <section>
         <para>...level 3</para>
         <para>...level 3</para>
         <para>...level 3</para>
      </section>
      <para>...level 2</para>
   </section>
   <section>
      <para>...level 2</para>
      <para>...level 2</para>
      <para>...level 2</para>
   </section>
</section>
<section>
   <para>...level 1</para>
   <para>...level 1</para>
   <para>...level 1</para>
   <section>
      <para>...level 2</para>
      <para>...level 2</para>
      <para>...level 2</para>
   </section>
   <section>
      <para>...level 2</para>
      <para>...level 2</para>
      <para>...level 2</para>
   </section>
</section>

I need to fetch all the section level elements and insert the value based on the levels. And the desired output like as follows:

<section1>
<para>...level 1</para>
<para>...level 1</para>
<para>...level 1</para>
   <section2>
   <para>...level 2</para>
   <para>...level 2</para>
      <section3>
      <para>...level 3</para>
      <para>...level 3</para>
      <para>...level 3</para>
      </section3>
   <para>...level 2</para>
   </section2>
   <section2>
   <para>...level 2</para>
   <para>...level 2</para>
   <para>...level 2</para>
   </section2>
</section1>
<section1>
<para>...level 1</para>
<para>...level 1</para>
<para>...level 1</para>
   <section2>
   <para>...level 2</para>
   <para>...level 2</para>
   <para>...level 2</para>
   </section2>
   <section2>
   <para>...level 2</para>
   <para>...level 2</para>
   <para>...level 2</para>
   </section2>
</section1>

First try:

foreach my $lines ( @splitCnt ) {

    if ( $lines =~ m/<section\s+/g ) {
        $opn++;
        $lines =~ s/<section\s+/<section$opn /i;
        $cls = $opn;
        $opn++;
    }
    elsif ( $lines =~ m/<\/section>/g ) {
        $opn = $opn - 1;
        $lines =~ s/<\/section>/<\/section$opn>/i;
    }

    $all_lines .= "$lines\n";
}

Second Try:

my ( $pre1, $match1, $post1 ) = "";

while ( $incnt =~ m/<section\s+[^>]*>/g ) {

    $pre1   = $`;
    $match1 = $&;
    $post1  = $';
    my $Opn = '1';
    my $Cls = "";

    $match1 =~ s/<section\s+/<section$Opn /gi;

    if ( $post1 =~ m/<section\s+/i ) {
        $Opn++;
        $post1 =~ s/<section\s+/<section$Opn /;
        $Opn = $Cls;
    }
    elsif ( $post1 =~ m/<\/section>/i ) {
        $post1 =~ s/<\/section/<\/section$Cls/;
    }

    $pre1 .= $match1;
    $incnt = $post1;

    print "$pre1\n";
    system 'pause';
}

if ( length $pre1 ) {
    $incnt = $pre1 . $post1;
}

Anyone could help on this one...

Upvotes: 2

Views: 65

Answers (2)

Borodin
Borodin

Reputation: 126742

Here's a variant using the XML::LibXML module. It simply finds all the section elements and caclulates their hierarchy by counting the number of slashes in the XPath expression to reach them

However, as others have said, this is a strange thing to want to do, and it very much sounds like its a bad solution to a different problem. If you explained the complete problem then we could help you a lot better

use strict;
use warnings;

use XML::LibXML;

my $doc = XML::LibXML->load_xml(IO => \*DATA);

for my $section ( $doc->findnodes('//section') ) {
    my $n = $section->nodePath =~ tr|/|| - 1;
    $section->setNodeName("section$n");
}

print $doc;

__DATA__
<root>
    <section>
        <para>...level 1</para>
        <para>...level 1</para>
        <para>...level 1</para>
        <section>
            <para>...level 2</para>
            <para>...level 2</para>
            <section>
                <para>...level 3</para>
                <para>...level 3</para>
                <para>...level 3</para>
            </section>
            <para>...level 2</para>
        </section>
        <section>
            <para>...level 2</para>
            <para>...level 2</para>
            <para>...level 2</para>
        </section>
    </section>
    <section>
        <para>...level 1</para>
        <para>...level 1</para>
        <para>...level 1</para>
        <section>
            <para>...level 2</para>
            <para>...level 2</para>
            <para>...level 2</para>
        </section>
        <section>
            <para>...level 2</para>
            <para>...level 2</para>
            <para>...level 2</para>
        </section>
    </section>
</root>

output

<?xml version="1.0"?>
<root>
    <section1>
        <para>...level 1</para>
        <para>...level 1</para>
        <para>...level 1</para>
        <section2>
            <para>...level 2</para>
            <para>...level 2</para>
            <section3>
                <para>...level 3</para>
                <para>...level 3</para>
                <para>...level 3</para>
            </section3>
            <para>...level 2</para>
        </section2>
        <section2>
            <para>...level 2</para>
            <para>...level 2</para>
            <para>...level 2</para>
        </section2>
    </section1>
    <section1>
        <para>...level 1</para>
        <para>...level 1</para>
        <para>...level 1</para>
        <section2>
            <para>...level 2</para>
            <para>...level 2</para>
            <para>...level 2</para>
        </section2>
        <section2>
            <para>...level 2</para>
            <para>...level 2</para>
            <para>...level 2</para>
        </section2>
    </section1>
</root>

Upvotes: 2

Sobrique
Sobrique

Reputation: 53498

Seriously - don't use regular expresssions for XML. It's bad news. There's a bunch of perfectly valid things you can do with XML that break regular expressions - so what you get is broken XML, and brittle code that may well break horribly one day, and no one will know quite why.

Use a parser. Personally - I like XML::Twig

It's quite easy to take and rename tags as you ask:

#!/usr/bin/env perl
use strict;
use warnings;

use XML::Twig;

sub process_section {
    my ( $section, $depth ) = @_;
    $depth++;
    $section->set_tag("section$depth");
    foreach my $subsection ( $section->children('section') ) {
        process_section( $subsection, $depth );
    }
}

my $twig = XML::Twig->new( 'pretty_print' => 'indented_a' );
$twig->parsefile ( 'your_file.xml' ); 

foreach my $section ( $twig->findnodes('section') ) {
    process_section( $section, 0 );
}

$twig->print;

I would also point out though - your initial question sounds like an XY problem. What are you trying to accomplish? It's often undesirable to do this sort of operation - changing tags based on hierarchy, because then ... well, then you couldn't do what I've just done - iterate recursively through the data structure.

Upvotes: 4

Related Questions