Anamika

Reputation: 9

Translating malformed HTML to hierarchical XML

I am looking for a way in Perl to convert the following input into nested XML and write the result to XML files.

Example input

<h1>1 Top level heading
Para text 1
Para text 2
<h2>1.1 Sub level heading
Para text 3
Para text 4
<h3>1.1.1 Sub sub level heading
Para text 5
Para text 6
<h2>Sub level heading 2
Para text 7
Para text 8
<h1>Top level heading
Para text 1
Para text 2

Required output

<h1>
 <label>1</label>
 <title>Top level heading</title>
 <p>Para text 1</p>
 <p>Para text 2</p>
 
 <h2>
  <label>1.1</label>
  <title>Sub level heading</title>
  <p>Para text 3</p>
  <p>Para text 4</p>

  <h3>
    <label>1.1.1</label>
    <title>Sub sub level heading</title>
    <p>Para text 5</p>
    <p>Para text 6</p>
  </h3>
 </h2>

 <h2>Sub level heading (no number prefix)
  <p>Para text 7</p>
  <p>Para text 8</p>
 </h2>
</h1>

<h1>Top level heading (no number prefix)
<p>Para text 9</p>
<p>Para text 10</p>
</h1>

I have tried a lot of approaches but could not work out the logic to achieve this.

Could someone help me to get started?

Update

@Borodin's code works well based on the above input snippet, but my actual requirement is as follows:

Input.txt

<art>Ärticle Title
<smry>1 Summåry
 Summary paragragh 1...
 Summary paragragh 2...
</smry>
<subjg>Subject Group Title
 subject 1; subject 2; subject 3
</subjg>

<h1>1 Top level heading
  Para text 1
  <img gr1.jpg>
  Para text 2

  <h2>1.1 Sub level heading
    Para text 3
    Para text 4
    <img gr2.jpg>

  <h2>1.2 Sub level heading
    Para text 5
    Para text 6

   <h3>1.1.1 Sub sub level heading
     Para text 7
     <fcap>Label 1: Text...
     <grp line1.png>
     Para text 8

   <h3>1.1.2 Sub sub level heading
     Para text 9
     Para text 10
  <h2>Sub level heading
    <fcap>Text only...
    <grp line2.png>
    Para text 11
    Para text 12

<h1>Top level heading
 Para text 13
 Para text 14

  <h2>Sub level heading
    Para text 15
    Para text 16

<blst>Books
 [1] Book name 1...
 [2] Book name 2...
 [3] Book name 3...
</blst>

<art>
...
<art>
...

Required Output.xml

<?xml version="1.0" encoding="UTF-8"?>
<article>
  <front>
    <title>&#x00C4;rticle Title</title>
    <summary>
      <label>1</label>
      <title>Summ&#x00E5;ry</title>
      <p>Summary paragragh 1...</p>
      <p>Summary paragragh 2...</p>
    </summary>
    <subj-group>
      <title>Subject Group Title</title>
      <sub>subject 1</sub>
      <sub>subject 2</sub>
      <sub>subject 3</sub>
    </subj-group>
  </front>
  <body>
    <h1 id="s1">
      <label>1</label>
      <title>Top level heading</title>
      <p>Para text 1</p>
      <img src="gr1.jpg" id="gr1"/>
      <p>Para text 2</p>
      <h2 id="s1a">
        <label>1.1</label>
        <title>Sub level heading</title>
        <p>Para text 3</p>
        <p>Para text 4</p>
        <img src="gr2.jpg" id="gr2"/>
      </h2>
      <h2 id="s1b">
        <label>1.2</label>
        <title>Sub level heading</title>
        <p>Para text 5</p>
        <p>Para text 6</p>
        <h3 id="s1b1">
          <label>1.1.1</label>
          <title>Sub sub level heading</title>
          <p>Para text 7</p>
          <figure id="grp1">
            <label>Label 1:</label>
            <cap><p>Text...</p></cap>
            <graphic src="line1.png"/>
          </figure>
          <p>Para text 8</p>
        </h3>
        <h3 id="s1b2">
          <label>1.1.2</label>
          <title>Sub sub level heading</title>
          <p>Para text 9</p>
          <p>Para text 10</p>
        </h3>
      </h2>
      <h2 id="s1c">
        <title>Sub level heading 2</title>
        <figure id="grp2">
          <cap><p>Text only...</p></cap>
          <graphic src="line2.png"/>
        </figure>
        <p>Para text 11</p>
        <p>Para text 12</p>
      </h2>
    </h1>
    <h1 id="s2">
      <title>Top level heading</title>
      <p>Para text 13</p>
      <p>Para text 14</p>
      <h2 id="s2a">
        <title>Sub level heading 2</title>
        <p>Para text 15</p>
        <p>Para text 16</p>
      </h2>
    </h1>
  </body>
  <back>
    <booklist>
      <title>Books</title>
      <bookname id="b1"><l>[1]</l><t>Book name 1...</t></bookname>
      <bookname id="b2"><l>[2]</l><t>Book name 2...</t></bookname>
      <bookname id="b3"><l>[3]</l><t>Book name 3...</t></bookname>
    </booklist>
  </back>
</article>

Could someone help me with this?

Upvotes: 0

Views: 117

Answers (2)

Valdi_Bo

Reputation: 31011

There is no need to print XML on your own, including the indentation handling. I think a simpler solution is to use a dedicated module, e.g. XML::Writer.

Below is a reworked version of the program proposed by Borodin, using XML::Writer for all of the output.

use strict; use warnings; use autodie; use XML::Writer;

my @stack;
my $wr = XML::Writer->new(OUTPUT => 'self', DATA_MODE => 1,
    DATA_INDENT => 2, UNSAFE => 1);

sub endTags {
    my $lev = shift;
    while (@stack and $stack[-1] >= $lev) {
        pop(@stack);
        $wr->endTag();
    }
}

my @blocks = do {
    open my $fh, '<', 'input.txt';
    local $/;   # Slurp mode
    grep /\S/, split /<(h\d)>/, <$fh>;
};
$wr->startTag('main');
push @stack, 0;         # Treat "main" as 0 level node
while (@blocks) {
    my $tag  = shift @blocks;   # Tag name
    my $text = shift @blocks;   # Content (up to the next <h...>)
    my @text = split /\n/, $text;
    s/\A\s+|\s+\z//g for @text;
    die unless $tag =~ /h(\d)/;
    my $level = $1;
    endTags($level);
    push @stack, $level;
    $wr->startTag($tag);
    if ($text[0] =~ /^\b[\d.]+\b/) {
        my ($label, $title) = split ' ', shift(@text), 2;
        $wr->dataElement(label => $label);
        $wr->dataElement(title => $title);
    } else {
        $wr->characters(shift(@text) . ' (no number prefix)');
    }
    $wr->dataElement('p' => $_) for @text;
}
endTags(0);
my $xml = $wr->end();
print $xml;

As you can see, some fragments are identical (no need to reinvent the wheel), but the closing of XML tags, for example, was moved to a dedicated function that is called twice.

This program also satisfies the XML well-formedness requirement that a document must have a single root element (here I called it main).

I had to set the UNSAFE option in XML::Writer, otherwise it complains about mixed content (an element containing both text nodes and child elements).

A quite clever trick is that I also used the endTags function to end the main tag. This is possible because XML::Writer keeps track of the tag names opened by the user, so the endTag method does not actually require the name of the tag to be closed.

Upvotes: -1

Borodin

Reputation: 126742

Having said it was quite difficult, I thought the least I could do was to offer a solution!

I've added some comments and I hope it's pretty much self-explanatory

Note that it ignores all HTML tags except for <h1>, <h2> etc. I haven't made an attempt to add the blank lines you show, as there doesn't seem to be any logic behind them

I'm wondering if this is really what you want, as putting multiple paragraphs inside a <h1> element is rather odd. Anyway, I hope this helps


Note for the inquisitive:

I am pretty sure that this can be done with just a scalar count of the currently open levels. I started off coding it that way but ended up using a stack as it helped my thinking. As long as no heading level is skipped, @stack only ever contains 1, 2, 3 etc. in order, so I think it must be sufficient to use a scalar that is equivalent to the number of elements in @stack, and to increment and decrement it in place of pushing and popping the array

use strict;
use warnings 'all';
use autodie;

# Read the file and split it on the header tags

my @blocks = do {
    open my $fh, '<', 'input.html';
    local $/;
    grep /\S/, split /(<h\d>)/, <$fh>;
};

my @stack;

while ( @blocks ) {

    my $tag  = shift @blocks;
    my $text = shift @blocks;
    my @text = split /\n/, $text;

    s/\A\s+|\s+\z//g for @text;  # Trim text lines

    die unless $tag =~ /h(\d+)/; # Check well-formed tag
    my $level = $1;              # and grab hierarchy level

    # Close all outstanding tags until we reach this level
    while ( @stack and $stack[-1] >= $level ) {
        my $l = $stack[-1];
        print indent($l-1), "</h$l>\n";
        pop @stack;
    }

    # Opening tag, on its own or with label and title if they're there
    if ( $text[0] =~ /^\b[\d.]+\b/ ) {

        print indent($level-1), $tag, "\n";

        my ($label, $title) = split ' ', shift(@text), 2;

        print indent($level), $_, "\n" for
                "<label>$label</label>",
                "<title>$title</title>";
    }
    else {
        print indent($level-1), $tag, shift @text, "\n";
    }

    # Print the remaining text lines as paragraphs                
    print indent($level), $_, "\n" for map { "<p>$_</p>" } @text;

    # Remember that this tag needs closing
    push @stack, $level;
}

# Close all outstanding tags
while ( @stack ) {
    my $l = $stack[-1];
    print indent($l-1), "</h$l>\n";
    pop @stack;
}


sub indent {
    my $n = shift;
    '  ' x $n;
}

output

<h1>
  <label>1</label>
  <title>Top level heading</title>
  <p>Para text 1</p>
  <p>Para text 2</p>
  <h2>
    <label>1.1</label>
    <title>Sub level heading</title>
    <p>Para text 3</p>
    <p>Para text 4</p>
    <h3>
      <label>1.1.1</label>
      <title>Sub sub level heading</title>
      <p>Para text 5</p>
      <p>Para text 6</p>
    </h3>
  </h2>
  <h2>Sub level heading 2
    <p>Para text 7</p>
    <p>Para text 8</p>
  </h2>
</h1>
<h1>Top level heading
  <p>Para text 1</p>
  <p>Para text 2</p>
</h1>
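
For the inquisitive, the scalar-counter idea from the note above might look roughly like the sketch below. It is only a sketch under stated assumptions: nest_headings is a hypothetical name, the function builds and returns a string instead of printing, and it assumes heading levels are never skipped (an <h3> never directly follows an <h1>), dying if they are.

```perl
use strict;
use warnings;

# Two spaces of indentation per nesting level
sub indent {
    my ($n) = @_;
    return '  ' x $n;
}

# Hypothetical helper: same logic as the program above, but the stack is
# replaced by $depth, the number of heading elements currently open.
# ASSUMPTION: heading levels never skip, so the open elements are always
# h1 .. h$depth and the innermost one is h$depth.
sub nest_headings {
    my ($input) = @_;

    # Split on the heading tags; the capture group keeps the tags in the list
    my @blocks = grep /\S/, split /(<h\d>)/, $input;

    my $depth = 0;
    my $out   = '';

    while (@blocks) {
        my $tag  = shift @blocks;
        my $text = shift(@blocks) // '';
        my @text = split /\n/, $text;
        s/\A\s+|\s+\z//g for @text;    # Trim text lines

        my ($level) = $tag =~ /\A<h(\d)>\z/ or die "Unexpected tag: $tag\n";

        # Close every open element at this level or deeper
        while ( $depth >= $level ) {
            $out .= indent($depth - 1) . "</h$depth>\n";
            $depth--;
        }

        # Make the assumption explicit instead of emitting bad nesting
        die "Skipped heading level before <h$level>\n" if $level > $depth + 1;
        $depth = $level;

        # Opening tag, with label and title if the heading is numbered
        my $head = shift(@text) // '';
        if ( $head =~ /^\b[\d.]+\b/ ) {
            my ($label, $title) = split ' ', $head, 2;
            $out .= indent($level - 1) . "<h$level>\n";
            $out .= indent($level) . "<label>$label</label>\n";
            $out .= indent($level) . "<title>$title</title>\n";
        }
        else {
            $out .= indent($level - 1) . "<h$level>$head\n";
        }

        # Remaining text lines become paragraphs
        $out .= indent($level) . "<p>$_</p>\n" for @text;
    }

    # Close whatever is still open, innermost first
    while ( $depth > 0 ) {
        $out .= indent($depth - 1) . "</h$depth>\n";
        $depth--;
    }

    return $out;
}

print nest_headings("<h1>1 Top\nPara 1\n<h2>1.1 Sub\nPara 2\n<h1>Next\nPara 3\n");
```

The counter works only because the innermost open element is always h$depth. If the input can skip levels, the stack version is the safer choice, since it remembers which levels were actually opened.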

Upvotes: 2
