Reputation: 9
I am looking for a way in Perl to convert heading-tagged text into nested XML and write it to XML files. h1 is the parent level, h2 is a child level of h1, h3 is a child level of h2 (or a subchild of h1), etc. The input is shown first below, followed by the output I want.
<h1>1 Top level heading
Para text 1
Para text 2
<h2>1.1 Sub level heading
Para text 3
Para text 4
<h3>1.1.1 Sub sub level heading
Para text 5
Para text 6
<h2>Sub level heading 2
Para text 7
Para text 8
<h1>Top level heading
Para text 1
Para text 2
<h1>
<label>1</label>
<title>Top level heading</title>
<p>Para text 1</p>
<p>Para text 2</p>
<h2>
<label>1.1</label>
<title>Sub level heading</title>
<p>Para text 3</p>
<p>Para text 4</p>
<h3>
<label>1.1.1</label>
<title>Sub sub level heading</title>
<p>Para text 5</p>
<p>Para text 6</p>
</h3>
</h2>
<h2>Sub level heading (no number prefix)
<p>Para text 7</p>
<p>Para text 8</p>
</h2>
</h1>
<h1>Top level heading (no number prefix)
<p>Para text 9</p>
<p>Para text 10</p>
</h1>
I have tried a lot but could not work out the logic to achieve this.
Could someone help me get started?
@Borodin's code works well based on the above input snippet, but my actual requirement is as follows:
<art>Ärticle Title
<smry>1 Summåry
Summary paragragh 1...
Summary paragragh 2...
</smry>
<subjg>Subject Group Title
subject 1; subject 2; subject 3
</subjg>
<h1>1 Top level heading
Para text 1
<img gr1.jpg>
Para text 2
<h2>1.1 Sub level heading
Para text 3
Para text 4
<img gr2.jpg>
<h2>1.2 Sub level heading
Para text 5
Para text 6
<h3>1.1.1 Sub sub level heading
Para text 7
<fcap>Label 1: Text...
<grp line1.png>
Para text 8
<h3>1.1.2 Sub sub level heading
Para text 9
Para text 10
<h2>Sub level heading
<fcap>Text only...
<grp line2.png>
Para text 11
Para text 12
<h1>Top level heading
Para text 13
Para text 14
<h2>Sub level heading
Para text 15
Para text 16
<blst>Books
[1] Book name 1...
[2] Book name 2...
[3] Book name 3...
</blst>
<art>
...
<art>
...
<?xml version="1.0" encoding="UTF-8"?>
<article>
<front>
<title>Ärticle Title</title>
<summary>
<label>1</label>
<title>Summåry</title>
<p>Summary paragragh 1...</p>
<p>Summary paragragh 2...</p>
</summary>
<subj-group>
<title>Subject Group Title</title>
<sub>subject 1</sub>
<sub>subject 2</sub>
<sub>subject 3</sub>
</subj-group>
</front>
<body>
<h1 id="s1">
<label>1</label>
<title>Top level heading</title>
<p>Para text 1</p>
<img src="gr1.jpg" id="gr1"/>
<p>Para text 2</p>
<h2 id="s1a">
<label>1.1</label>
<title>Sub level heading</title>
<p>Para text 3</p>
<p>Para text 4</p>
<img src="gr2.jpg" id="gr2"/>
</h2>
<h2 id="s1b">
<label>1.2</label>
<title>Sub level heading</title>
<p>Para text 5</p>
<p>Para text 6</p>
<h3 id="s1b1">
<label>1.1.1</label>
<title>Sub sub level heading</title>
<p>Para text 7</p>
<figure id="grp1">
<label>Label 1:</label>
<cap><p>Text...</p></cap>
<graphic src="line1.png"/>
</figure>
<p>Para text 8</p>
</h3>
<h3 id="s1b2">
<label>1.1.2</label>
<title>Sub sub level heading</title>
<p>Para text 9</p>
<p>Para text 10</p>
</h3>
</h2>
<h2 id="s1c">
<title>Sub level heading 2</title>
<figure id="grp2">
<cap><p>Text only...</p></cap>
<graphic src="line2.png"/>
</figure>
<p>Para text 11</p>
<p>Para text 12</p>
</h2>
</h1>
<h1 id="s2">
<title>Top level heading</title>
<p>Para text 13</p>
<p>Para text 14</p>
<h2 id="s2a">
<title>Sub level heading 2</title>
<p>Para text 15</p>
<p>Para text 16</p>
</h2>
</h1>
</body>
<back>
<booklist>
<title>Books</title>
<bookname id="b1"><l>[1]</l><t>Book name 1...</t></bookname>
<bookname id="b2"><l>[2]</l><t>Book name 2...</t></bookname>
<bookname id="b3"><l>[3]</l><t>Book name 3...</t></bookname>
</booklist>
</back>
</article>
Could someone help me with this?
Upvotes: 0
Views: 117
Reputation: 31011
There is no need to print the XML yourself, including handling the indentation. I think a simpler solution is to use a dedicated module, e.g. XML::Writer.
Below is a reworked version of the program proposed by Borodin, using just XML::Writer.
use strict;
use warnings;
use autodie;
use XML::Writer;

my @stack;
my $wr = XML::Writer->new(OUTPUT => 'self', DATA_MODE => 1,
    DATA_INDENT => 2, UNSAFE => 1);

sub endTags {
    my $lev = shift;
    while (@stack and $stack[-1] >= $lev) {
        pop(@stack);
        $wr->endTag();
    }
}

my @blocks = do {
    open my $fh, '<', 'input.txt';
    local $/; # Slurp mode
    grep /\S/, split /<(h\d)>/, <$fh>;
};

$wr->startTag('main');
push @stack, 0; # Treat "main" as 0 level node

while (@blocks) {
    my $tag = shift @blocks; # Tag name
    my $text = shift @blocks; # Content (up to the next <h...>)
    my @text = split /\n/, $text;
    s/\A\s+|\s+\z//g for @text;
    die unless $tag =~ /h(\d)/;
    my $level = $1;
    endTags($level);
    push @stack, $level;
    $wr->startTag($tag);
    if ($text[0] =~ /^\b[\d.]+\b/) {
        my ($label, $title) = split ' ', shift(@text), 2;
        $wr->dataElement(label => $label);
        $wr->dataElement(title => $title);
    } else {
        $wr->characters(shift(@text) . ' (no number prefix)');
    }
    $wr->dataElement('p' => $_) for @text;
}

endTags(0);
my $xml = $wr->end();
print $xml;
As you can see, some fragments are identical (no need to reinvent the wheel), but the closing of XML tags, for example, was moved to a dedicated function that is called from two places.
This program also meets the requirement for well-formed XML that the document has a single root element (here I called it main).
I had to set the UNSAFE option in XML::Writer; otherwise it complains about mixed content (an element containing both text nodes and child elements).
A neat trick is that I also used the endTags function to close the main tag. This is possible because XML::Writer keeps track of the tag names opened by the user, so the endTag method does not actually require the name of the tag to be closed.
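The reason a nameless end tag can work is easy to illustrate without XML::Writer. The following is a minimal core-Perl sketch of the idea, not XML::Writer's actual implementation; the helpers start_tag, end_tag and end_tags and the arrays @names, @levels and @out are hypothetical names chosen for the illustration. The writer keeps its own stack of open tag names, and giving the root element level 0 lets the same routine close it at the end.

```perl
use strict;
use warnings;

# Hypothetical mini-writer: it remembers the names of open tags itself,
# so end_tag() needs no argument.
my (@names, @levels, @out);

sub start_tag {
    my ($name, $level) = @_;
    push @out, ('  ' x scalar @names) . "<$name>";
    push @names,  $name;
    push @levels, $level;
}

sub end_tag {
    my $name = pop @names;    # the writer, not the caller, supplies the name
    pop @levels;
    push @out, ('  ' x scalar @names) . "</$name>";
}

# Close every open element whose level is >= the requested level.
sub end_tags {
    my $level = shift;
    end_tag() while @levels and $levels[-1] >= $level;
}

start_tag(main => 0);         # the root element acts as level 0
for my $level (1, 2, 3, 2, 1) {
    end_tags($level);         # close siblings and deeper elements first
    start_tag("h$level", $level);
}
end_tags(0);                  # closes the last h1 and then main

print "$_\n" for @out;
```

Running it prints a main root wrapping properly nested h1/h2/h3 elements; the final end_tags(0) call emits the last closing h1 tag and then the closing main tag.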
Upvotes: -1
Reputation: 126742
Having said it was quite difficult, I thought the least I could do was to offer a solution!
I've added some comments and I hope it's pretty much self-explanatory.
Note that it ignores all HTML tags except for the <h1> etc., and I haven't made an attempt to add the blank lines you show, as there doesn't seem to be any logic behind them.
I'm wondering if this is really what you want, as putting multiple paragraphs inside a <h1> element is rather odd. Anyway, I hope this helps.
Note for the inquisitive:
I am pretty sure that this can be done with just a scalar count of preceding levels. I started off coding that way but ended up using a stack as it helped my thinking. However, because @stack only ever contains 1..3 etc., I think it must be sufficient to use a scalar that is equivalent to the number of elements in @stack, and to increment and decrement it in place of pushing and popping the array.
use strict;
use warnings 'all';
use autodie;

# Read the file and split it on the header tags
my @blocks = do {
    open my $fh, '<', 'input.html';
    local $/;
    grep /\S/, split /(<h\d>)/, <$fh>;
};

my @stack;

while ( @blocks ) {

    my $tag = shift @blocks;
    my $text = shift @blocks;
    my @text = split /\n/, $text;
    s/\A\s+|\s+\z//g for @text; # Trim text lines

    die unless $tag =~ /h(\d+)/; # Check well-formed tag
    my $level = $1; # and grab hierarchy level

    # Close all outstanding tags until we reach this level
    while ( @stack and $stack[-1] >= $level ) {
        my $l = $stack[-1];
        print indent($l-1), "</h$l>\n";
        pop @stack;
    }

    # Opening tag, on its own or with label and title if they're there
    if ( $text[0] =~ /^\b[\d.]+\b/ ) {
        print indent($level-1), $tag, "\n";
        my ($label, $title) = split ' ', shift(@text), 2;
        print indent($level), $_, "\n" for
            "<label>$label</label>",
            "<title>$title</title>";
    }
    else {
        print indent($level-1), $tag, shift @text, "\n";
    }

    # Print the remaining text lines as paragraphs
    print indent($level), $_, "\n" for map { "<p>$_</p>" } @text;

    # Remember that this tag needs closing
    push @stack, $level;
}

# Close all outstanding tags, innermost first
while ( @stack ) {
    my $l = $stack[-1];
    print indent($l-1), "</h$l>\n";
    pop @stack; # pop, not shift: $l was read from the top of the stack
}

sub indent {
    my $n = shift;
    ' ' x $n;
}
<h1>
<label>1</label>
<title>Top level heading</title>
<p>Para text 1</p>
<p>Para text 2</p>
<h2>
<label>1.1</label>
<title>Sub level heading</title>
<p>Para text 3</p>
<p>Para text 4</p>
<h3>
<label>1.1.1</label>
<title>Sub sub level heading</title>
<p>Para text 5</p>
<p>Para text 6</p>
</h3>
</h2>
<h2>Sub level heading 2
<p>Para text 7</p>
<p>Para text 8</p>
</h2>
</h1>
<h1>Top level heading
<p>Para text 1</p>
<p>Para text 2</p>
</h1>
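The scalar-counter idea from the note for the inquisitive can be sketched as follows. This is only an illustration under a stated assumption: heading levels never skip (no <h3> directly under an <h1>), so the number of open elements always equals the current level. The function name nest_headings, the two-space indent and the sample text are hypothetical choices, not part of the program above.

```perl
use strict;
use warnings;

# Convert "<hN>..." text to nested XML while tracking only the current depth.
# Assumption: levels never skip, so $depth always equals the open level.
sub nest_headings {
    my ($input) = @_;
    my @blocks = grep { /\S/ } split /(<h\d>)/, $input;
    my ($depth, $out) = (0, '');

    # Emit the closing tag for the innermost open heading.
    my $close = sub {
        $out .= ('  ' x ($depth - 1)) . "</h$depth>\n";
        $depth--;
    };

    while (@blocks) {
        my ($tag, $text) = splice @blocks, 0, 2;
        my ($level) = $tag =~ /\A<h(\d)>\z/
            or die "Expected a heading tag, got: $tag";

        $close->() while $depth >= $level;    # decrement instead of pop
        die "Level skipped before $tag" unless $level == $depth + 1;
        $depth++;                             # increment instead of push

        my @lines = split /\n/, $text // '';
        s/\A\s+|\s+\z//g for @lines;          # trim each line
        @lines = grep { length } @lines;      # drop empty lines

        my ($outer, $inner) = ('  ' x ($depth - 1), '  ' x $depth);
        my $head = shift(@lines) // '';
        if ($head =~ /^([\d.]+)\s+(.+)/) {
            $out .= "$outer$tag\n$inner<label>$1</label>\n$inner<title>$2</title>\n";
        }
        else {
            $out .= "$outer$tag$head\n";
        }
        $out .= "$inner<p>$_</p>\n" for @lines;
    }

    $close->() while $depth > 0;              # close whatever is still open
    return $out;
}

my $sample = <<'END';
<h1>1 Top level heading
Para text 1
<h2>1.1 Sub level heading
Para text 3
<h2>Sub level heading 2
Para text 7
<h1>Top level heading
Para text 1
END

print nest_headings($sample);
```

If the input can skip levels, the counter no longer identifies which tag to close, and a stack of levels as in the program above is the safer choice.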
Upvotes: 2