Since everyone here advised using the Perl module Mojo::DOM for this task, I am asking how to do it with that module.
I have this HTML code in a template:
some html content here top base
<!--block:first-->
some html content here 1 top
<!--block:second-->
some html content here 2 top
<!--block:third-->
some html content here 3a
some html content here 3b
<!--endblock-->
some html content here 2 bottom
<!--endblock-->
some html content here 1 bottom
<!--endblock-->
some html content here bottom base
What I want to do (please do not suggest using template modules again) is find the innermost block first:
<!--block:third-->
some html content here 3a
some html content here 3b
<!--endblock-->
then replace it with some html code, then find the second block:
<!--block:second-->
some html content here 2 top
<!--block:third-->
some html content here 3a
some html content here 3b
<!--endblock-->
some html content here 2 bottom
<!--endblock-->
then replace it with some html code, then find the outermost block (first):
<!--block:first-->
some html content here 1 top
<!--block:second-->
some html content here 2 top
<!--block:third-->
some html content here 3a
some html content here 3b
<!--endblock-->
some html content here 2 bottom
<!--endblock-->
some html content here 1 bottom
<!--endblock-->
I did not advise using Mojo::DOM for this task, as it's probably overkill, but ... you could.
The real answer is the one that I've already stated in other questions: use an existing framework such as Template::Toolkit. It's powerful, well tested, and speedy, since it allows for the caching of templates.
However, you want to roll your own templating solution. Any such solution should include a parsing phase, a validation phase, and an execution phase. We're only going to focus on the first two, as you've shared no real info on the last.
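For completeness, here is a rough sketch of what an execution phase could look like. It assumes the parser returns the nested structure built by the parsers in this answer: an array whose elements are either plain HTML strings or hashes of the form { block => NAME, content => [...] }. The render sub, the handler callback, and the hand-built $tree are all hypothetical names made up for illustration; the original question does not say how blocks should be replaced.

```perl
use strict;
use warnings;

# Hypothetical renderer: walks the parsed tree depth-first, so the innermost
# block is handled first, then its parent, and so on outward.
sub render {
    my ($nodes, $handler) = @_;
    my $out = '';
    for my $node (@$nodes) {
        if (ref $node eq 'HASH') {
            # Render the children first, then let the handler replace the block.
            my $inner = render($node->{content}, $handler);
            $out .= $handler->($node->{block}, $inner);
        } else {
            $out .= $node;    # plain HTML chunk
        }
    }
    return $out;
}

# Hand-built tree in the assumed shape, standing in for real parser output.
my $tree = [
    "top\n",
    {
        block   => 'first',
        content => [
            "1 top\n",
            { block => 'second', content => ["2\n"] },
            "1 bottom\n",
        ],
    },
    "bottom\n",
];

my @order;    # records the order in which blocks get replaced
my $html = render($tree, sub {
    my ($name, $inner) = @_;
    push @order, $name;
    return qq{<div class="$name">$inner</div>};    # assumed replacement
});

print "replacement order: @order\n";
print $html;
```

Because the recursion finishes a block's children before calling the handler for the block itself, the handler sees "second" before "first", which matches the inner-to-outer order asked for in the question.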
There is not going to be any real magic in Mojo::DOM. Its benefit and power is that it can fully and easily parse HTML, catching all of those potential edge cases. It can only help with the parsing phase of templating, though, as it's your own rules that decide the validation. In fact, it basically acts as a drop-in replacement for split in the earlier solution I provided to you. That's why it's probably too heavyweight a solution.
Because it's not hard to make the modifications, I've gone ahead and coded a full solution below. However, to make things more interesting, and to try to prove one of my larger points, it's time to share some Benchmark testing between the 3 available solutions:
1. Mojo::DOM for parsing, as demonstrated below.
2. split for parsing, as proposed by me in "Match nested html comment blocks regex".
3. A recursive regex, proposed by sln in "Perl replace nested blocks regex".
The code below contains all three solutions:
use strict;
use warnings;

use Benchmark qw(:all);
use Mojo::DOM;
use Data::Dump qw(dump dd);

my $content = do {local $/; <DATA>};

#dd parse_using_mojo($content);
#dd parse_using_split($content);
#dd parse_using_regex($content);

timethese(100_000, {
    'regex' => sub { parse_using_regex($content) },
    'mojo'  => sub { parse_using_mojo($content) },
    'split' => sub { parse_using_split($content) },
});

sub parse_using_mojo {
    my $content = shift;

    my $dom = Mojo::DOM->new($content);

    # Resulting Data Structure
    my @data = ();

    # Keep track of levels of content
    # - This is a throwaway data structure to facilitate the building of nested content
    my @levels = ( \@data );

    # Note: all_contents and node are the method names in the Mojolicious
    # release this was written against. I believe newer releases (6.0 onward)
    # renamed them to descendant_nodes and type, so adjust if needed.
    for my $html ($dom->all_contents->each) {
        if ($html->node eq 'comment') {
            # Start of Block - Go up to new level
            if ($html =~ m{^<!--\s*block:(.*)-->$}s) {
                #print +(' ' x @levels) ."<$1>\n"; # For debugging
                my $hash = {
                    block   => $1,
                    content => [],
                };
                push @{$levels[-1]}, $hash;
                push @levels, $hash->{content};
                next;

            # End of Block - Go down level
            } elsif ($html =~ m{^<!--\s*endblock\s*-->$}) {
                die "Error: Unmatched endblock found before " . dump($html) if @levels == 1;
                pop @levels;
                #print +(' ' x @levels) . "</$levels[-1][-1]{block}>\n"; # For debugging
                next;
            }
        }

        # Append HTML content, starting a new string after a block hash
        push @{$levels[-1]}, '' if !@{$levels[-1]} || ref $levels[-1][-1];
        $levels[-1][-1] .= $html;
    }
    die "Error: Unmatched start block: $levels[-2][-1]{block}" if @levels > 1;

    return \@data;
}

sub parse_using_split {
    my $content = shift;

    # Tokenize Content
    my @tokens = split m{<!--\s*(?:block:(.*?)|(endblock))\s*-->}s, $content;

    # Resulting Data Structure
    my @data = (
        shift @tokens, # First element of split is always HTML
    );

    # Keep track of levels of content
    # - This is a throwaway data structure to facilitate the building of nested content
    my @levels = ( \@data );

    while (@tokens) {
        # Tokens come in groups of 3. Two capture groups in split delimiter, followed by html.
        my ($block, $endblock, $html) = splice @tokens, 0, 3;

        # Start of Block - Go up to new level
        if (defined $block) {
            #print +(' ' x @levels) ."<$block>\n"; # For Debugging
            my $hash = {
                block   => $block,
                content => [],
            };
            push @{$levels[-1]}, $hash;
            push @levels, $hash->{content};

        # End of Block - Go down level
        } elsif (defined $endblock) {
            die "Error: Unmatched endblock found before " . dump($html) if @levels == 1;
            pop @levels;
            #print +(' ' x @levels) . "</$levels[-1][-1]{block}>\n"; # For Debugging
        }

        # Append HTML content
        push @{$levels[-1]}, $html;
    }
    die "Error: Unmatched start block: $levels[-2][-1]{block}" if @levels > 1;

    return \@data;
}

sub parse_using_regex {
    my $content = shift;

    my $href = {};

    ParseCore( $href, $content );

    return $href;
}

sub ParseCore
{
    my ($aref, $core) = @_;

    # Set the error mode on/off here ..
    my $BailOnError = 1;
    my $IsError = 0;

    my ($k, $v);
    while ( $core =~ /(?is)(?:((?&content))|(?><!--block:(.*?)-->)((?&core)|)<!--endblock-->|(<!--(?:block:.*?|endblock)-->))(?(DEFINE)(?<core>(?>(?&content)|(?><!--block:.*?-->)(?:(?&core)|)<!--endblock-->)+)(?<content>(?>(?!<!--(?:block:.*?|endblock)-->).)+))/g )
    {
        if (defined $1)
        {
            # CONTENT
            $aref->{content} .= $1;
        }
        elsif (defined $2)
        {
            # CORE
            $k = $2; $v = $3;
            $aref->{$k} = {};
            # $aref->{$k}->{content} = $v;
            # $aref->{$k}->{match} = $&;

            my $curraref = $aref->{$k};
            my $ret = ParseCore($aref->{$k}, $v);
            if ( $BailOnError && $IsError ) {
                last;
            }
            if (defined $ret) {
                $curraref->{'#next'} = $ret;
            }
        }
        else
        {
            # ERRORS
            print "Unbalanced '$4' at position = ", $-[0];
            $IsError = 1;

            # Decide to continue here ..
            # If BailOnError is set, just unwind recursion.
            # -------------------------------------------------
            if ( $BailOnError ) {
                last;
            }
        }
    }
    return $k;
}

__DATA__
some html content here top base
<!--block:first-->
<table border="1" style="color:red;">
<tr class="lines">
<td align="left" valign="<--valign-->">
<b>bold</b><a href="http://www.mewsoft.com">mewsoft</a>
<!--hello--> <--again--><!--world-->
some html content here 1 top
<!--block:second-->
some html content here 2 top
<!--block:third-->
some html content here 3 top
<!--block:fourth-->
some html content here 4 top
<!--block:fifth-->
some html content here 5a
some html content here 5b
<!--endblock-->
<!--endblock-->
some html content here 3a
some html content here 3b
<!--endblock-->
some html content here 2 bottom
<!--endblock-->
some html content here 1 bottom
<!--endblock-->
some html content here1-5 bottom base
some html content here 6-8 top base
<!--block:six-->
some html content here 6 top
<!--block:seven-->
some html content here 7 top
<!--block:eight-->
some html content here 8a
some html content here 8b
<!--endblock-->
some html content here 7 bottom
<!--endblock-->
some html content here 6 bottom
<!--endblock-->
some html content here 6-8 bottom base
The results for the simple template with 3 nested blocks:
Benchmark: timing 100000 iterations of mojo, regex, split...
mojo: 50 wallclock secs (50.36 usr + 0.00 sys = 50.36 CPU) @ 1985.78/s (n=100000)
regex: 14 wallclock secs (13.42 usr + 0.00 sys = 13.42 CPU) @ 7453.79/s (n=100000)
split: 2 wallclock secs ( 2.70 usr + 0.00 sys = 2.70 CPU) @ 37050.76/s (n=100000)
Normalizing to regex at 100% equates to mojo at 375% and split at 20%.
And for the more complicated template included in the above code:
Benchmark: timing 100000 iterations of mojo, regex, split...
mojo: 237 wallclock secs (236.61 usr + 0.02 sys = 236.62 CPU) @ 422.61/s (n=100000)
regex: 46 wallclock secs (47.25 usr + 0.00 sys = 47.25 CPU) @ 2116.31/s (n=100000)
split: 7 wallclock secs ( 6.65 usr + 0.00 sys = 6.65 CPU) @ 15046.64/s (n=100000)
Normalizing to regex at 100% equates to mojo at 501% and split at 14% (so split is about 7 times as fast as the regex).
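The normalization is just each CPU time divided by the regex CPU time. A minimal sketch of that arithmetic, using the CPU seconds from the second run above (the %cpu and %pct names are only for illustration):

```perl
use strict;
use warnings;

# CPU seconds copied from the benchmark of the more complicated template.
my %cpu = (mojo => 236.62, regex => 47.25, split => 6.65);

# Express each time as a percentage of the regex time.
my %pct = map { $_ => sprintf('%.0f', 100 * $cpu{$_} / $cpu{regex}) } keys %cpu;

printf "%-5s %4s%%\n", $_, $pct{$_} for sort keys %pct;
printf "regex time / split time: %.1f\n", $cpu{regex} / $cpu{split};
```

Benchmark also exports cmpthese (included in the :all tag), which prints a comparison chart directly if you would rather not do the division by hand.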
Does speed matter?
As demonstrated above, my split solution is clearly faster than any of the other solutions so far. This should not be a surprise: it's an extremely simple tool, and therefore it's fast.
In truth though, speed doesn't really matter.
Why not? Because whatever data structure you build from parsing and validating the template can be cached and reloaded each time you want to execute the template, until the template changes.
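One way such a cache could look, as a sketch only: it uses the core Storable module to serialize the parsed structure next to the template and re-parses only when the template file is newer than the cache file. The load_parsed name and the ".cache" suffix are assumptions made up for this example, and the parser is passed in as a code reference so any parser returning a plain nested data structure can be plugged in.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

# Hypothetical helper: return the parsed structure for $template_file,
# re-parsing only when the template is newer than its cache file.
sub load_parsed {
    my ($template_file, $parser) = @_;
    my $cache_file = "$template_file.cache";    # assumed naming scheme

    # Compare modification times (element 9 of stat). Timestamps usually have
    # one-second granularity, so an edit made in the same second as the cache
    # write would not be noticed by this simple check.
    if (-e $cache_file && (stat $cache_file)[9] >= (stat $template_file)[9]) {
        return retrieve($cache_file);
    }

    open my $fh, '<', $template_file or die "Can't open $template_file: $!";
    my $content = do { local $/; <$fh> };
    close $fh;

    my $data = $parser->($content);    # must return a reference
    store($data, $cache_file);
    return $data;
}

# Possible usage with a parser such as the split-based one:
# my $data = load_parsed('page.html', \&parse_using_split);
```

With this in place the parsing cost is paid once per template change rather than once per request, which is why the raw parser speed ends up mattering so little.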
Final decisions
Because speed doesn't matter with caching, what you should focus on instead is how readable the code is, how fragile it is, how easily it can be extended and debugged, etc.
As much as I appreciate a well-crafted regex, regexes tend to be fragile. Putting all of your parsing and validation logic into a single line of code is just asking for trouble.
That leaves either the split solution or mojo.
If you're caching like I described, you can actually choose either one without concern. The code I provided for each is essentially the same with slight variations, so it comes down to personal preference. The fact that split is roughly 19 to 36 times faster than mojo for the initial parse matters less than whether the code is more maintainable with an actual HTML parser.
Good luck choosing your final approach. I still have my fingers crossed you'll go with TT some day, but you'll pick your own poison :)