user6550864

Escaping special Tags for HTML

I'm pretty new to Perl and recently wrote a converter for our SharePoint. It basically takes our old wiki's HTML pages and converts them to ASPX pages with SP classes and so on.

Everything works fine until someone has used <tags> as text. Here's an example from the HTML of the old TWiki:

<li> Moduldateinamen haben folgendes Format <code> <Content>_<Type>_<Name>_</code> ...

So <Content>, <Type> and <Name> are text wrapped in a <code> tag.

How it looks in the old wiki: [screenshot]

How it looks after converting to ASPX and uploading to SharePoint: [screenshot]

You can see that SP tried to interpret them as tags rather than as text, so they are not displayed.

For SharePoint pages I need escaped HTML markup between the SP ASPX markup, so I've changed e.g. < to &lt; and so on via regex.

However the example snippet I posted should look like this in the ASPX:

&lt;li&gt; Moduldateinamen haben folgendes Format &lt;code&gt;openTagContentclosingTag_openTagTypeclosingTag_openTagNameclosingTag_&lt;/code&gt;

So < converts to openTag and > to closingTag, but only for the actual content inside this <li> tag. Later this needs to be changed by hand (I don't see another way).

How can I make sure that only "text" tags get escaped with openTag/closingTag, while "real" HTML markup gets escaped in the usual manner, i.e. < to &lt;?

Upvotes: 0

Views: 124

Answers (2)

Alexandr Evstigneev

Reputation: 723

If I understand the question correctly, all you need is a regex like:

# escape < and > only between <code> and </code>; /s lets . span newlines
$page =~ s{(?<=<code>)(.+?)(?=</code>)}
          {
              my $text = $1;
              $text =~ s/([<>])/ $1 eq '<' ? '&lt;' : '&gt;' /ge;
              $text;
          }gsxe;
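
For a quick check, here is a minimal, self-contained sketch that runs this kind of substitution on the sample line from the question. The variable name `$page` holding a single line, and the `/s` flag (so that `.` would also match newlines in multi-line blocks), are assumptions for illustration, not something the question specifies.

```perl
use strict;
use warnings;

# Sample line from the question (German text kept as-is).
my $page = '<li> Moduldateinamen haben folgendes Format '
         . '<code> <Content>_<Type>_<Name>_</code> ...';

# Escape < and > only for text enclosed by <code> ... </code>.
# The lookarounds keep the <code> and </code> tags themselves untouched.
$page =~ s{(?<=<code>)(.+?)(?=</code>)}
          {
              my $text = $1;    # copy, because the inner s/// resets $1
              $text =~ s/([<>])/$1 eq '<' ? '&lt;' : '&gt;'/ge;
              $text;
          }gse;

print "$page\n";
```

Note that this only touches text that sits between a `<code>` and its closing `</code>`; tag-looking words outside of such a pair are left alone.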

Upvotes: 1

simbabque

Reputation: 54333

I fiddled around and found a solution that might work. It does work with example data, but I don't know how complicated your actual documents are.

Consider this example input.

my $html = <<'HTML';
<ul>
    <li><code>tecdata\de\modules:</code> Testbausteine blafasel</li>
    <li>
        <ul>
            <li> Moduldateinamen haben folgendes Format <code> <Content>_<Type>_<Name></code> (Ausnahme: <code>ti_<Name>)</li>
            <li>
                <ul>
                    <li><Content> bezeichnet den semantischen Inhalt</li>
                    <li><code>ti_</code> diese ganzen Listen sind verwirrend
                </ul>
            </li>
        </ul>
    </li>
<p>And more stuff here...</p>";
HTML

And the following program.

# we will save the tag-looking words here
my %non_tags;

# (8) replace html with the concatenated result
$html = join '', map {
    my $string = $_;

    # (2) find where the end-tag is
    my $pos = index($string, '</code>');
    if ($pos >= 0) {
        # (3) take string until the end-tag
        my $escaped = substr( $string, 0, $pos );

        # (4) remember the tag-looking words
        $non_tags{$_}++ foreach $escaped =~ m/<([^>]+)>/g;

        # (5) html-escape the <>
        $escaped =~ s/</&lt;/g;
        $escaped =~ s/>/&gt;/g;

        # (6) overwrite the not-escaped part with the newly escaped string
        substr( $string, 0, $pos ) = $escaped;
    }
    $string; # (7) hand the (possibly modified) chunk back to map
} split m/(<code>)/, $html; # (1) split on <code> but also take that delimiter

# html-escape those tag-looking words all over the text;
# \Q...\E keeps regex metacharacters inside a word from being interpreted
foreach my $word ( keys %non_tags ) {
    $html =~ s/<(\Q$word\E)>/&lt;$1&gt;/g;
}

print $html;

The output from that is the following.

<ul>
    <li><code>tecdata\de\modules:</code> Testbausteine blafasel</li>
    <li>
        <ul>
            <li> Moduldateinamen haben folgendes Format <code> &lt;Content&gt;_&lt;Type&gt;_&lt;Name&gt;</code> (Ausnahme: <code>ti_&lt;Name&gt;)</li>
            <li>
                <ul>
                    <li>&lt;Content&gt; bezeichnet den semantischen Inhalt</li>
                    <li><code>ti_</code> diese ganzen Listen sind verwirrend
                </ul>
            </li>
        </ul>
    </li>
<p>And more stuff here...</p>";

As you can see, it HTML-escaped all tag-like words inside <code>...</code> blocks. It also remembered what those words were, and then replaced further occurrences of those words that still look like tags. That way, we do not mess up actual HTML, as long as none of the remembered words is also the name of a real tag.
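
If a real tag name such as `b` ever ends up among the remembered words, one possible guard is to skip known element names in the second pass. The following is only a sketch: the allow-list `%real_tags` and the sample data are hypothetical and would need to be adapted to the tags your pages really use.

```perl
use strict;
use warnings;

# Hypothetical allow-list of element names that must stay real markup.
my %real_tags = map { $_ => 1 } qw(b i em strong ul ol li p code);

# Tag-looking words as they might have been collected from <code> blocks.
my %non_tags = ( Content => 1, Name => 1, b => 1 );

# Made-up sample line mixing a real <b> tag with tag-looking text.
my $html = '<li><Content> und <b>fett</b> <Name></li>';

# Escape only the remembered words that are not known element names.
foreach my $word ( grep { !$real_tags{ lc $_ } } keys %non_tags ) {
    $html =~ s/<(\Q$word\E)>/&lt;$1&gt;/g;
}

print "$html\n";
```

Here `<b>` survives as markup even though `b` was remembered, while `<Content>` and `<Name>` are escaped.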

This whole approach is very naive, but since this is a one-time task, a crude solution that works is better than no solution.
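
If you really want the openTag/closingTag placeholders from the question, a two-pass substitution might be enough. This is a rough sketch under a few assumptions: the page is held in a variable I call `$page`, every `<code>` has a matching `</code>`, and only `<` and `>` need handling (no `&`).

```perl
use strict;
use warnings;

# Sample line modelled on the one in the question, with a closing </li> added.
my $page = '<li> Moduldateinamen haben folgendes Format '
         . '<code> <Content>_<Type>_<Name>_</code></li>';

# Pass 1: inside <code> ... </code>, swap < and > for the placeholder words.
$page =~ s{(?<=<code>)(.+?)(?=</code>)}
          {
              my $text = $1;
              $text =~ s/</openTag/g;
              $text =~ s/>/closingTag/g;
              $text;
          }gse;

# Pass 2: whatever < and > are left belong to real markup; entity-escape them.
$page =~ s/</&lt;/g;
$page =~ s/>/&gt;/g;

print "$page\n";
```

The order matters: the placeholders have to go in first, because after the second pass the text tags and the real tags can no longer be told apart.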

Upvotes: 1
