Escaping special Tags for HTML

Question

I'm pretty new to Perl and recently wrote a converter for our SharePoint. It basically takes our old wiki's HTML pages and converts them to ASPX pages with SP classes and so on.

Everything works fine until someone used tag-like text such as <Content> as plain text. Here's an example from the HTML of the old TWiki:

<li> Moduldateinamen haben folgendes Format <code><Content>_<Type>_<Name>_</code> ...

So <Content> is text wrapped in a <code> tag.



How it looks in old wiki:



How it looks after converting to aspx and uploaded to SharePoint:


You can see that SP tried to interpret them as tags (sure), not as text, and therefore they won't be displayed.

For SharePoint pages I need escaped HTML markup between the SP ASPX markup.
So I've changed e.g. < to &lt; and so on via regex.

However the example snippet I posted should look like this in the ASPX:

&lt;li&gt; Moduldateinamen haben folgendes Format &lt;code&gt;openTagContentclosingTag_openTagTypeclosingTag_openTagNameclosingTag_&lt;/code&gt;


So < is converted to openTag and > to closingTag, but only for the actual content inside the <code> tag. Later this needs to be changed by hand (I don't see another way).

How can I achieve that only "text" tags get escaped with openTag/closingTag, while "real" HTML markup gets escaped in this manner: < to &lt;?

simbabque · Accepted Answer

I fiddled around and found a solution that might work. It does work with example data, but I don't know how complicated your actual documents are.

Consider this example input.

my $html = <<'HTML';

    tecdata\de\modules: Testbausteine blafasel
    
        
            <li> Moduldateinamen haben folgendes Format <code><Content>_<Type>_<Name></code> (Ausnahme: ti_<Name>)

            
                
                    <Content> bezeichnet den semantischen Inhalt
                    ti_ diese ganzen Listen sind verwirrend
                
            
        

    

And more stuff here...";
HTML



And the following program.

# we will save the tag-looking words here
my %non_tags;

# (8) replace html with the concatenated result
$html = join '', map {
    my $string = $_;

    # (2) find where the end-tag is
    my $pos = index($string, '</code>');
    if ($pos >= 0) {
        # (3) take string until the end-tag
        my $escaped = substr( $string, 0, $pos );

        # (4) remember the tag-looking words
        $non_tags{$_}++ foreach $escaped =~ m/<([^>]+)>/g;

        # (5) html-escape the <>
        $escaped =~ s/</&lt;/g;
        $escaped =~ s/>/&gt;/g;

        # (6) overwrite the not-escaped part with the newly escaped string
        substr( $string, 0, $pos ) = $escaped;
    }
    # (7) hand the (possibly modified) piece back to map
    $string;
} split m/(<code>)/, $html; # (1) split on <code> but also take that delimiter

# html-escape those tag-looking words all over the text
# (\Q...\E keeps regex metacharacters in a word from being interpreted)
foreach my $word ( keys %non_tags) {
    $html =~ s/<(\Q$word\E)>/&lt;$1&gt;/g ;
}

print $html;


The output from that is the following.


    tecdata\de\modules: Testbausteine blafasel
    
        
            <li> Moduldateinamen haben folgendes Format <code>&lt;Content&gt;_&lt;Type&gt;_&lt;Name&gt;</code> (Ausnahme: ti_&lt;Name&gt;)

            
                
                    &lt;Content&gt; bezeichnet den semantischen Inhalt
                    ti_ diese ganzen Listen sind verwirrend
                
            
        

    

And more stuff here...";


As you can see, it HTML-escaped all tag-like words inside of <code> tokens. It also remembered what those words were, and then replaced further occurrences of those words that still look like tags. That way, we do not mess up actual HTML.
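For comparison, the same two passes can probably be written more compactly with a single s///e over each <code>...</code> span. This is only a sketch, not the code from the answer: the helper name escape_text_tags and the sample string are made up for illustration, and it assumes <code> elements carry no attributes and are not nested.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original answer): escape tag-looking
# words inside <code>...</code>, then wherever the same words reappear.
sub escape_text_tags {
    my ($html) = @_;
    my %non_tags;

    # Pass 1: rewrite only the body of each <code> element.
    # The captures are copied first because the inner regexes reset $1..$3.
    $html =~ s{(<code>)(.*?)(</code>)}{
        my ($open, $body, $close) = ($1, $2, $3);
        $non_tags{$_}++ for $body =~ m/<([^>]+)>/g;
        $body =~ s/</&lt;/g;
        $body =~ s/>/&gt;/g;
        $open . $body . $close;
    }gse;

    # Pass 2: escape the remembered words everywhere else in the document.
    for my $word (keys %non_tags) {
        $html =~ s/<\Q$word\E>/&lt;$word&gt;/g;
    }
    return $html;
}

# Made-up sample: one text tag inside <code>, one repeated outside of it.
my $sample = '<li>Format <code><Content>_<Type></code> (see <Content>)</li>';
my $result = escape_text_tags($sample);
print "$result\n";
```

The real <li> and <code> tags are left alone, while <Content> is escaped both inside and outside the <code> span. The same caveat applies as before: if someone wrote a real tag name such as <li> inside <code>, the second pass would escape every real <li> as well.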

This is a very naive approach, but since this is a one-time task, a crap solution that works is better than no solution.
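The question actually asked for openTag/closingTag placeholders on the text tags and entity-escaping for the real markup. A sketch of that variant could look like the following. The helper name to_sharepoint_markup is hypothetical, escaping & as well is my assumption about what "and so on" covers, and text tags outside of <code> (such as ti_<Name>) are not handled here.

```perl
use strict;
use warnings;

# Hypothetical helper: mark text tags inside <code> with the placeholders
# from the question, then entity-escape all remaining (real) markup.
sub to_sharepoint_markup {
    my ($html) = @_;

    # Step 1: inside <code>...</code>, turn < and > into placeholders.
    $html =~ s{(<code>)(.*?)(</code>)}{
        my ($open, $body, $close) = ($1, $2, $3);
        $body =~ s/</openTag/g;
        $body =~ s/>/closingTag/g;
        $open . $body . $close;
    }gse;

    # Step 2: whatever angle brackets are left belong to real markup.
    # Ampersands go first so the entities produced below stay intact.
    $html =~ s/&/&amp;/g;
    $html =~ s/</&lt;/g;
    $html =~ s/>/&gt;/g;
    return $html;
}

# The snippet from the question, with the tags restored.
my $snippet = '<li> Moduldateinamen haben folgendes Format <code><Content>_<Type>_<Name>_</code>';
my $aspx = to_sharepoint_markup($snippet);
print "$aspx\n";
```

The order matters: the placeholders have to go in before the general escaping, otherwise the text tags can no longer be told apart from real markup.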
