CJ7

Reputation: 23275

Convert unicode to HTML entities function

I have the following function that converts Unicode to HTML entities, but if I run the function again over the result it does not leave the existing HTML entities intact. How can I get the function to leave already converted HTML entities alone?

sub convert_unicode {
    use HTML::Entities;
    use Encode;
    my $str = shift;
    Encode::_utf8_off($str);
    return encode_entities(decode('utf8',$str));
}

Upvotes: -1

Views: 1766

Answers (1)

Schwern

Reputation: 164809

What you're asking for is to be able to safely double character encode. Some encodings allow this. HTML character encoding does not because it uses certain characters like & to do the encoding and it cannot tell the difference between a special character being used for encoding and one that needs to be encoded.

For example...

use HTML::Entities;
use v5.10;
say encode_entities("&foo");

That produces &amp;foo. If we encode it again, it produces &amp;amp;foo, because & is a special character, which it faithfully encodes. It does not know that &amp; is an already encoded &, so it treats it as a literal & and encodes it.
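To make the double encoding concrete, here is a minimal sketch that runs encode_entities twice over the same string. The variable names are only for illustration.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

my $raw   = "&foo";
my $once  = encode_entities($raw);     # the & becomes &amp;
my $twice = encode_entities($once);    # the & inside &amp; is encoded again

print "$once\n";     # &amp;foo
print "$twice\n";    # &amp;amp;foo
```

The second call has no way to know the first one happened; all it sees is an & followed by some letters.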

You could write your own custom HTML encoding function that assumes &xxx; (and its variants) are already encoded, but that's just a guess. You can't actually tell a literal &foo; and an encoded &foo; apart. It will break with, for example, old school Perl code like &function;. Maybe you can be super clever and use an array of objects to indicate which parts are encoded and have the whole thing overload stringification so it looks like a string, and so long as everything carefully preserves that object that looks like a string it'll work...
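Here is a rough sketch of what such a guessing encoder might look like, to show where the guess goes wrong. encode_guessing is a hypothetical helper, not part of HTML::Entities, and the regex for "looks like an entity" is my own approximation.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Hypothetical helper: encode everything except substrings that
# look like an entity (&name;, &#123; or &#x1F;).
sub encode_guessing {
    my ($text) = @_;
    my $entity = qr/&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);/;

    # split with a capturing group keeps the separators, so the
    # odd-numbered parts are the things that look like entities.
    my @parts = split /($entity)/, $text;

    my $out = '';
    for my $i (0 .. $#parts) {
        $out .= $i % 2 ? $parts[$i] : encode_entities($parts[$i]);
    }
    return $out;
}

# Works when the guess happens to be right.
print encode_guessing('a < b &amp; c'), "\n";        # a &lt; b &amp; c

# Wrong guesses: both strings are literal text, not encoded HTML.
print encode_guessing('call &function; now'), "\n";  # call &function; now
print encode_guessing('Type &lt; to get <'), "\n";   # Type &lt; to get &lt;
```

In the last case the reader was supposed to see the literal text &lt;, but the browser will show "Type < to get <". The helper is idempotent, which is what was asked for, but it silently corrupts any data that legitimately contains something shaped like an entity.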

And now we're into the lava flow anti-pattern where rather than fixing bad design, more complex and bad design is layered on top of it. Trying to "fix" that will just create more problems. The real problem lies deeper.


The real problem is that you're encoding multiple times. This probably means you've welded your formatting and your functionality together. For example...

sub get_user_name {
    my $uid = shift;

    my $name = ...do a bunch of work to get the user name...

    return encode_entities($name);
}

By HTML encoding the data, a function like this makes assumptions about how the data is going to be used. It limits its use to just HTML. If all your functions do this, you have a double encoding problem.

Then maybe you have something like this:

sub do_something {
    my $uid = shift;

    # $name is already HTML encoded.
    my $name = get_user_name($uid);

    my $stuff = ...something incorporating $name...

    # Whoops, the user name is double encoded.
    return encode_entities($stuff);
}

The answer is to leave the HTML formatting and encoding until the last minute. Ideally don't do it at all, just work with data and let an HTML template system take care of it. Template Toolkit, for example.
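As a sketch of what "the last minute" could look like without a template system: the data functions return raw strings, and encoding happens exactly once, where the HTML is produced. The function names and the hard-coded user table are made up for the example.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Stand-in for a real lookup; returns raw, unencoded data.
sub get_user_name {
    my ($uid) = @_;
    my %names = (42 => 'Tom & Jerry <tj@example.com>');
    return $names{$uid};
}

# Works on raw data, so it can also feed email, logs, JSON, etc.
sub greeting {
    my ($uid) = @_;
    return 'Hello, ' . get_user_name($uid);
}

# The only place that knows the output is HTML.
sub render_greeting_html {
    my ($uid) = @_;
    return '<p>' . encode_entities(greeting($uid)) . '</p>';
}

print render_greeting_html(42), "\n";
# <p>Hello, Tom &amp; Jerry &lt;tj@example.com&gt;</p>
```

Since nothing upstream of render_greeting_html encodes anything, there is nothing to double encode.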

This also provides a clean separation between the formatting and the code, so now non-programmers can work on the formatting using a documented template system.
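With Template Toolkit, a minimal version might look like the following. It assumes the Template module is installed and uses its built-in html filter; the template text and the name variable are just examples.

```perl
use strict;
use warnings;
use Template;

my $tt = Template->new;

# The template decides how data is escaped; the code passes raw data.
my $template = "<p>Hello, [% name | html %]</p>\n";
my $output   = '';

$tt->process(\$template, { name => 'Tom & Jerry' }, \$output)
    or die $tt->error;

print $output;    # <p>Hello, Tom &amp; Jerry</p>
```

The Perl side never calls an encoding function at all, so the same data could be fed to a plain text template unchanged.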

Upvotes: 3
