ppant

Reputation: 762

Unable to encode to iso-8859-1 encoding for some chars using Perl Encode module

I have an HTML string in ISO-8859-1 encoding. I need to pass this string to HTML::Entities::decode_entities() to convert some of the HTML character codes (entities) to their respective characters. To do so I am using the HTML::Entities module (HTML::Parser 3.65), but after the decode_entities() call my whole string changes to a UTF-8 string. This behavior seems consistent with the HTML::Parser documentation. Since I need the string back in ISO-8859-1 for further processing, I used Encode::encode("iso-8859-1", $str) to convert it back. The results are fine except for some characters, where a question mark appears instead. One example is the curly single quote (’).

Can anybody tell me whether this is a limitation of the Encode module? Any other pointers for solving the problem would also be helpful. Here is a sample text containing the characters that cause the issue:

my $str = "This is a test string to test the encoding of some chars like ’ “ ” etc these are failing to encode; some of them which encode correctly are é « etc.";

Thanks

Upvotes: 1

Views: 2994

Answers (2)

cjm

Reputation: 62089

The fundamental problem is that the characters represented by ’, “, and ” do not exist in ISO-8859-1. You'll have to decide what it is that you want to do with them.
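To see the difference concretely, here is a minimal sketch (the variable names are only for illustration) that encodes one character that exists in ISO-8859-1 and one that does not, then prints the resulting bytes in hex:

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+00E9 (e with acute accent) is part of ISO-8859-1; U+2019 (right single quote) is not.
my $e_acute = encode('iso-8859-1', "\x{E9}");
my $rsquo   = encode('iso-8859-1', "\x{2019}");

# %vX prints the ordinal of each byte in hex.
printf "U+00E9 -> %vX\n", $e_acute;   # E9: mapped to its Latin-1 byte
printf "U+2019 -> %vX\n", $rsquo;     # 3F: the substitution character "?"
```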

Some possibilities:

Use cp1252, Microsoft's "extended" version of ISO-8859-1, instead of the real thing. It does include those characters.

Re-encode the characters outside the ISO-8859-1 range (plus &) as entities before converting from UTF-8 to ISO-8859-1:

my $toEncode = do { no warnings 'utf8'; "&\x{0100}-\x{10FFFF}" };
$string = HTML::Entities::encode_entities($string, $toEncode);

(The no warnings bit is needed because U+10FFFF is a noncharacter, and Perl may otherwise warn about it.)
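The cp1252 option can be sketched in a few lines. This is only an illustration, assuming that whatever consumes the output actually accepts Windows-1252 rather than strict ISO-8859-1; the sample string is made up:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample text containing U+2019 (right single quotation mark).
my $text  = "quote: \x{2019}";
my $bytes = encode('cp1252', $text);

# cp1252 assigns U+2019 to the single byte 0x92, so nothing is lost.
printf "last byte: %vX\n", substr($bytes, -1);
```

The bytes for characters that ISO-8859-1 does cover are the same in both encodings; only the 0x80-0x9F range differs.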

There are other possibilities. It really depends on what you're trying to accomplish.

Upvotes: 1

Snake Plissken

Reputation: 678

There's a third argument to Encode::encode, which controls how it handles characters it cannot map. The default is to substitute a replacement character (the question mark you are seeing), but you can set it to Encode::FB_CROAK to get an error message instead.
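A minimal sketch of that third argument, using a made-up sample string. FB_CROAK is the flag mentioned above; FB_HTMLCREF is another Encode fallback flag, added here as an assumption about what the asker may want, since it turns unmappable characters into numeric HTML entities rather than failing:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample text containing U+2019, which ISO-8859-1 lacks.
my $text = "it\x{2019}s";

# FB_CROAK: die instead of silently substituting "?".
# LEAVE_SRC keeps encode() from modifying $text in place.
my $bytes = eval {
    encode('iso-8859-1', $text, Encode::FB_CROAK | Encode::LEAVE_SRC);
};
print "encode failed: $@" unless defined $bytes;

# FB_HTMLCREF: replace each unmappable character with &#NNNN;.
my $with_refs = encode('iso-8859-1', $text,
                       Encode::FB_HTMLCREF | Encode::LEAVE_SRC);
print "$with_refs\n";   # it&#8217;s
```

Wrapping the FB_CROAK call in eval lets the caller decide how to react to the failure rather than aborting the whole program.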

Upvotes: 2
