Reputation: 18883
Consider the following code. Written this way, I get "Wide character in syswrite" when writing the file, and garbage in the browser:
use Mojolicious::Lite;
use Mojo::UserAgent;
use Mojo::File;
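get '/' => sub {
    my $c = shift;
    my $ua = Mojo::UserAgent->new;
    my $res = $ua->get('https://...')->result;
    # Writes decoded text to the file: triggers "Wide character in syswrite"
    Mojo::File->new('resp')->spurt($res->dom->at('.some-selector')->text);
    # Renders raw bytes as text: shows up as garbage in the browser
    $c->render(text => $res->body);
};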
app->start;
But this way it works:
use Encode qw/encode_utf8 decode_utf8/;
Mojo::File->new('resp')->spurt(encode_utf8($res->dom->at('.some-selector')->text));
Mojo::File->new('resp')->spurt($res->body);
$c->render(text => decode_utf8($res->body));
Can you explain what's going on here? Why do two of the statements not work without the Encode module? Why does the second one work? Is there a better way to handle it? I've skimmed perluniintro and perlunicode, but that's as far as I could get.
Upvotes: 7
Views: 1478
Reputation: 18883
From perluniintro, perlunicode, and xxfelixxx's link, my understanding is that Unicode is a complex matter and you generally can't make it "just work". There are bytes (octets) and there is text. Most of the time you have to convert bytes to text (decode) before processing input, and do the reverse (encode) before writing output. In your own code you could set this up with use open qw( :encoding(UTF-8) :std ); or with binmode. With third-party libraries that is not always possible.
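The decode-on-input, encode-on-output rule can be sketched with the core Encode module alone. This is a minimal illustration, not part of the original code: the sample string (the UTF-8 bytes of "café") and the variable names are assumptions chosen for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: five bytes as they might arrive from a socket or file.
# "\xc3\xa9" is the UTF-8 encoding of U+00E9 (e with acute accent).
my $bytes = "caf\xc3\xa9";

# Decode on input: bytes -> text (a string of Perl characters).
my $text = decode('UTF-8', $bytes);

# Encode on output: text -> bytes.
my $out = encode('UTF-8', $text);

# The same word has a different length depending on which form you hold.
printf "bytes: %d, characters: %d\n", length $bytes, length $text;
# prints "bytes: 5, characters: 4"
```

String functions such as length, substr, and regex matching only give sensible results on the decoded form, which is why the conversion belongs at the program's boundaries.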
So, $res->body is bytes, and $res->text is text decoded using the encoding specified in the response. $res->dom takes $res->text as input, so $res->dom->at('.some-selector')->text is text, while Mojo::File->new(...)->spurt() expects bytes. You therefore have to encode the text, here as UTF-8, before writing it. Note also that utf8 is not UTF-8: the latter is stricter and therefore safer, so prefer encode('UTF-8', ...) and decode('UTF-8', ...) over encode_utf8 and decode_utf8.
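To see the bytes/text split without a network call, here is a rough sketch that builds a response by hand. The HTML snippet, the euro-sign sample data, and the temporary file are assumptions for illustration; the hand-built Mojo::Message::Response merely stands in for $ua->get(...)->result. Because newer Mojolicious releases appear to name the write method spew rather than spurt, the sketch picks whichever one is available.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Mojo::File qw(tempfile);
use Mojo::Message::Response;

# Hand-built response standing in for $ua->get(...)->result (hypothetical data).
my $res = Mojo::Message::Response->new;
$res->headers->content_type('text/html; charset=UTF-8');
# Raw UTF-8 bytes; "\xe2\x82\xac" encodes U+20AC (euro sign).
$res->body(qq{<p class="some-selector">5 \xe2\x82\xac</p>});

# The DOM is built from the decoded text, so this is a character string.
my $text = $res->dom->at('.some-selector')->text;

# Write bytes, not text: encode explicitly before handing the data to Mojo::File.
my $file  = tempfile;
my $write = $file->can('spew') ? 'spew' : 'spurt';   # method name depends on version
$file->$write(encode('UTF-8', $text));

print length($text), " characters, ", length($file->slurp), " bytes\n";
```

Passing $text to the write method without encode is what produces the "Wide character in syswrite" message, since U+20AC does not fit in a single byte.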
Then, $c->render(text => ...); expects text, not bytes. So you either have to decode('UTF-8', $res->body), or pass $res->text.
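The effect on rendering can be checked with a small self-testing sketch. The routes /bytes and /text and the sample bytes are hypothetical, and Test::Mojo is used here only to inspect what goes over the wire; this assumes the default renderer encoding of UTF-8.

```perl
use Mojolicious::Lite;
use Encode qw(decode);
use Test::Mojo;
use Test::More;

# Stand-in for $res->body: UTF-8 bytes of "5" followed by a euro sign (sample data).
my $bytes = "5 \xe2\x82\xac";

# Bytes passed as text: render encodes them a second time.
get '/bytes' => sub {
    my $c = shift;
    $c->render(text => $bytes);
};

# Decoded text: render encodes it exactly once.
get '/text' => sub {
    my $c = shift;
    $c->render(text => decode('UTF-8', $bytes));
};

my $t = Test::Mojo->new;

$t->get_ok('/bytes');
isnt $t->tx->res->body, $bytes, 'double-encoded body differs from the original bytes';

$t->get_ok('/text');
is $t->tx->res->body, $bytes, 'decoded text goes out as the original bytes';

done_testing;
```

The double-encoded body from /bytes is the "garbage in a browser" from the question: each byte of the euro sign is treated as a separate character and encoded again.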
Upvotes: 6