How to make Mojolicious deal with UTF-8?

Question

Consider the following code. This way I get "Wide character in syswrite" for a file, and garbage in a browser:

use Mojolicious::Lite;
use Mojo::UserAgent;
use Mojo::File;

get '/' => sub {
    my $c = shift;
    my $ua  = Mojo::UserAgent->new;
    my $res = $ua->get('https://...')->result;
    Mojo::File->new('resp')->spurt($res->dom->at('.some-selector')->text);
    $c->render(text => $res->body);
};

app->start;

But this way it works:

use Encode qw/encode_utf8 decode_utf8/;
Mojo::File->new('resp')->spurt(encode_utf8($res->dom->at('.some-selector')->text));
Mojo::File->new('resp')->spurt($res->body);
$c->render(text => decode_utf8($res->body));

Can you explain what's going on here? Why do two of the statements not work without the Encode module? Why does the second one work? Is there a better way to handle it? I've skimmed over perluniintro and perlunicode, but that's as far as I could get.

x-yuri · Accepted Answer

What I've understood from perluniintro, perlunicode, and xxfelixxx's link is that Unicode is a complex matter, and you can't generally make it just work. There are bytes (octets) and there is text (characters). Most of the time you have to convert bytes to text (decode) before processing input, and do the reverse (encode) before producing output. If third-party libraries weren't involved, one could do use open qw( :encoding(UTF-8) :std );, or call binmode on the handle, and let the I/O layer do the conversion. But with third-party libraries you are not always able to do so.
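For code you control, the I/O layer approach can be sketched with core Perl only. The file name layer-demo.txt and the sample string are made up for illustration; this is a minimal sketch, not something taken from Mojolicious.

```perl
use strict;
use warnings;

# A text string (characters, not bytes): e-acute followed by a snowman.
my $text = "\x{e9}\x{2603}";

my $path = 'layer-demo.txt';    # hypothetical file name

# The :encoding(UTF-8) layer encodes text to bytes on write,
# so printing characters above 255 does not complain.
open my $out, '>:encoding(UTF-8)', $path or die "open $path: $!";
print {$out} $text;
close $out or die "close $path: $!";

# Reading through :raw shows the bytes that actually reached the disk.
open my $in, '<:raw', $path or die "open $path: $!";
my $bytes = do { local $/; <$in> };
close $in;

printf "%d characters became %d bytes\n", length $text, length $bytes;
unlink $path;
```

The point of the sketch is that the conversion happens once, at the boundary, and the rest of the program only sees text.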

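As such, $res->body is bytes, and $res->text is text decoded from the body using the charset specified in the response. $res->dom takes $res->text as input. So $res->dom->at('.some-selector')->text is text, whereas Mojo::File->new(...)->spurt() expects bytes. That leaves you no choice but to encode the text as UTF-8 first. $res->body, on the other hand, is already bytes, which is why spurt($res->body) works as is. And by the way, utf8 is not UTF-8: encode_utf8/decode_utf8 use Perl's lax utf8, while encode('UTF-8', ...) and decode('UTF-8', ...) use strict UTF-8. The latter is safer, so you'd better use the encode/decode functions.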
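The bytes versus text distinction, and the difference between the two encoding names, can be seen with the core Encode module alone. A small sketch with a made-up sample string:

```perl
use strict;
use warnings;
use Encode qw(encode decode find_encoding);

# Text: two characters, one of them outside Latin-1.
my $text = "\x{e9}\x{263a}";

# Encode text to bytes before handing it to something that writes bytes.
my $bytes = encode('UTF-8', $text);
printf "%d characters, %d bytes\n", length $text, length $bytes;

# Decode bytes back to text before treating them as characters.
my $round_trip = decode('UTF-8', $bytes);
print "round trip ok\n" if $round_trip eq $text;

# The two names resolve to different encodings inside Encode:
# 'UTF-8' is the strict one, 'utf8' is Perl's lax variant.
my $strict_name = find_encoding('UTF-8')->name;
my $lax_name    = find_encoding('utf8')->name;
print "UTF-8 -> $strict_name, utf8 -> $lax_name\n";
```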

Then, $c->render(text => ...) expects text, not bytes, and encodes it itself on output. So you have to either decode('UTF-8', $res->body) or pass $res->text.
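To check this without a network request, here is a sketch that builds a response by hand as a stand-in for $ua->get(...)->result. The HTML snippet and the variable names are made up, and it assumes the response is UTF-8; it needs Mojolicious installed.

```perl
use strict;
use warnings;
use Mojo::Message::Response;
use Mojo::Util qw(encode decode);

# Stand-in for a fetched response: UTF-8 bytes plus a charset header.
my $res = Mojo::Message::Response->new;
$res->headers->content_type('text/html; charset=UTF-8');
$res->body(encode('UTF-8', qq{<p class="some-selector">na\x{ef}ve \x{263a}</p>}));

# body is bytes; text and everything derived from dom are characters.
my $body_bytes = $res->body;
my $page_text  = $res->text;
my $node_text  = $res->dom->at('.some-selector')->text;

# For a writer that expects bytes: the body can go in as is,
# but DOM text has to be encoded first.
my $for_file = encode('UTF-8', $node_text);

# For render(text => ...): pass $res->text, or decode the body yourself.
my $for_render = decode('UTF-8', $body_bytes);

print "same text either way\n" if $for_render eq $page_text;
printf "node: %d characters, %d bytes when encoded\n",
    length $node_text, length $for_file;
```

In the handler from the question this would translate to spurt(encode('UTF-8', $res->dom->at('.some-selector')->text)) and $c->render(text => $res->text).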
