user2504649

Reputation: 81

Bad encoding with WWW::Mechanize in Perl

I'm trying to post content to a website with WWW::Mechanize.

My content appears to be UTF-8, while the page I post it to declares ISO-8859-15 encoding in the head of its HTML.

The post goes through, but the text arrives badly encoded.

Example of the result I get (in French):

acteur majeur de l?assurance et
référence en gestion
patrimoniale, propose une approche globale pour
une clientèle aisée et haut de gamme. 

Here is my code:

use WWW::Mechanize;
use HTML::TreeBuilder::XPath;   # provides new_from_content and findvalue
use Encode;
use open qw(:std :utf8);

my $mech = WWW::Mechanize->new(
   stack_depth => 0,
   timeout => 10,
);

# Fetch the source page and extract the text of its content div
$mech->get($urlContentOtherWebsite);
my $tree = HTML::TreeBuilder::XPath->new_from_content($mech->content);
my $content = $tree->findvalue('/html/body//div[@id="content"]');
$tree->delete;

# Load the form on my own site, fill it in, and submit it
$mech->get($urlFormMyWebsite);
$mech->form_name("formular"); # Form Post Emploi
$mech->set_fields(
  content => $content
);
$mech->submit;

Do you have any idea or clue how to resolve this problem?

Upvotes: 1

Views: 1152

Answers (2)

Steffen Ullrich

Reputation: 123320

From studying the code: HTML::Form, which is used inside WWW::Mechanize, uses the accept-charset attribute of the <form...> tag to find out which encoding to use. If there is no such attribute, then it uses a default charset, which is UTF-8. You can set the acceptable charset with $form->accept_charset('iso-8859-1'), so the following should work if I read the code correctly:

$mech->form_name("formular")->accept_charset('iso-8859-1');
$mech->set_fields(...);
$mech->submit;
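
To see why the form's charset matters, here is a minimal sketch using only the core Encode module. The sample string and variable names are illustrative, and ISO-8859-15 is assumed because that is the charset the question says the target page declares.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Illustrative sample: a character string containing two accented letters.
my $text = "r\x{e9}f\x{e9}rence";

# The same characters serialized under two different charsets.
my $utf8_bytes   = encode('UTF-8', $text);        # each e-acute becomes two bytes (C3 A9)
my $latin9_bytes = encode('ISO-8859-15', $text);  # each e-acute becomes one byte (E9)

# Print the byte length and a hex dump of each serialization.
printf "UTF-8:       %d bytes (%s)\n", length $utf8_bytes,   unpack('H*', $utf8_bytes);
printf "ISO-8859-15: %d bytes (%s)\n", length $latin9_bytes, unpack('H*', $latin9_bytes);
```

A server that interprets the submitted bytes as ISO-8859-15 will typically display each two-byte UTF-8 sequence as two separate characters, so the charset used for the form has to match what the server expects.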

Upvotes: 3

Borodin

Reputation: 126722

You need to add

binmode STDOUT, ':encoding(utf-8)';

at the start of your program to declare that STDOUT expects UTF-8 characters; otherwise you will see the individual bytes instead of the proper characters.

You also need to decode the input as UTF-8 using

use Encode;

followed by

decode('UTF-8', $_)

where the incoming text is in $_.

Here's an example:

use utf8;
use strict;
use warnings;

use Encode;

binmode STDOUT, ':encoding(utf-8)';

print decode('UTF-8', $_) for <DATA>;

__DATA__
acteur majeur de l?assurance et
référence en gestion
patrimoniale, propose une approche globale pour
une clientèle aisée et haut de gamme. 

Output:

acteur majeur de l?assurance et
référence en gestion
patrimoniale, propose une approche globale pour
une clientèle aisée et haut de gamme. 

I don't quite understand l?assurance, but I imagine that the data has been altered somewhere between the original web site and the Stack Overflow post. As you can see, the rest of the text is correct.
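
One possible explanation for the ?, offered as a guess rather than something the post confirms: if the original text used a typographic apostrophe (U+2019), that character has no ISO-8859-15 equivalent, and Encode substitutes ? by default when encoding to a non-Unicode charset. A minimal sketch, with an illustrative sample string:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Assumption: the source page used a typographic apostrophe (U+2019).
my $text = "l\x{2019}assurance";

# With the default CHECK value, characters that cannot be mapped
# to the target charset are replaced by the substitution character "?".
my $bytes = encode('ISO-8859-15', $text);

print "$bytes\n";
```

If this is what happened, the ? was introduced when the text was converted to ISO-8859-15, and decoding afterwards cannot restore the original character.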

Upvotes: 1
