Reputation: 55
I have a website where users may enter search terms that contain accented characters. Since users come from various countries and operating systems, the accented characters they submit may be encoded in windows-1252, iso-8859-1, or some other iso-8859-X or windows-125X charset.
I am using Perl, and my index server is Solr 8, with all data stored as UTF-8. If the source charset is known, I can convert the input with decode + encode, but how can I convert accented characters in an unknown charset to UTF-8? How can I detect the charset of the submitted characters in Perl?
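use Encode qw(decode encode);
# Works only when the source charset (here windows-1252) is known in advance.
my $utf8 = encode("UTF-8", decode("cp1252", $input));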
Upvotes: 2
Views: 609
Reputation: 109547
The web page and the form need to specify UTF-8.
Then the browser can accept any script, and will send it to the server as UTF-8.
Setting the form's encoding also prevents the browser from sending characters it cannot represent as numeric character references such as &#259; (for ă).
Header:
Content-type: text/html; charset=UTF-8
With Perl (the empty line marks the end of the headers):
print "Content-Type: text/html; charset=UTF-8\n\n";
HTML content; in HTML 5:
<!DOCTYPE html>
<html>
<meta charset="UTF-8">
...
<form ... accept-charset="UTF-8">
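On the server side, the submitted bytes still have to be decoded. A minimal sketch, assuming the search term arrives as a raw byte string: decode it strictly as UTF-8, and fall back to windows-1252 only when the bytes are not valid UTF-8 (for example, from a client that ignored the form's charset). The helper name `decode_search_term` and the choice of windows-1252 as the fallback are assumptions for illustration, not something the setup above requires.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn the raw bytes of a submitted search term
# into a Perl character string.
sub decode_search_term {
    my ($bytes) = @_;

    # Strict decode: FB_CROAK dies on malformed UTF-8, and LEAVE_SRC keeps
    # $bytes unmodified so the fallback below can still use it.
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text if defined $text;

    # Assumption: input that is not valid UTF-8 came from a legacy
    # Western European client, so treat it as windows-1252.
    return decode('cp1252', $bytes);
}
```

The returned character string can then be passed through `encode("UTF-8", ...)` before it is sent to Solr. This is a heuristic rather than true charset detection: a few legacy byte sequences happen to be valid UTF-8, and input in other iso-8859-X charsets would be misread by the windows-1252 fallback.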
Upvotes: 4