Swiftlet

Reputation: 55

Converting accented characters of unknown charset to UTF-8

I have a website where users can enter search terms containing accented characters. Since users come from various countries and use various operating systems, the accented characters they type may arrive encoded in windows-1252, iso-8859-1, or even another iso-8859-X or windows-125X charset.

I am using Perl, and my index server is Solr 8, with all data in UTF-8. If the source charset is known, I can convert the input with decode + encode, but how can I convert accented characters of unknown charset to UTF-8? How can I detect the charset of the source characters in Perl?

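use utf8;
use Encode qw(decode encode);
# $input holds the raw bytes of the search term, here assumed to be windows-1252
my $utf8 = encode("UTF-8", decode("cp1252", $input));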

Upvotes: 2

Views: 609

Answers (1)

Joop Eggen

Reputation: 109547

The web page and the form need to specify UTF-8.

Then the browser can accept any script, and will send it to the server as UTF-8.

Declaring UTF-8 as the form's encoding also prevents the browser from sending numeric character references such as &#259; for characters that the form's charset cannot represent.

Header:

Content-type: text/html; charset=UTF-8

With Perl (the empty line marks the end of the headers):

print "Content-Type: text/html; charset=UTF-8\n\n";

HTML content; in HTML 5:

<!DOCTYPE html>
<html>
    <meta charset="UTF-8">
...
<form ... accept-charset="UTF-8">
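
On the server side, the submitted term can then be decoded strictly as UTF-8. The following is only a minimal sketch, not a charset detector: `to_utf8_bytes` is a hypothetical helper name, and the fallback to cp1252 for input that is not valid UTF-8 is an assumption (a guess for stray legacy clients, which cannot distinguish one single-byte charset from another).

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: takes the raw bytes of a submitted search term and
# returns UTF-8 encoded bytes that can be passed on to Solr.
sub to_utf8_bytes {
    my ($raw) = @_;

    # Try a strict UTF-8 decode first. FB_CROAK makes decode() die on
    # malformed input; LEAVE_SRC keeps $raw unmodified for the fallback.
    my $text = eval { decode('UTF-8', $raw, FB_CROAK | LEAVE_SRC) };

    # Assumption: anything that is not valid UTF-8 is treated as cp1252.
    # This is a guess, not real detection.
    $text = decode('cp1252', $raw) unless defined $text;

    return encode('UTF-8', $text);
}
```

With the page and form declared as UTF-8, the fallback branch should rarely run; logging when it does may help show whether any clients still send another charset.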

Upvotes: 4
