Rotem Varon

Reputation: 1647

How do I find the character encoding for a file?

I have an XML file whose declaration does not specify the encoding (charset / Character encoding / character set / character map / codeset / code page). Here is an example of a declaration that does:

<?xml version="1.0" encoding="UTF-8"?>

The XML is being generated by a Perl script and the following is an excerpt:

$fileName = $exportDirectory . $fileName;
open FILE, ">$fileName" or die;

The questions:

  1. In this case, is there an easy way to find the encoding for the generated XML?
  2. The script queries other sources of information (such as an Oracle database) and appends the data to the XML file. Is the character encoding dictated by the source of the data, or by the open file operation?
  3. In general, is there an easy way to find the encoding of an arbitrary file?

I tried to use LibXML:

perl -MXML::LibXML -e 'XML::LibXML->load_xml(location => "2.xml")'
2.xml:1364531: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xBF 0x30 0x39 0x20
female presented in spring �09 due t
                           ^
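As a quick sanity check on that error, the four bytes the parser reports can be examined on their own. The following is only a minimal Python sketch (the variable names are illustrative, and ISO-8859-1 is tried purely as an example of an 8-bit encoding, not as a claim about what the file actually uses):

```python
# The four bytes quoted in the libxml2 error message.
data = bytes([0xBF, 0x30, 0x39, 0x20])

# 0xBF is a UTF-8 continuation byte, so it cannot start a character.
try:
    data.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# ISO-8859-1 maps every byte to a character, so this decode always succeeds;
# it shows what the bytes would be IF the file were Latin-1.
as_latin1 = data.decode("latin-1")

print(valid_utf8)       # False
print(repr(as_latin1))  # '¿09 '
```

This only confirms that the file is not valid UTF-8 at that position; it does not by itself identify the real encoding.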

I hope I supplied sufficient information. Please let me know if further information is needed.

Upvotes: 0

Views: 214

Answers (1)

Karol S

Reputation: 9421

You can use enca or chardet.

You may have to compile enca yourself. As for chardet, your distribution's package repository may already provide it as a ready-made script.

Enca works only for European languages and requires you to tell it which language the file is in. Chardet is worse at distinguishing between European languages stored in 8-bit encodings, but performs better on non-European text.
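If neither tool is at hand, a much cruder check is trial decoding: try a short list of candidate encodings in order and keep the first one that decodes without error. This is not how enca or chardet work (they use statistical detection), and the function name `guess_encoding` and the default candidate list below are assumptions made for this sketch:

```python
from typing import Iterable, Optional


def guess_encoding(
    data: bytes,
    candidates: Iterable[str] = ("utf-8", "cp1252", "latin-1"),
) -> Optional[str]:
    """Return the first candidate that decodes `data` strictly, else None.

    A successful decode only proves the bytes are *valid* in that encoding,
    not that it is the encoding the author intended.
    """
    for name in candidates:
        try:
            data.decode(name)  # strict mode: raises on invalid bytes
        except UnicodeDecodeError:
            continue
        return name
    return None


# Pure ASCII is also valid UTF-8, so the first candidate wins.
print(guess_encoding(b"plain ascii"))

# A lone 0xBF byte is invalid UTF-8, so the next candidate is reported.
print(guess_encoding(b"spring \xbf09 due"))
```

To try it on a real file, read it in binary mode first, for example `guess_encoding(open("2.xml", "rb").read())`. Put strict encodings such as UTF-8 first in the list: 8-bit encodings accept almost any byte sequence, and `latin-1` accepts all of them, so they can only serve as a fallback.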

Upvotes: 1
