Johann Gerell

Reputation: 25581

How to identify UTF-8 encoded strings

What's the best way to identify whether a string is (or might be) UTF-8 encoded? The Win32 API IsTextUnicode isn't of much help here. Also, the string will not have a UTF-8 BOM, so that cannot be checked for. And, yes, I know that only characters above the ASCII range are encoded with more than 1 byte.

Upvotes: 18

Views: 14527

Answers (9)

As an add-on to the previous answer about the Win32 mlang DetectInputCodepage() API, here's how to call it in C:

#include <Mlang.h>
#include <objbase.h>
#pragma comment(lib, "ole32.lib")

HRESULT hr;
IMultiLanguage2 *pML = NULL;
char szBuffer[] = "some text to analyze"; /* must point at real data; the original left this uninitialized */
int iSize = sizeof(szBuffer) - 1;
DetectEncodingInfo lpInfo[10];
int iCount = sizeof(lpInfo) / sizeof(DetectEncodingInfo);

hr = CoInitialize(NULL);
hr = CoCreateInstance(&CLSID_CMultiLanguage, NULL, CLSCTX_INPROC_SERVER, &IID_IMultiLanguage2, (LPVOID *)&pML);
hr = pML->lpVtbl->DetectInputCodepage(pML, 0, 0, szBuffer, &iSize, lpInfo, &iCount);

pML->lpVtbl->Release(pML);
CoUninitialize();

But the test results are very disappointing:

  • It can't distinguish between French texts in CP 437 and CP 1252, even though the text is completely unreadable if opened in the wrong code page.
  • It can detect text encoded in CP 65001 (UTF-8), but not text in UTF-16, which is wrongly reported as CP 1252 with good confidence!

Upvotes: 0

Edward Wilde

Reputation: 26507

chardet is the character set detection library developed by Mozilla and used in Firefox. Source code

jchardet is a Java port of the source of Mozilla's automatic charset detection algorithm.

NCharDet is a .NET (C#) port of a Java port of the C++ detector used in the Mozilla and Firefox browsers.

A Code Project C# sample that uses Microsoft's MLang for character encoding detection.

UTRAC is a command-line tool and library written in C++ to detect string encoding.

cpdetector is a Java project used for encoding detection.

chsdet is a Delphi project, a standalone executable module for automatic charset/encoding detection of a given text or file.

Another useful post that points to many libraries for determining character encoding: http://fredeaker.blogspot.com/2007/01/character-encoding-detection.html

You could also take a look at the related question How Can I Best Guess the Encoding when the BOM (Byte Order Mark) is Missing?, it has some useful content.

Upvotes: 22

user90843

Reputation:

For Win32, you can use the mlang API. It is part of Windows and supported since Windows XP; a nice thing about it is that it gives you statistics on how likely the input is to be in a particular encoding:

CComPtr<IMultiLanguage2> lang;
HRESULT hr = lang.CoCreateInstance(CLSID_CMultiLanguage, NULL, CLSCTX_INPROC_SERVER);
const char* str = "\xEF\xBB\xBF" "abc"; // EF BB BF 61 62 63 (UTF-8 BOM followed by "abc")
int size = 6;
DetectEncodingInfo encodings[100];
int encodingsCount = 100;
hr = lang->DetectInputCodepage(MLDETECTCP_NONE, 0, const_cast<char*>(str), &size, encodings, &encodingsCount);

Upvotes: 2

Remy Lebeau

Reputation: 595412

On Windows, you can use MultiByteToWideChar() with the CP_UTF8 codepage and the MB_ERR_INVALID_CHARS flag. If the function fails, the string is not valid UTF-8.
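A minimal sketch of that check (Windows-only; the wrapper name `IsValidUtf8` is just for illustration, not an API function):

```c
#include <windows.h>
#include <stdbool.h>

/* Ask MultiByteToWideChar to size-convert the buffer as UTF-8.
 * With MB_ERR_INVALID_CHARS, any invalid sequence makes it fail,
 * returning 0 and setting ERROR_NO_UNICODE_TRANSLATION. */
bool IsValidUtf8(const char *data, int len)
{
    if (len == 0)
        return true; /* an empty buffer is trivially valid */
    int wideLen = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                      data, len, NULL, 0);
    return wideLen != 0; /* 0 => not valid UTF-8 */
}
```

Note that this only proves the bytes *can* be decoded as UTF-8; plain ASCII and some 8-bit text will also pass.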

Upvotes: 1

Ryan

Reputation: 14649

You didn't specify a language, but in PHP you can use mb_check_encoding:

    if (mb_check_encoding($yourString, 'UTF-8')) {
        // the string is UTF-8
    } else {
        // the string is not UTF-8
    }

Upvotes: 2

Tom

Reputation: 2036

C/C++ standalone library based on Mozilla's character set detector

https://github.com/batterseapower/libcharsetdetect

Universal Character Set Detector (UCSD): a library exposing a dependency-free C interface to the Mozilla C++ UCSD library. It provides a highly accurate set of heuristics that attempt to determine the character set used to encode some input text. This is extremely useful when your program has to handle an input file that is supplied without any encoding metadata.

Upvotes: 1

Harry Wood

Reputation: 2351

To do character detection in Ruby, install the 'chardet' gem:

sudo gem install chardet

Here's a little Ruby script to run chardet over the standard input stream:

require "rubygems"
require 'UniversalDetector' #chardet gem
infile =  $stdin.read()
p UniversalDetector::chardet(infile)

Chardet outputs a guess at the character set encoding, along with a confidence level (0-1) from its statistical analysis.

see also this snippet

Upvotes: 1

hamishmcn

Reputation: 7981

This W3C page has a Perl regular expression for validating UTF-8.

Upvotes: 6

Laurent

Reputation: 6205

There is no completely reliable way, but a random sequence of bytes (e.g. a string in a standard 8-bit encoding) is very unlikely to be a valid UTF-8 string: if the most significant bit of a byte is set, there are very specific rules about what kinds of bytes can follow it in UTF-8. So you can try decoding the string as UTF-8 and consider it UTF-8 if there are no decoding errors.

Determining whether there were decoding errors is another problem altogether: many Unicode libraries simply replace invalid characters with a question mark without indicating whether an error occurred. So you need an explicit way of determining whether decoding failed.
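Those byte rules can be implemented directly. Below is a minimal, self-contained sketch of such a validator in C (the function name is mine); it walks the bytes and rejects invalid lead bytes, bad continuation bytes, overlong forms, UTF-16 surrogates, and code points above U+10FFFF:

```c
#include <stdbool.h>
#include <stddef.h>

bool is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        size_t n;          /* number of continuation bytes expected */
        unsigned long cp;  /* decoded code point */

        if (b < 0x80)      { i++; continue; }  /* ASCII byte */
        else if (b < 0xC2) return false;       /* stray continuation or overlong lead */
        else if (b < 0xE0) { n = 1; cp = b & 0x1F; }
        else if (b < 0xF0) { n = 2; cp = b & 0x0F; }
        else if (b < 0xF5) { n = 3; cp = b & 0x07; }
        else               return false;       /* lead byte beyond U+10FFFF */

        if (i + n >= len) return false;        /* truncated sequence */
        for (size_t j = 1; j <= n; j++) {
            if ((s[i + j] & 0xC0) != 0x80)     /* must be 10xxxxxx */
                return false;
            cp = (cp << 6) | (s[i + j] & 0x3F);
        }
        /* reject overlong encodings, surrogates, and out-of-range values */
        if (cp < (n == 1 ? 0x80UL : n == 2 ? 0x800UL : 0x10000UL)) return false;
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;
        if (cp > 0x10FFFF) return false;
        i += n + 1;
    }
    return true;
}
```

As the answer notes, a pass only means the bytes *could* be UTF-8; short 8-bit strings that happen to follow the rules will also pass.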

Upvotes: 7
