Johann Gerell

Reputation: 25581

How to identify UTF-8 encoded strings

What's the best way to identify whether a string is (or might be) UTF-8 encoded? The Win32 API IsTextUnicode isn't of much help here. Also, the string will not have a UTF-8 BOM, so that cannot be checked for. And, yes, I know that only characters above the ASCII range are encoded with more than 1 byte.

Upvotes: 18

Views: 14527

Answers (9)

As an add-on to the previous answer about the Win32 mlang DetectInputCodepage() API, here's how to call it in C:

#include <Mlang.h>
#include <objbase.h>
#pragma comment(lib, "ole32.lib")

HRESULT hr;
IMultiLanguage2 *pML = NULL;
char szBuffer[] = "some text to analyze"; /* must point at real data; the original left this uninitialized */
int iSize = sizeof(szBuffer) - 1;
DetectEncodingInfo lpInfo[10];
int iCount = sizeof(lpInfo) / sizeof(DetectEncodingInfo);

hr = CoInitialize(NULL);
hr = CoCreateInstance(&CLSID_CMultiLanguage, NULL, CLSCTX_INPROC_SERVER, &IID_IMultiLanguage2, (LPVOID *)&pML);
hr = pML->lpVtbl->DetectInputCodepage(pML, 0, 0, szBuffer, &iSize, lpInfo, &iCount);

pML->lpVtbl->Release(pML);
CoUninitialize();

But the test results are very disappointing:

  • It can't distinguish between French texts in CP 437 and CP 1252, even though the text is completely unreadable if opened in the wrong code page.
  • It can detect text encoded in CP 65001 (UTF-8), but not text in UTF-16, which is wrongly reported as CP 1252 with good confidence!

Upvotes: 0

Edward Wilde

Reputation: 26507

chardet is the character set detection library developed by Mozilla and used in Firefox. Source code

jchardet is a Java port of the source of Mozilla's automatic charset detection algorithm.

NCharDet is a .NET (C#) port of a Java port of the C++ detector used in the Mozilla and Firefox browsers.

A Code Project C# sample that uses Microsoft's MLang for character encoding detection.

UTRAC is a command-line tool and library written in C++ to detect string encoding.

cpdetector is a Java project used for encoding detection.

chsdet is a Delphi project, a standalone executable module for automatic charset/encoding detection of a given text or file.

Another useful post that points to many libraries for determining character encoding: http://fredeaker.blogspot.com/2007/01/character-encoding-detection.html

You could also take a look at the related question How Can I Best Guess the Encoding when the BOM (Byte Order Mark) is Missing?, it has some useful content.

Upvotes: 22

user90843

Reputation:

For Win32, you can use the mlang API. It is part of Windows and supported since Windows XP; a nice thing about it is that it gives you statistics on how likely the input is to be in a particular encoding:

CComPtr<IMultiLanguage2> lang;
HRESULT hr = lang.CoCreateInstance(CLSID_CMultiLanguage, NULL, CLSCTX_INPROC_SERVER);
const char* str = "\xEF\xBB\xBF" "abc"; // EF BB BF 61 62 63 (UTF-8 BOM followed by "abc")
int size = 6;
DetectEncodingInfo encodings[100];
int encodingsCount = 100;
hr = lang->DetectInputCodepage(MLDETECTCP_NONE, 0, const_cast<char*>(str), &size, encodings, &encodingsCount);

Upvotes: 2

Remy Lebeau

Reputation: 595412

On Windows, you can use MultiByteToWideChar() with the CP_UTF8 codepage and the MB_ERR_INVALID_CHARS flag. If the function fails, the string is not valid UTF-8.
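A minimal sketch of that check (Windows-only; the wrapper name `IsValidUtf8` is just for illustration, not an API function):

```c
#include <windows.h>
#include <stdbool.h>

/* Ask MultiByteToWideChar to size-convert the buffer as UTF-8.
 * With MB_ERR_INVALID_CHARS, any invalid sequence makes it fail,
 * returning 0 and setting ERROR_NO_UNICODE_TRANSLATION. */
bool IsValidUtf8(const char *data, int len)
{
    if (len == 0)
        return true; /* an empty buffer is trivially valid */
    int wideLen = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                      data, len, NULL, 0);
    return wideLen != 0; /* 0 => not valid UTF-8 */
}
```

Note that this only proves the bytes *can* be decoded as UTF-8; plain ASCII and some 8-bit text will also pass.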

Upvotes: 1

Ryan

Reputation: 14649

You didn't specify a language, but in PHP you can use mb_check_encoding:

    if (mb_check_encoding($yourString, 'UTF-8')) {
        // the string is UTF-8
    } else {
        // the string is not UTF-8
    }

Upvotes: 2

Tom

Reputation: 2036

C/C++ standalone library based on Mozilla's character set detector

https://github.com/batterseapower/libcharsetdetect

Universal Character Set Detector (UCSD): a library exposing a dependency-free C interface to the Mozilla C++ UCSD library. It provides a highly accurate set of heuristics that attempt to determine the character set used to encode some input text. This is extremely useful when your program has to handle an input file that is supplied without any encoding metadata.

Upvotes: 1

Harry Wood

Reputation: 2351

To do character detection in Ruby, install the 'chardet' gem:

sudo gem install chardet

Here's a little Ruby script to run chardet over the standard input stream:

require "rubygems"
require 'UniversalDetector' #chardet gem
infile =  $stdin.read()
p UniversalDetector::chardet(infile)

Chardet outputs a guess at the character set encoding, along with a confidence level (0-1) from its statistical analysis.

see also this snippet

Upvotes: 1

hamishmcn

Reputation: 7981

This W3C page has a Perl regular expression for validating UTF-8.

Upvotes: 6

Laurent

Reputation: 6205

There is no completely reliable way, but a random sequence of bytes (e.g. a string in a standard 8-bit encoding) is very unlikely to be a valid UTF-8 string: if the most significant bit of a byte is set, there are very specific rules about what kinds of bytes can follow it in UTF-8. So you can try decoding the string as UTF-8 and consider it UTF-8 if there are no decoding errors.

Determining whether there were decoding errors is another problem altogether: many Unicode libraries simply replace invalid characters with a question mark without indicating whether an error occurred. So you need an explicit way of determining whether decoding failed.
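Those byte rules can be implemented directly. Below is a minimal, self-contained sketch of such a validator in C (the function name is mine); it walks the bytes and rejects invalid lead bytes, bad continuation bytes, overlong forms, UTF-16 surrogates, and code points above U+10FFFF:

```c
#include <stdbool.h>
#include <stddef.h>

bool is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        size_t n;          /* number of continuation bytes expected */
        unsigned long cp;  /* decoded code point */

        if (b < 0x80)      { i++; continue; }  /* ASCII byte */
        else if (b < 0xC2) return false;       /* stray continuation or overlong lead */
        else if (b < 0xE0) { n = 1; cp = b & 0x1F; }
        else if (b < 0xF0) { n = 2; cp = b & 0x0F; }
        else if (b < 0xF5) { n = 3; cp = b & 0x07; }
        else               return false;       /* lead byte beyond U+10FFFF */

        if (i + n >= len) return false;        /* truncated sequence */
        for (size_t j = 1; j <= n; j++) {
            if ((s[i + j] & 0xC0) != 0x80)     /* must be 10xxxxxx */
                return false;
            cp = (cp << 6) | (s[i + j] & 0x3F);
        }
        /* reject overlong encodings, surrogates, and out-of-range values */
        if (cp < (n == 1 ? 0x80UL : n == 2 ? 0x800UL : 0x10000UL)) return false;
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;
        if (cp > 0x10FFFF) return false;
        i += n + 1;
    }
    return true;
}
```

As the answer notes, a pass only means the bytes *could* be UTF-8; short 8-bit strings that happen to follow the rules will also pass.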

Upvotes: 7
