dasp

Reputation: 907

Detect presence of a specific charset

I need a way to detect whether a file contains characters from a certain charset.

Specifically, I want to detect the presence of UTF-8-encoded Cyrillic characters in a series of files. Is there a tool to do this?

Thanks

Upvotes: 0

Views: 392

Answers (2)

drdaeman

Reputation: 11471

If you are looking for a ready-made solution, you might want to try Enca.

However, if you only want to detect the presence of byte sequences that could be decoded as UTF-8 Cyrillic characters (without a full UTF-8 validity check), you can just grep for something like /(\xD0[\x81\x90-\xBF]|\xD1[\x80-\x8F\x91]){n,}/ (this exact regexp matches n or more consecutive UTF-8-encoded Russian Cyrillic characters). To additionally check that the whole file contains only valid UTF-8 data, you can use something like isutf8(1).

Both methods have trade-offs and may sometimes give wrong results: charset detection is a heuristic, and a byte-pattern match can also fire on data that is not UTF-8 text at all.
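If you'd rather script the byte-level check than use grep, here is a minimal Python sketch of the same idea. It reuses the regexp from above on raw bytes; the function names (contains_utf8_cyrillic, is_valid_utf8) and the min_run parameter are made up for illustration, and the validity check is only a stand-in for isutf8(1), not that tool's actual behaviour.

```python
import re


def contains_utf8_cyrillic(data: bytes, min_run: int = 1) -> bool:
    """Return True if data has at least min_run consecutive byte pairs
    that look like UTF-8-encoded Russian Cyrillic letters.

    Same byte ranges as the regexp in the answer:
      D0 81        -> U+0401 (Ё)
      D0 90..BF    -> U+0410..U+043F (А..п)
      D1 80..8F    -> U+0440..U+044F (р..я)
      D1 91        -> U+0451 (ё)
    No UTF-8 validity check is done here.
    """
    pattern = re.compile(
        rb"(?:\xD0[\x81\x90-\xBF]|\xD1[\x80-\x8F\x91]){%d,}" % min_run
    )
    return pattern.search(data) is not None


def is_valid_utf8(data: bytes) -> bool:
    """Rough stand-in for isutf8(1): True if the whole buffer decodes as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

To use it on a file, read the file in binary mode (open(path, "rb").read()) and pass the bytes to both functions. Requiring both to return True cuts down on false positives from binary data or other 8-bit encodings that happen to contain matching byte pairs.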

Upvotes: 2

Glen

Reputation: 22290

IIRC the ICU library has code that does character set detection, though it's basically a best-effort guess.

Edit: I did remember correctly, check out this paper / tutorial

Upvotes: 2
