How to get the number of characters in a file (not bytes) in C on Linux

Question

I would like to get the number of characters in a file. By characters I mean "real" characters, not bytes, assuming I know the file's encoding.

I tried to use mbstowcs(), but it doesn't work because it uses the system locale (or the one set with setlocale()). Because setlocale() is not thread-safe, I don't think it's a good idea to call it before mbstowcs(). Even if it were thread-safe, I would have to be sure that my program won't "jump" (signal, etc.) between the two calls to setlocale() (one to set the locale to the file's encoding, and one to revert to the previous one).

So, to take an example, imagine we have a file ru.txt encoded using a Russian encoding (KOI8, for example). I would like to open the file and get the number of characters, assuming the encoding of the file is KOI8.

It would be so easy if mbstowcs() could take a source_encoding argument...

EDIT: Another problem with mbstowcs() is that the locale corresponding to the encoding of the file has to be installed on the system...

M.E.L. · Accepted Answer

I'd suggest using iconv(3):

NAME
   iconv - perform character set conversion

SYNOPSIS
   #include <iconv.h>

   size_t iconv(iconv_t cd,
                char **inbuf, size_t *inbytesleft,
                char **outbuf, size_t *outbytesleft);

and convert to UTF-32. You get 4 bytes of output for every character converted (plus 4 for the BOM, which glibc emits for plain "UTF-32"; an explicit byte order such as "UTF-32LE" avoids it). It should be possible to convert the input piece by piece using a fixed-size outbuf, if one chooses outbytesleft carefully (i.e. 4 * inbytesleft + 4 :-).
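
A minimal sketch of that approach, under a few assumptions: count_chars() is a hypothetical helper name (not part of iconv), "characters" means Unicode code points (so a base letter plus a combining mark counts as two), and the target is "UTF-32LE" so that no BOM has to be subtracted. The encoding argument is whatever name your iconv implementation accepts, e.g. "KOI8-R".

```c
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: counts the code points in fp, whose bytes are
 * assumed to be in `encoding` (an iconv name such as "KOI8-R").
 * Returns the count, or -1 on any error. */
long count_chars(FILE *fp, const char *encoding)
{
    /* An explicit byte order means iconv writes no BOM. */
    iconv_t cd = iconv_open("UTF-32LE", encoding);
    if (cd == (iconv_t)-1)
        return -1;                  /* encoding not supported */

    char in[4096];
    char out[4 * sizeof in];        /* 4 output bytes per input byte */
    size_t pending = 0;             /* incomplete sequence carried over */
    long count = 0;
    int ok = 1;

    for (;;) {
        size_t n = fread(in + pending, 1, sizeof in - pending, fp);
        if (n == 0)
            break;                  /* EOF or read error */

        char *inp = in, *outp = out;
        size_t inleft = pending + n, outleft = sizeof out;

        /* EINVAL only means the chunk ends in the middle of a multibyte
         * sequence; the leftover bytes are retried with the next chunk.
         * EILSEQ (invalid input) and E2BIG are treated as failures. */
        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1
            && errno != EINVAL) {
            ok = 0;
            break;
        }

        count += (long)((sizeof out - outleft) / 4);
        memmove(in, inp, inleft);   /* keep the unconverted tail */
        pending = inleft;
    }

    /* Bytes still pending at EOF mean the file was truncated. */
    if (ferror(fp) || pending != 0)
        ok = 0;

    iconv_close(cd);
    return ok ? count : -1;
}
```

For the file in the question this would be called as count_chars(fp, "KOI8-R") on a FILE * opened with fopen("ru.txt", "rb"). No locale is touched, so the setlocale() thread-safety concern does not arise, and the only requirement is that the iconv implementation knows the encoding name.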
