Thibaut D.

Reputation: 2663

How to get the number of characters in a file (not bytes) in C on Linux

I would like to get the number of characters in a file. By characters I mean "real" characters, not bytes, assuming I know the file encoding.

I tried to use mbstowcs(), but it doesn't work for this because it uses the system locale (or the one set with setlocale()). Because setlocale() is not thread-safe, I don't think it's a good idea to call it before mbstowcs(). Even if it were thread-safe, I would have to be sure that my program won't "jump" (signal handler, etc.) between the two calls to setlocale() (one call to set it to the encoding of the file, and one call to revert to the previous locale).

To take an example, imagine we have a file ru.txt encoded using a Russian encoding (KOI8, for example). I would like to open the file and get the number of characters, assuming the encoding of the file is KOI8.

It could be so easy if mbstowcs() could take a source_encoding argument...

EDIT: Another problem with mbstowcs() is that the locale corresponding to the encoding of the file has to be installed on the system...

Upvotes: 4

Views: 817

Answers (2)

Sergey K.

Reputation: 25386

To calculate the number of UTF-8 characters in a file, just pass its content to this function:

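#include <string>

// Counts UTF-8 code points: every byte that is not a continuation
// byte (bit pattern 10xxxxxx) starts a new character.
int CalcUTF8Chars( const std::string& S )
{
    int Count = 0;

    for ( size_t i = 0; i != S.length(); i++ )
    {
        if ( ( S[i] & 0xC0 ) != 0x80 ) { Count++; }
    }

    return Count;
}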

No external dependencies.
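
Since the function above takes a std::string, it is C++ rather than C. A plain C equivalent might look like the sketch below; the name utf8_count_chars is made up for illustration. Like the original, it assumes the input is valid UTF-8 and counts code points, so a base letter followed by a combining accent counts as two.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C port of the function above: counts every byte that
   is not a UTF-8 continuation byte (10xxxxxx). Does not validate. */
size_t utf8_count_chars(const char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        /* Cast first: plain char may be signed on this platform. */
        if (((unsigned char)buf[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

Read the file into a buffer (or feed it chunk by chunk and add up the results) and pass the byte length, not strlen(), if the data may contain NUL bytes.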

Update:

If you want to handle other encodings, you have two choices:

  1. Use a third-party library that can handle it, for example, ICU http://site.icu-project.org/

  2. Write the calculation functions yourself for every encoding you want to use.
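
To give an idea of what option 2 involves, here is a rough sketch in C for two more cases: a single-byte encoding such as KOI8-R, and UTF-16LE. The function names are invented for this example, and neither function validates its input.

```c
#include <assert.h>
#include <stddef.h>

/* Single-byte encodings (KOI8-R, ISO-8859-x, ...): every byte is
   exactly one character, so the count is just the byte length. */
size_t single_byte_count_chars(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    (void)buf;
    return len;
}

/* UTF-16LE: count every 16-bit unit except low surrogates
   (0xDC00..0xDFFF), which are the second half of a surrogate pair.
   Returns -1 if the length is odd. */
long utf16le_count_chars(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    if (len % 2 != 0)
        return -1;

    long count = 0;
    for (size_t i = 0; i < len; i += 2) {
        unsigned unit = buf[i] | ((unsigned)buf[i + 1] << 8);
        if (unit < 0xDC00 || unit > 0xDFFF)
            count++;
    }
    return count;
}
```

Multi-byte legacy encodings (Shift_JIS, GB18030, ...) each need their own lead-byte rules, which is where a library starts to pay off.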

Upvotes: 0

M.E.L.

Reputation: 613

I'd suggest using iconv(3):

NAME
   iconv - perform character set conversion

SYNOPSIS
   #include <iconv.h>

   size_t iconv(iconv_t cd,
                char **inbuf, size_t *inbytesleft,
                char **outbuf, size_t *outbytesleft);

and convert to UTF-32. You get 4 bytes of output for every character converted (plus 4 for the BOM). It should be possible to convert the input piece by piece using a fixed-size outbuf, if one chooses outbytesleft carefully (i.e. 4 * inbytesleft + 4 :-).
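
A minimal sketch of that piece-by-piece conversion, assuming glibc's iconv and that the source encoding name (e.g. "KOI8-R") is known to it. The function name count_chars_in_stream and the buffer sizes are my own choices. It asks for "UTF-32LE" rather than "UTF-32", which as far as I know makes glibc skip the BOM, so the count is simply output bytes divided by 4. Strictly speaking this counts Unicode code points.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Returns the number of characters read from fp, interpreted as
   `encoding`, or -1 on any error (unknown encoding, invalid or
   truncated input, read error). */
long count_chars_in_stream(FILE *fp, const char *encoding)
{
    assert(fp != NULL && encoding != NULL);

    iconv_t cd = iconv_open("UTF-32LE", encoding);
    if (cd == (iconv_t)-1)
        return -1;

    char inbuf[4096];
    char outbuf[4 * sizeof inbuf];
    size_t pending = 0;   /* bytes of a sequence split across two reads */
    long count = 0;
    int ok = 1;
    size_t n;

    while (ok && (n = fread(inbuf + pending, 1, sizeof inbuf - pending, fp)) > 0) {
        char *in = inbuf;
        size_t inleft = pending + n;

        while (ok && inleft > 0) {
            char *out = outbuf;
            size_t outleft = sizeof outbuf;
            size_t rc = iconv(cd, &in, &inleft, &out, &outleft);

            /* The converted text is discarded; only its size matters. */
            count += (long)((sizeof outbuf - outleft) / 4);
            if (rc == (size_t)-1) {
                if (errno == EINVAL)
                    break;          /* incomplete sequence: wait for more input */
                if (errno != E2BIG)
                    ok = 0;         /* EILSEQ or similar: give up */
                /* E2BIG: output buffer was full, just loop again */
            }
        }
        /* Keep any unconsumed tail for the next read. */
        memmove(inbuf, in, inleft);
        pending = inleft;
    }

    if (ferror(fp) || pending != 0)
        ok = 0;
    iconv_close(cd);
    return ok ? count : -1;
}
```

For the file in the question this would be called as count_chars_in_stream(fp, "KOI8-R") after fopen("ru.txt", "rb"). No locale needs to be installed or switched, so the setlocale() concerns do not apply.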

Upvotes: 5
