Reputation: 2663
I would like to get the number of characters in a file. By characters I mean "real" characters, not bytes, and I assume that I know the file's encoding.
I tried to use mbstowcs(), but it doesn't work because it uses the system locale (or the one set with setlocale()). Because setlocale() is not thread-safe, I don't think it's a good idea to call it before calling mbstowcs(). Even if it were thread-safe, I would have to be sure that my program won't "jump" (signal, etc.) between the two calls to setlocale() (one call to set the locale to the encoding of the file, and one call to revert to the previous one).
So, to take an example, imagine we have a file ru.txt encoded using a Russian encoding (KOI8, for example). I would like to open the file and get the number of characters, assuming the encoding of the file is KOI8.
It would be so easy if mbstowcs() could take a source_encoding argument...
EDIT: Another problem with using mbstowcs() is that the locale corresponding to the encoding of the file has to be installed on the system...
Upvotes: 4
Views: 817
Reputation: 25386
To calculate the number of UTF-8 characters (code points) in a file, just pass its contents to this function:
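    #include <string>

    // Counts UTF-8 code points in S, assuming S holds valid UTF-8.
    int CalcUTF8Chars( const std::string& S )
    {
        int Count = 0;
        for ( size_t i = 0; i != S.length(); i++ )
        {
            // Continuation bytes look like 10xxxxxx; every other byte
            // starts a new code point.
            if ( ( S[i] & 0xC0 ) != 0x80 ) { Count++; }
        }
        return Count;
    }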
No external dependencies.
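If the file is too large to load into a single string, the same byte test can be applied chunk by chunk. The following is a minimal sketch, not part of the original answer: the names CountUtf8CodePoints and CountUtf8CodePointsInFile are hypothetical, and it assumes the file contains valid UTF-8.

```cpp
#include <cstddef>
#include <fstream>
#include <istream>
#include <stdexcept>
#include <string>

// Counts UTF-8 code points in a stream by skipping continuation bytes
// (10xxxxxx). Assumes the stream holds valid UTF-8.
std::size_t CountUtf8CodePoints(std::istream& in)
{
    std::size_t count = 0;
    char buffer[4096];

    // read() fails on the last, partial chunk, so also check gcount().
    while (in.read(buffer, sizeof buffer) || in.gcount() > 0)
    {
        const std::streamsize n = in.gcount();
        for (std::streamsize i = 0; i < n; ++i)
        {
            if ((static_cast<unsigned char>(buffer[i]) & 0xC0) != 0x80) { ++count; }
        }
    }
    return count;
}

// Hypothetical convenience wrapper: opens the file in binary mode so that
// no newline translation changes the bytes being counted.
std::size_t CountUtf8CodePointsInFile(const std::string& path)
{
    std::ifstream file(path, std::ios::binary);
    if (!file) { throw std::runtime_error("cannot open " + path); }
    return CountUtf8CodePoints(file);
}
```

Because the test looks at each byte independently, a multi-byte sequence that straddles two chunks is still counted once. Note that a leading byte order mark, if the file has one, is counted as a code point too.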
Update:
In case you want to handle other encodings, you have two choices:
1. Use a third-party library that can handle it, for example ICU: http://site.icu-project.org/
2. Write the counting functions yourself for every encoding you want to support.
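To give an idea of what the second choice involves, here is a rough sketch for two encodings. The function names are hypothetical, and both functions assume well-formed input. KOI8-R is a single-byte encoding, so its character count is simply the byte count; for UTF-16, every code unit except a trailing (low) surrogate starts a new code point.

```cpp
#include <cstddef>
#include <string>

// KOI8-R (like other single-byte encodings): one byte is one character.
std::size_t CountKoi8rChars(const std::string& bytes)
{
    return bytes.size();
}

// UTF-16: count every code unit that is not a low (trailing) surrogate,
// i.e. not in the range 0xDC00..0xDFFF. Assumes well-formed UTF-16.
std::size_t CountUtf16Chars(const std::u16string& units)
{
    std::size_t count = 0;
    for (char16_t u : units)
    {
        if (u < 0xDC00 || u > 0xDFFF) { ++count; }
    }
    return count;
}
```

Each further encoding needs its own rule, which is why a conversion library quickly becomes the more practical option.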
Upvotes: 0
Reputation: 613
I'd suggest using iconv(3):
NAME
    iconv - perform character set conversion

SYNOPSIS
    #include <iconv.h>

    size_t iconv(iconv_t cd,
                 char **inbuf, size_t *inbytesleft,
                 char **outbuf, size_t *outbytesleft);
and convert to UTF-32. You get 4 bytes of output for every character converted (plus 4 for the BOM, which you can usually avoid by asking for an explicit byte order such as UTF-32LE). It should be possible to convert the input piece by piece using a fixed-size outbuf, if one chooses outbytesleft carefully (i.e. 4 * inbytesleft + 4 :-).
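A minimal sketch of this idea follows. The function name CountCharsWithIconv is hypothetical, and the code assumes that the whole input is already in memory and that the iconv implementation knows the requested encoding name and writes no BOM for "UTF-32LE" (glibc behaves this way). The converted text is discarded; only the number of 4-byte units written is counted.

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Counts the characters in `bytes`, interpreted in `encoding` (an iconv
// encoding name), by converting to UTF-32LE and counting 4-byte units.
std::size_t CountCharsWithIconv(const std::string& bytes, const char* encoding)
{
    iconv_t cd = iconv_open("UTF-32LE", encoding);
    if (cd == (iconv_t)-1) { throw std::runtime_error("iconv_open failed"); }

    std::string input = bytes;   // iconv() takes a non-const char**
    char* in = &input[0];
    size_t inLeft = input.size();
    std::size_t count = 0;
    char out[4096];              // scratch buffer, reused on every pass

    while (inLeft > 0)
    {
        char* outPtr = out;
        size_t outLeft = sizeof out;
        const size_t rc = iconv(cd, &in, &inLeft, &outPtr, &outLeft);
        const int err = errno;   // save before anything else can change it
        count += (sizeof out - outLeft) / 4;

        // E2BIG only means the scratch buffer is full, so loop again.
        if (rc == (size_t)-1 && err != E2BIG)
        {
            iconv_close(cd);
            throw std::runtime_error("iconv failed: invalid or truncated input");
        }
    }
    iconv_close(cd);
    return count;
}
```

For the file from the question, the call would look like CountCharsWithIconv(contents, "KOI8-R"), where contents holds the bytes of ru.txt. When feeding a file in chunks instead, an EINVAL error would mean that a multi-byte sequence was cut at the chunk boundary, and the unconsumed bytes would have to be carried over to the next call.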
Upvotes: 5