Reputation: 1572
I have a multi-byte string containing a mixture of Japanese and Latin characters. I'm trying to copy parts of this string to a separate memory location. Since it's a multi-byte string, some of the characters use one byte and others use two. When copying parts of the string, I must not copy "half" Japanese characters. To do this properly, I need to be able to determine where characters start and end in the multi-byte string.
As an example, if the string contains 3 characters which require [2 bytes][2 bytes][1 byte], I must copy either 2, 4 or 5 bytes to the other location and not 3, since copying 3 bytes would copy only half of the second character.
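To make the constraint concrete, here is a minimal sketch of the rule. The helper name SafeCopyLength is hypothetical, and it assumes the byte length of each character is already known, which is exactly the part that has to come from walking the string correctly:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: given the byte length of each character, in order,
// return the largest byte count <= maxBytes that ends on a character boundary.
std::size_t SafeCopyLength(const std::vector<std::size_t>& charLengths,
                           std::size_t maxBytes)
{
    std::size_t total = 0;
    for (std::size_t len : charLengths)
    {
        if (total + len > maxBytes)
            break;              // taking this character would split it
        total += len;
    }
    return total;
}
```

With the lengths {2, 2, 1} from the example, a limit of 3 bytes is rounded down to 2, while limits of 4 and 5 are kept as they are.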
To figure out where characters start and end in the multi-byte string, I'm trying to use the Windows API functions CharNext and CharNextExA, but without luck. When I use these functions, they navigate through my string one byte at a time rather than one character at a time. According to MSDN, "The CharNext function retrieves a pointer to the next character in a string."
Here's some code to illustrate this problem:
#include <windows.h>
#include <stdio.h>
#include <wchar.h>
#include <string.h>
/* string consisting of six "asian" characters */
wchar_t wcsString[] = L"\u9580\u961c\u9640\u963f\u963b\u9644";
int main()
{
    // Convert the asian string from wide char to multi-byte.
    LPSTR mbString = new char[1000];
    WideCharToMultiByte( CP_UTF8, 0, wcsString, -1, mbString, 100, NULL, NULL);
    // Count the number of characters in the string.
    int characterCount = 0;
    LPSTR currentCharacter = mbString;
    while (*currentCharacter)
    {
        characterCount++;
        currentCharacter = CharNextExA(CP_UTF8, currentCharacter, 0);
    }
}
(Please ignore the memory leak and the lack of error checking.)
Now, in the example above I would expect characterCount to become 6, since that's the number of characters in the Asian string. Instead, characterCount becomes 18, because CharNextExA advances one byte at a time and mbString contains 18 bytes (three per character):
門阜陀阿阻附
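The figure of 18 can be checked without any Windows API. The following is a small sketch (the function names are made up for illustration) that computes how many bytes UTF-8 needs per code point and sums that over the six code points used in wcsString:

```cpp
#include <cstddef>

// Bytes needed to encode one code point in UTF-8.
// Assumes cp is a valid scalar value (<= 0x10FFFF, not a surrogate).
std::size_t Utf8EncodedLength(char32_t cp)
{
    if (cp < 0x80)    return 1;   // 0xxxxxxx
    if (cp < 0x800)   return 2;   // 110xxxxx 10xxxxxx
    if (cp < 0x10000) return 3;   // 1110xxxx 10xxxxxx 10xxxxxx
    return 4;                     // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
}

// Total UTF-8 length, in bytes, of an array of code points.
std::size_t Utf8TotalLength(const char32_t* codePoints, std::size_t count)
{
    std::size_t total = 0;
    for (std::size_t i = 0; i < count; ++i)
        total += Utf8EncodedLength(codePoints[i]);
    return total;
}
```

All six code points lie between U+0800 and U+FFFF, so each takes three bytes and the total is 6 x 3 = 18.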
I don't understand how it's supposed to work. How is CharNext supposed to know whether "é–€é" in the string is an encoded version of a Japanese character, or in fact the characters é – € and é?
Some notes:
EDIT: Apparently the CharNext functions don't support UTF-8, but Microsoft forgot to document that. I threw/copy-pasted together my own routine, which I won't use and which needs improving. I'm guessing it's easily crashable.
LPSTR CharMoveNext(LPSTR szString)
{
    if (szString == 0 || *szString == 0)
        return 0;
    if ( (szString[0] & 0x80) == 0x00)
        return szString + 1;
    else if ( (szString[0] & 0xE0) == 0xC0)
        return szString + 2;
    else if ( (szString[0] & 0xF0) == 0xE0)
        return szString + 3;
    else if ( (szString[0] & 0xF8) == 0xF0)
        return szString + 4;
    else
        return szString + 1;
}
Upvotes: 2
Views: 2027
Reputation: 241
static const char *CharNextUTF8(const char *psz)
{
    // get the first char, and then move the
    // pointer to the next byte by default.
    BYTE c = (BYTE)*psz++;
    // if the highest bit of the char is set ...
    if (c & 0x80)
    {
        BYTE x = 0;
        // count the continuous bits set after the highest bit,
        // that means to calculate the count of following bytes.
        while (c & 0x40)
        {
            c <<= 1;
            x++;
        }
        // ok, there should be 'x' bytes following the first byte.
        for (BYTE i = 0; i < x; i++)
        {
            // if any byte is not a valid following byte...
            if ((psz[i] & 0xC0) != 0x80)
            {
                goto done;
            }
        }
        // all the following bytes are valid,
        // move the pointer to skip all.
        psz += x;
    }
done:
    return psz;
}
Upvotes: 0
Reputation: 308462
Given that CharNextExA doesn't work with UTF-8, you can parse it yourself. Just skip over the bytes that have 10 in the top two bits; those are continuation bytes. You can see the pattern in the definition of UTF-8: http://en.wikipedia.org/wiki/Utf-8
LPSTR CharMoveNext(LPSTR szString)
{
    ++szString;
    while ((*szString & 0xc0) == 0x80)
        ++szString;
    return szString;
}
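As a rough illustration of how this idea applies to the original goal (copying part of a string without splitting a character), here is a portable sketch. NextUtf8Char and CopyWholeChars are hypothetical names, the input is assumed to be NUL-terminated and reasonably well-formed UTF-8, and dst is assumed to have room for maxBytes + 1 bytes:

```cpp
#include <cstddef>
#include <cstring>

// Same idea as CharMoveNext above, but for const strings, and it does not
// step past the terminating NUL.
const char* NextUtf8Char(const char* s)
{
    if (*s == '\0')
        return s;
    ++s;
    // Skip continuation bytes (10xxxxxx).
    while ((static_cast<unsigned char>(*s) & 0xC0) == 0x80)
        ++s;
    return s;
}

// Copy at most maxBytes bytes of src into dst, stopping on a character
// boundary, and NUL-terminate dst. Returns the number of bytes copied.
std::size_t CopyWholeChars(char* dst, const char* src, std::size_t maxBytes)
{
    const char* end = src;
    for (;;)
    {
        const char* next = NextUtf8Char(end);
        if (next == end || static_cast<std::size_t>(next - src) > maxBytes)
            break;              // end of string, or next character will not fit
        end = next;
    }
    const std::size_t n = static_cast<std::size_t>(end - src);
    std::memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

For a string made of two three-byte characters followed by "A", a limit of 4 bytes copies 3 bytes (one whole character) rather than cutting the second character in half.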
Upvotes: 3
Reputation: 2752
Try using 932 for the code page. I don't think CP_UTF8 is a real code page, and it may only work for WideCharToMultiByte() and back. You can also try isleadbyte(), but that requires either setting the locale correctly or setting the default code page correctly. I have successfully used IsDBCSLeadByteEx(), but never with CP_UTF8.
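To show what lead-byte walking looks like for code page 932 (Shift-JIS), here is a portable sketch that does not call the Windows API. The function names are hypothetical, and the lead-byte ranges 0x81-0x9F and 0xE0-0xFC are hard-coded as an assumption about what IsDBCSLeadByteEx(932, ...) reports; on Windows you would call that function instead. It only applies if the string was converted to code page 932 in the first place.

```cpp
#include <cstddef>

// Assumption: code page 932 lead bytes are 0x81-0x9F and 0xE0-0xFC.
bool IsCp932LeadByte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Advance by one character: two bytes after a lead byte, otherwise one.
const char* NextCp932Char(const char* s)
{
    if (*s == '\0')
        return s;
    if (IsCp932LeadByte(static_cast<unsigned char>(*s)) && s[1] != '\0')
        return s + 2;
    return s + 1;
}

// Count characters by walking from the start of the string.
std::size_t CountCp932Chars(const char* s)
{
    std::size_t n = 0;
    while (*s)
    {
        s = NextCp932Char(s);
        ++n;
    }
    return n;
}
```

Note that a trail byte can fall in the ASCII range (0x7B in the bytes 93 FA 96 7B, for instance), so a boundary can only be found reliably by walking forward from the start of the string, not by inspecting a byte in isolation.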
Upvotes: 0
Reputation: 792777
As far as I can determine (Google and experimentation), CharNextExA doesn't actually work with UTF-8, only with supported multibyte encodings that use shorter lead/trail byte pairs or single-byte characters.
UTF-8 is a fairly regular encoding. There are a lot of libraries that will do what you want, but it's also fairly easy to roll your own.
Have a look at unicode.org, particularly Table 3-7, for the valid sequence forms.
const char* NextUtf8( const char* in )
{
    if( in == NULL || *in == '\0' )
        return in;
    unsigned char uc = static_cast<unsigned char>(*in);
    if( uc < 0x80 )
    {
        return in + 1;
    }
    else if( uc < 0xc2 )
    {
        // throw error? invalid lead byte
    }
    else if( uc < 0xe0 )
    {
        // check in[1] for validity( 0x80 .. 0xBF )
        return in + 2;
    }
    else if( uc < 0xe1 )
    {
        // check in[1] for validity( 0xA0 .. 0xBF )
        // check in[2] for validity( 0x80 .. 0xBF )
        return in + 3;
    }
    else // ... etc.
    // ...
}
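One possible way to fill in the remaining cases is sketched below. It is not the only reasonable design: the names Utf8SequenceLength and NextUtf8Checked are hypothetical, the byte ranges are taken from the well-formed sequence table in the Unicode Standard (Table 3-7), and the choice to skip a single byte on ill-formed input rather than report an error is an assumption.

```cpp
#include <cstddef>

namespace {
inline bool InRange(unsigned char b, unsigned char lo, unsigned char hi)
{
    return b >= lo && b <= hi;
}
}

// Length (1-4) of the well-formed UTF-8 sequence starting at 'in', or 0 if
// the bytes there are ill-formed. 'in' must point into a NUL-terminated
// string; a NUL never matches a trail range, so no read goes past it.
std::size_t Utf8SequenceLength(const char* in)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*>(in);
    const unsigned char b0 = p[0];
    if (b0 < 0x80) return 1;            // single byte
    if (b0 < 0xC2) return 0;            // stray trail byte or overlong lead

    unsigned char lo = 0x80, hi = 0xBF; // allowed range for the second byte
    std::size_t len;
    if (b0 <= 0xDF) {
        len = 2;
    } else if (b0 <= 0xEF) {
        len = 3;
        if (b0 == 0xE0) lo = 0xA0;      // exclude overlong forms
        else if (b0 == 0xED) hi = 0x9F; // exclude surrogates
    } else if (b0 <= 0xF4) {
        len = 4;
        if (b0 == 0xF0) lo = 0x90;      // exclude overlong forms
        else if (b0 == 0xF4) hi = 0x8F; // exclude values above U+10FFFF
    } else {
        return 0;                       // 0xF5..0xFF never appear in UTF-8
    }

    if (!InRange(p[1], lo, hi)) return 0;
    for (std::size_t i = 2; i < len; ++i)
        if (!InRange(p[i], 0x80, 0xBF)) return 0;
    return len;
}

// Advance to the next character; on ill-formed input, skip one byte.
const char* NextUtf8Checked(const char* in)
{
    if (in == nullptr || *in == '\0')
        return in;
    const std::size_t len = Utf8SequenceLength(in);
    return in + (len != 0 ? len : 1);
}
```

A caller that wants strict behaviour can test Utf8SequenceLength for 0 directly and stop or raise an error instead of skipping the byte.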
Upvotes: 3
Reputation: 135413
There is a really good explanation of what is going on here at the Sorting it All Out blog: Is CharNextExA broken?. In short, CharNext is not designed to work with UTF-8 strings.
Upvotes: 4
Reputation: 6076
This isn't a direct answer to your question, but you may find the following tutorial helpful, I certainly did. In fact the information provided here is enough that you should be able to traverse the multi-byte string yourself with ease:
Upvotes: 0