Reputation: 1714
I have a std::string that contains some characters spanning multiple bytes.
When I take a substring of this string, the output is not valid because, of course, each of these characters is counted as 2 characters. In my opinion I should be using a wstring instead, because it will store each of these characters as one element instead of several.
So I decided to copy the string into a wstring, but of course this does not make sense, because the characters remain split over 2 elements. This only makes it worse.
Is there a good way to convert a string to a wstring that merges each special character into 1 element instead of 2?
Thanks
Upvotes: 4
Views: 10552
Reputation: 299920
Unicode is hard.
std::wstring is not a list of codepoints, it's a list of wchar_t, and their width is implementation-defined (commonly 16 bits with VC++ and 32 bits with gcc and clang). Yes, it means it's useless for portable code...
[...] LL is considered a letter on its own in Spanish).
So... it's a bit hard.
Solving 3) may be costly (it requires specific language/usage annotations); solving 1) and 2) is absolutely necessary... and requires Unicode aware libraries or coding your own (and probably getting it wrong).
[...] uint32_t)
Otherwise, there is probably what you seek in ICU. I wish you good luck finding it.
Upvotes: 1
Reputation: 4320
Based on this I've written my utf8 substring function:
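// Copies the first SubStrLength UTF-8 characters of originalString into csSubstring.
void utf8substr(const std::string& originalString, int SubStrLength, std::string& csSubstring)
{
    int len = 0;
    size_t byteIndex = 0;
    const size_t origSize = originalString.size();
    for (byteIndex = 0; byteIndex < origSize; byteIndex++)
    {
        // Bytes of the form 10xxxxxx continue a character; any other byte starts one.
        if ((originalString[byteIndex] & 0xc0) != 0x80)
            len += 1;
        // Stop at the first byte of the character after the last one wanted.
        // Breaking on len >= SubStrLength would drop the last wanted character.
        if (len > SubStrLength)
            break;
    }
    csSubstring = originalString.substr(0, byteIndex);
}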
Upvotes: 1
Reputation: 41685
A simpler version, based on the solution Marcelo Cantos provided in "Getting the actual length of a UTF-8 encoded std::string?":
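std::string substr(const std::string& originalString, int maxLength)
{
    std::string resultString = originalString;
    int len = 0;        // characters seen so far
    int byteCount = 0;  // bytes consumed so far
    const char* aStr = originalString.c_str();
    while (*aStr)
    {
        // Count only bytes that start a character (not 10xxxxxx continuation bytes).
        if ((*aStr & 0xc0) != 0x80)
            len += 1;
        // The first byte of character maxLength + 1 marks where to cut.
        if (len > maxLength)
        {
            resultString = resultString.substr(0, byteCount);
            break;
        }
        byteCount++;
        aStr++;
    }
    return resultString;
}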
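The function above always starts at the beginning of the string. As a rough sketch of the same continuation-byte test extended to an arbitrary starting character, something like the following might work. The names utf8_substr and is_utf8_lead are made up for this illustration, and the input is assumed to be valid UTF-8:

```cpp
#include <cstddef>
#include <string>

// True if c starts a new UTF-8 character, i.e. it is not a 10xxxxxx continuation byte.
static bool is_utf8_lead(unsigned char c)
{
    return (c & 0xC0) != 0x80;
}

// Hypothetical helper: returns up to `count` characters of `s`, starting at
// character index `start`. Both arguments are measured in characters, not bytes.
// Assumes `s` holds valid UTF-8.
std::string utf8_substr(const std::string& s, std::size_t start, std::size_t count)
{
    std::size_t index = 0;                  // index of the character starting at byte i
    std::size_t first = std::string::npos;  // byte offset where the result begins
    std::size_t i = 0;
    for (; i < s.size(); ++i) {
        if (!is_utf8_lead(static_cast<unsigned char>(s[i])))
            continue;
        if (index == start)
            first = i;
        // Once `count` characters have been collected, byte i is one past the end.
        if (first != std::string::npos && index - start == count)
            break;
        ++index;
    }
    if (first == std::string::npos)
        return std::string();               // start lies beyond the last character
    return s.substr(first, i - first);
}
```

With "h\xC3\xA9llo" (the UTF-8 bytes of "héllo"), utf8_substr(text, 1, 3) should return the four bytes of "éll". Note that this still works on code points, so a letter followed by a combining accent counts as two characters.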
Upvotes: 7
Reputation: 94329
A std::string object is not a string of characters, it's a string of bytes. It has no notion of what's called "encoding" at all. The same goes for std::wstring, except that it's a string of wchar_t values (16 bits wide on Windows, commonly 32 bits elsewhere).
In order to perform operations on your text which require addressing distinct characters (as is the case when you want to take a substring, for instance), you need to know what encoding is used for your std::string object.
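To make the byte-versus-character distinction concrete, here is a small sketch. The helper names utf8_length and sample_text are invented for this example, and the input is assumed to be valid UTF-8:

```cpp
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 string by skipping continuation bytes
// (bytes of the form 10xxxxxx). Assumes the input is valid UTF-8.
std::size_t utf8_length(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

// The word "naïve" encoded as UTF-8: the ï (U+00EF) occupies two bytes, 0xC3 0xAF.
// The literal is split so that the hex escape does not swallow the following letters.
inline std::string sample_text()
{
    return "na\xC3\xAF" "ve";
}
```

Here sample_text().size() is 6 because std::string counts bytes, while utf8_length(sample_text()) is 5. Calling sample_text().substr(0, 3) cuts the ï in half and leaves a dangling 0xC3 byte, which is exactly the kind of invalid output described in the question.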
UPDATE: Now that you clarified that your input string is UTF-8 encoded, you still need to decide on an encoding to use for your output std::wstring. UTF-16 comes to mind, but it really depends on what the API to which you will pass the std::wstring objects expects. Assuming that UTF-16 is acceptable, you have various choices:
The Win32 MultiByteToWideChar function; no extra dependencies required.
Upvotes: 5
Reputation: 153929
There are really only two possible solutions. If you're doing this a lot, over large distances, you'd be better off converting your characters to a single-element encoding, using wchar_t (or int32_t, or whatever is most appropriate). This is not a simple copy, which would convert each individual char into the target type, but a true conversion function, which would recognize the multibyte characters, and convert them into a single element.
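As an illustration of what such a conversion function might look like, here is a minimal sketch that decodes UTF-8 into a std::u32string (C++11), one element per code point. The name utf8_to_utf32 is an assumption of this example, not a standard function. It rejects stray or truncated sequences, but it does not check for overlong encodings or surrogate code points, so treat it as a starting point rather than a validating decoder:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 into one char32_t per code point.
// Throws std::runtime_error on an invalid lead byte, an invalid continuation
// byte, or a sequence cut off by the end of the input.
std::u32string utf8_to_utf32(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes that must follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

After the conversion, substr() on the std::u32string works in code points, so the multibyte characters from the question can no longer be split.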
For occasional use or shorter sequences, it's possible to write your own functions for advancing n bytes. For UTF-8, I use the following:
// Byte and byteCountTable are not shown in this answer; presumably Byte is an
// unsigned char type and byteCountTable maps a first byte to its sequence length.
inline size_t
size(
    Byte ch )
{
    return byteCountTable[ ch ] ;
}

// Advance by size elements in one step when the iterator supports it.
template< typename InputIterator >
InputIterator
succ(
    InputIterator begin,
    size_t size,
    std::random_access_iterator_tag )
{
    return begin + size ;
}

// Fallback for weaker iterators: advance one element at a time.
template< typename InputIterator >
InputIterator
succ(
    InputIterator begin,
    size_t size,
    std::input_iterator_tag )
{
    while ( size != 0 ) {
        ++ begin ;
        -- size ;
    }
    return begin ;
}

// Advance to the start of the next character, or stay at end.
template< typename InputIterator >
InputIterator
succ(
    InputIterator begin,
    InputIterator end )
{
    if ( begin != end ) {
        begin = succ( begin, size( *begin ),
                      typename std::iterator_traits< InputIterator >::iterator_category() ) ;
    }
    return begin ;
}

// Number of characters (not bytes) in [begin, end).
template< typename InputIterator >
size_t
characterCount(
    InputIterator begin,
    InputIterator end )
{
    size_t result = 0 ;
    while ( begin != end ) {
        ++ result ;
        begin = succ( begin, end ) ;
    }
    return result ;
}
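The code above relies on Byte and byteCountTable, neither of which is shown. As a guess at what they might look like (this is an assumption, not the author's actual definition), the table could be built from the UTF-8 lead-byte patterns like this:

```cpp
#include <array>
#include <cstddef>

// Assumed definition: an unsigned 8-bit type used to index the table.
typedef unsigned char Byte;

// Hypothetical stand-in for the missing table. Entry b holds the length in
// bytes of the UTF-8 sequence whose first byte is b. Continuation bytes and
// invalid lead bytes map to 1 so that a scan over bad input still advances.
inline std::array<unsigned char, 256> makeByteCountTable()
{
    std::array<unsigned char, 256> table;
    for (int b = 0; b < 256; ++b) {
        if      ((b & 0x80) == 0x00) table[b] = 1;  // 0xxxxxxx: ASCII
        else if ((b & 0xE0) == 0xC0) table[b] = 2;  // 110xxxxx
        else if ((b & 0xF0) == 0xE0) table[b] = 3;  // 1110xxxx
        else if ((b & 0xF8) == 0xF0) table[b] = 4;  // 11110xxx
        else                         table[b] = 1;  // continuation or invalid byte
    }
    return table;
}

static const std::array<unsigned char, 256> byteCountTable = makeByteCountTable();
```

One caveat with a table-driven approach: with a truncated final sequence, the random access overload of succ can step past end, so input from untrusted sources should probably be validated first.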
Upvotes: 1
Reputation: 1109
Let me assume for simplicity that your encoding is UTF-8. In this case some characters occupy more than one byte, as in your case. You have a std::string in which those UTF-8 encoded characters are stored, and you want to substr() in terms of characters, not bytes. I'd write a function that converts a character count to a byte count. For the UTF-8 case it would look like:
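#include <cstdint>

// Length in bytes (1 to 4) of the UTF-8 sequence introduced by the given lead byte.
// Fully parenthesized, and cast to unsigned char so a signed char shifts predictably.
#define UTF8_CHAR_LEN( byte ) ((( 0xE5000000 >> ((( unsigned char )( byte ) >> 3 ) & 0x1e )) & 3 ) + 1 )

// Returns the number of bytes occupied by the first charCnt characters of utf8Str.
int32_t GetByteCountForCharCount(const char* utf8Str, int charCnt)
{
    int32_t ByteCount = 0;
    // Also stop at the terminating NUL if the string has fewer than charCnt characters.
    for (int i = 0; i < charCnt && *utf8Str != '\0'; i++)
    {
        int charlen = UTF8_CHAR_LEN(*utf8Str);
        ByteCount += charlen;
        utf8Str += charlen;
    }
    return ByteCount;
}
So, say you want the part of the string that starts after its first 7 characters. No problem:
int32_t pos = GetByteCountForCharCount(str.c_str(), 7);
str.substr(pos);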
Upvotes: 0