Anteru

Reputation: 19404

What is the best way to store UTF-8 strings in memory in C/C++?

Looking at the Unicode standard, it recommends using plain chars for storing UTF-8 encoded strings. Does this work as expected with C++ and the basic std::string, or are there cases in which the UTF-8 encoding can create problems?

For example, when computing the length, the number of characters may not be identical to the number of bytes - how is this supposed to be handled? Reading the standard, I'm probably fine using a char array for storage, but I'll still need to write functions like strlen etc. myself so that they work on encoded text, because as far as I understand the problem, the standard routines are either ASCII-only or expect wide characters (16 bit or more), which are not recommended by the Unicode standard. So far, the best source I've found about the encoding stuff is a post on Joel on Software, but it does not explain what we poor C++ developers should use :)

Upvotes: 10

Views: 7841

Answers (6)

David Allan Finch

Reputation: 1424

It depends on what you want to do with the UTF-8 string. If all you are interested in is reading UTF-8 strings in and out, then it all works as long as you have set the correct locale. We have done this for some time. We have several server processes that do nothing with strings as such. Their strings are set by the user in Java and arrive as UTF-8, and we handle them in standard C char buffers. We then send the data back to Java, which converts it back.

If you want the length in UTF-8 characters, then you want functions that can handle the translation for you.

But you can roll your own, for example a utf8-strlen.
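
Such a utf8-strlen doesn't need much: counting every byte that is not a continuation byte is enough, since continuation bytes always look like 10xxxxxx. A minimal sketch (the name is just for illustration; it assumes valid, NUL-terminated UTF-8):

#include <cstddef>

// Counts UTF-8 characters by skipping continuation bytes (of the form 10xxxxxx).
// Assumes the input is valid, NUL-terminated UTF-8.
std::size_t utf8_strlen(const char *s)
{
    std::size_t count = 0;
    for (; *s; ++s)
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80)
            ++count;   // a lead byte (or plain ASCII byte) starts a new character
    return count;
}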

Upvotes: 2

sastanin

Reputation: 41541

From UTF-8 and Unicode FAQ: C support for Unicode:

#include <stdio.h>
#include <locale.h>

int main()
{
  if (!setlocale(LC_CTYPE, "")) {
    fprintf(stderr, "Can't set the specified locale! "
            "Check LANG, LC_CTYPE, LC_ALL.\n");
    return 1;
  }
  printf("%ls\n", L"Schöne Grüße");
  return 0;
}

Also from here:

The good news is that if you use wchar_t* strings and the family of functions related to them such as wprintf, wcslen, and wcslcat, you are dealing with Unicode values. In the C++ world, you can use std::wstring to provide a friendly interface. My only complaint is that these are 32-bit (4 byte) characters, so they are memory hogs for all languages. The reason for this choice is that it guarantees each possible character can be represented by one value.
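
For example, with wide strings the element count is the character count. A minimal sketch (ü and ß are in the BMP, so this holds whether wchar_t is 16-bit or 32-bit):

#include <cwchar>
#include <iostream>
#include <string>

int main()
{
    // L"Grüße" written with universal character names to avoid source-encoding issues
    const wchar_t *wide = L"Gr\u00fc\u00dfe";
    std::wstring ws(wide);
    // Both report 5 characters; the UTF-8 encoding of the same text is 7 bytes.
    std::cout << wcslen(wide) << " " << ws.size() << "\n";
    return 0;
}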

PS. This is probably Linux-specific. There is the ICU library to handle complicated things.

Upvotes: 0

sastanin

Reputation: 41541

An example with ICU library (C, C++, Java):

#include <iostream>
#include <unicode/unistr.h> // using ICU library

int main(int argc, char *argv[]) {
    // constructing a Unicode string
    UnicodeString ustr1("Привет"); // using platform's default codepage
    // calculating the length in UTF-16 code units (6 here, as all characters are in the BMP)
    int ulen1=ustr1.length();
    // extracting encoded characters from a string
    int const bufsize=25;
    char encoded[bufsize];
    ustr1.extract(0,ulen1,encoded,bufsize,"UTF-8"); // forced UTF-8 encoding
    // printing the result
    std::cout << "Length of " << encoded << " is " << ulen1 << "\n";
    return 0;
}

building like

$ g++ -licuuc -o icu-example{,.cc}

running

$ ./icu-example
Length of Привет is 6

Works for me on Linux with GCC 4.3.2 and libicu 3.8.1. Please note that it prints in UTF-8 no matter what the system locale is. You won't see it correctly if yours is not UTF-8.

Upvotes: 3

MSalters

Reputation: 179991

strlen counts the number of non-null chars before the first \0. In UTF-8, that count is a sane number (the number of bytes used), but it is not the number of characters (one UTF-8 character is typically 1-4 bytes). basic_string doesn't store a \0, but it too keeps a byte count.

strcpy or the basic_string copy ctor copies all bytes without looking too closely.

Finding a substring works OK, because of the way UTF-8 is encoded. The allowed values for the first byte of a character are distinct from those of the second to fourth bytes (the former never start with 10xxxxxx, the latter always do).

Taking a substring is tricky - how do you specify the position? If the beginning and end were found by searching for ASCII text markers (e.g. [ and ]) then there's no problem. You'd just get the bytes in the middle, which are a valid UTF-8 string too. You can't hardcode positions, or even relative offsets, though. Even a relative offset of +1 character can be hard; how many bytes is that? You will end up writing a function like SkipOneChar.
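
A SkipOneChar along those lines only needs to inspect the lead byte, which encodes the sequence length. A minimal sketch (name and signature are just for illustration), assuming valid UTF-8:

// Returns a pointer to the start of the next UTF-8 character.
// The lead byte alone determines how many bytes the current character uses.
const char *SkipOneChar(const char *s)
{
    unsigned char lead = static_cast<unsigned char>(*s);
    if ((lead & 0xE0) == 0xC0) return s + 2;   // 110xxxxx: 2-byte sequence
    if ((lead & 0xF0) == 0xE0) return s + 3;   // 1110xxxx: 3-byte sequence
    if ((lead & 0xF8) == 0xF0) return s + 4;   // 11110xxx: 4-byte sequence
    return s + 1;                              // 0xxxxxxx: plain ASCII
}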

Upvotes: 3

user52875

Reputation: 3058

What we settled on: store UTF-8 in a std::string. You can do most operations that way, except for things like computing the length in characters. Use a UTF-8 to std::wstring conversion function (boost::from_utf8, for example) to convert to a std::wstring when you need such operations.
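
For example, one way to do such a conversion without Boost is C++11's std::wstring_convert (later deprecated in C++17, but it illustrates the idea); this is only a sketch, not necessarily what boost::from_utf8 does:

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    // "Grüße" spelled out as UTF-8 bytes so the example does not depend on source encoding
    std::string utf8 = "Gr\xc3\xbc\xc3\x9f" "e";
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    std::wstring wide = conv.from_bytes(utf8);
    std::cout << utf8.size() << " bytes, " << wide.size() << " characters\n"; // 7 bytes, 5 characters
    return 0;
}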

Upvotes: 1

Carl Seleborg

Reputation: 13305

There's a library called "UTF8-CPP", which lets you store your UTF-8 strings in standard std::string objects and provides additional functions to enumerate and manipulate UTF-8 characters.

I haven't tested it yet, so I don't know what it's worth, but I am considering using it myself.
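
Based on its documentation, counting characters with it looks roughly like this (treat the exact calls as an assumption, since I haven't used it myself):

#include <iostream>
#include <string>
#include "utf8.h"   // UTF8-CPP, header-only

int main()
{
    // UTF-8 bytes kept in a plain std::string, as the library intends
    std::string s = "Gr\xc3\xbc\xc3\x9f" "e";   // "Grüße"
    if (utf8::is_valid(s.begin(), s.end()))
        std::cout << s.size() << " bytes, "
                  << utf8::distance(s.begin(), s.end()) << " characters\n"; // 7 bytes, 5 characters
    return 0;
}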

Upvotes: 5
