Anne Quinn

Reputation: 12992

Can I use wstring to read, parse, and emit utf-8?

I'm writing a program that reads translations (EN, JP, SP) from a single .csv file, parses them, then emits them to another file. 8 bits per character isn't enough for this, but using wstring and wchar_t has only managed to scramble the text I read from the file. I'm honestly not sure where to begin here. Researching the topic mostly turns up strong opinions and little useful info.

Is wstring able to do utf-8? Is utf-8 even the thing I should be concerned with?

If I have a u8"string" or L"string" which contains characters from multiple languages, how would I write this to file using only the C standard IO library?

(I am extremely determined to make this work using only the standard IO library, even if it means writing it one byte at a time)

Upvotes: 2

Views: 389

Answers (2)

n. m. could be an AI

Reputation: 119847

Is wstring able to do utf-8?

C++ has a standard facility (wstring_convert, deprecated since C++17) that can convert between wstring and UTF-8 strings. There are also standard functions in both C and C++ (wcstombs, mbstowcs) that may be able to do the same with C-style wide strings if your system has an appropriate locale. Most POSIX-ish systems do; Windows-based ones normally don't (they have non-standard facilities for that). That's about all wstring and UTF-8 have to do with each other.
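As a rough sketch of the wcstombs route: the helper name wide_to_multibyte is made up for illustration, and the output is only UTF-8 if the program has first selected a UTF-8 locale (for example with std::setlocale(LC_ALL, ""), assuming the environment provides one). In the default "C" locale only plain ASCII is likely to convert.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: convert a wide string to the multibyte encoding of
// the current C locale. Call std::setlocale(LC_ALL, "") beforehand if the
// environment locale is UTF-8 and you want UTF-8 out.
// Returns an empty string if a character cannot be represented.
std::string wide_to_multibyte(const std::wstring& ws)
{
    // MB_CUR_MAX is the maximum number of bytes per character in the
    // current locale, so this buffer is always large enough.
    std::string out(ws.size() * MB_CUR_MAX + 1, '\0');

    std::size_t written = std::wcstombs(&out[0], ws.c_str(), out.size());
    if (written == static_cast<std::size_t>(-1))
        return std::string();   // conversion failed in this locale

    out.resize(written);        // drop the unused tail and the terminator
    return out;
}
```

Note that an empty result is ambiguous between "empty input" and "conversion failed"; real code would probably want a separate error flag.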

Is utf-8 even the thing I should be concerned with?

It depends. If you are living in 1980, or don't do any programming, then probably not. If you don't do any character-level processing, and only shuffle entire strings, you should also be fine. Just use char-based strings and don't worry about any fancy characters. It should all work more or less automatically.
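To illustrate the "just shuffle bytes" case: in UTF-8, every byte of a multi-byte character has its high bit set, so an ASCII delimiter such as ',' can never appear inside one. Splitting a CSV line on commas therefore works on plain std::string. This is only a sketch; the name split_fields is made up, and it assumes the file has no quoted fields containing commas.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split one line on commas, treating the text as
// opaque bytes. Assumes no quoting or escaping of commas in the input.
std::vector<std::string> split_fields(const std::string& line)
{
    std::vector<std::string> fields;
    std::size_t start = 0;
    for (;;) {
        std::size_t comma = line.find(',', start);
        if (comma == std::string::npos) {
            fields.push_back(line.substr(start));   // last field
            break;
        }
        fields.push_back(line.substr(start, comma - start));
        start = comma + 1;                          // skip the comma
    }
    return fields;
}
```

The fields come back as the same UTF-8 bytes that went in, so they can be written to the output file without any conversion.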

If you do need character-level stuff (substrings, search, ...) you probably do need to be aware of UTF-8. It's probably wise to do all internal processing with either wchar_t or char32_t, and convert from or to UTF-8 upon I/O. (I would just say "use wchar_t" but alas, on Windows wchar_t is broken: it is only 16 bits wide and holds UTF-16 code units, so one wchar_t cannot represent every character. You may still be able to get away with it, but no promises.)
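For the "convert on input" half, here is a minimal hand-rolled decoder from UTF-8 bytes to char32_t, using nothing beyond the standard library. The name decode_utf8 is hypothetical, and the sketch is deliberately lenient: it replaces truncated or malformed sequences with U+FFFD but does not reject overlong encodings or surrogate values, which a strict decoder should.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points.
// Malformed or truncated sequences become U+FFFD. Overlong forms and
// surrogates are NOT rejected in this sketch.
std::u32string decode_utf8(const std::string& bytes)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;   // continuation bytes expected after the lead

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }   // invalid lead byte

        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= bytes.size()) { ok = false; break; }        // truncated
            unsigned char b = static_cast<unsigned char>(bytes[i]);
            if ((b & 0xC0) != 0x80) { ok = false; break; }       // not 10xxxxxx
            cp = (cp << 6) | (b & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : char32_t(0xFFFD));
    }
    return out;
}
```

After decoding, each element of the result is one code point, so indexing and searching work per character rather than per byte.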

If I have a u8"string" or L"string" which contains characters from multiple languages, how would I write this to file using only the C standard IO library?

You cannot do much about u8"string" in C. In C++ (before C++20, which changed their type to char8_t), they are normal char-based strings and can be written as any other string, and do the right thing. (You may have to jump through some hoops on Windows; see the _setmode and _O_U8TEXT docs.) This is, however, of minor importance. You normally don't need to have any fancy characters in string literals. All user-facing strings should be loaded from files.
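A minimal sketch of writing char-based UTF-8 text with only the C standard IO functions. The helper name write_utf8_file is made up; the text is assumed to already be UTF-8 bytes in a std::string (for example read from a file, or a pre-C++20 u8 literal). Opening in binary mode keeps the runtime from translating newlines, so the bytes land in the file unchanged.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: write UTF-8 bytes to a file unchanged.
// Returns true only if every byte was written and the file closed cleanly.
bool write_utf8_file(const char* path, const std::string& utf8)
{
    std::FILE* f = std::fopen(path, "wb");   // "b": no newline translation
    if (!f)
        return false;

    std::size_t n = std::fwrite(utf8.data(), 1, utf8.size(), f);
    bool ok = (n == utf8.size());

    if (std::fclose(f) != 0)                 // fclose flushes; it can fail too
        ok = false;
    return ok;
}
```

No byte-order mark is written here; whether the consumer of the file wants one is a separate decision.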

With wchar_t based strings, you may or may not be able to output UTF-8 directly, depending on your OS and compiler. You can always convert to UTF-8 and output that.
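The "convert to UTF-8 and output that" step can also be done by hand. The sketch below assumes the text is held as char32_t code points (on platforms with a 32-bit wchar_t the same arithmetic applies; on Windows the UTF-16 surrogate pairs would have to be combined first). The names append_utf8, encode_utf8 and put_utf8 are hypothetical.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: append the UTF-8 encoding of one code point.
// Surrogates and values above U+10FFFF are replaced by U+FFFD.
void append_utf8(std::string& out, char32_t cp)
{
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        cp = 0xFFFD;

    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx + 3 continuation
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Encode a whole UTF-32 string as UTF-8 bytes.
std::string encode_utf8(const std::u32string& text)
{
    std::string out;
    for (char32_t cp : text)
        append_utf8(out, cp);
    return out;
}

// Emit a UTF-32 string to an open file as UTF-8, one byte at a time.
bool put_utf8(std::FILE* f, const std::u32string& text)
{
    const std::string bytes = encode_utf8(text);
    for (char c : bytes)
        if (std::fputc(static_cast<unsigned char>(c), f) == EOF)
            return false;
    return true;
}
```

The file passed to put_utf8 should be opened in binary mode ("wb") so that no newline translation is applied to the encoded bytes.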

If you are willing to use third-party libraries, consider using http://utfcpp.sourceforge.net/

Also read: http://utf8everywhere.org http://www.joelonsoftware.com/articles/Unicode.html

Upvotes: 3

Rabbid76

Reputation: 210878

Convert between wstring and UTF-8 (in both directions):

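// Note: std::wstring_convert and <codecvt> are deprecated since C++17.
#include <string>
#include <codecvt>
#include <locale>   // std::wstring_convert is declared here

// UTF-8 encoded bytes -> wide string
std::wstring wstring_convert_from_char( const char *str )
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> converter;
    return converter.from_bytes( str );   // throws std::range_error on invalid input
}

// wide string -> UTF-8 encoded bytes
std::string string_convert_from_wchar( const wchar_t *str )
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> converter;
    return converter.to_bytes( str );     // throws std::range_error on failure
}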

Upvotes: 2
