Reputation: 53193
Most answers and questions here on SO put L before any UTF-8 string. I have found no explanation of what it is. According to my IDE, the constant is defined in winnt.h.
This is how I use it, without knowing what it is:
std::wcout<<L"\"Přetečení zásobníku\" is Stack overflow in Czech.";
Obviously, constant concatenation cannot be applied to variables:
void printUTF8(const char* str) {
//Does not make the slightest bit of sense
std::wcout<<L str;
}
So what is it and how to add it to dynamic strings?
Upvotes: 0
Views: 2273
Reputation: 145289
Re your actual question
“what is [the L prefix] and how to add it to dynamic strings?”
This is very different from the title of the question at the time I’m writing this, namely “How can I make dynamic strings to work with UTF-8 in console?”
In short, UTF-8 is an encoding of Unicode where the basic encoding unit is 8 bits, commonly called a byte (more precisely it's an octet), while the L prefix forms a wide character or string literal, where the encoding unit typically is 16 or 32 bits – in Windows it’s 16 bits, as in original Unicode.
A wide character or string literal is based on the wchar_t type instead of char.
In Windows a wide string is encoded as UTF-16. The most common sixty thousand or so Unicode characters are represented with single wchar_t values, but some seldom used Chinese ideograms etc. require two successive wchar_t values, called a surrogate pair.
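To make the surrogate pair business concrete, here is a minimal sketch of how one code point maps to UTF-16 units. The name to_utf16 is made up for illustration, and I use char16_t rather than wchar_t so that it behaves the same on systems where wchar_t is 32 bits:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not a library function): encode one Unicode
// code point as one or two UTF-16 code units.
std::u16string to_utf16(char32_t cp) {
    // Only valid scalar values: at most U+10FFFF and not itself a surrogate.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single 16-bit unit is enough.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: split the 20-bit offset into two 10-bit halves.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}
```

With this, to_utf16(0x159) (the ř from the question) yields the single unit 0x0159, while to_utf16(0x20000) (an ideograph outside the basic plane) yields the two units 0xD840 0xDC00.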
The use of a 16-bit encoding unit in Windows was established around 1992. I am not sure when UTF-16 was adopted (as an extension of the then UCS-2 encoding); it was just a bit later. So this was established long before C99 required that all characters of the wide character set should be representable with single wchar_t values. That requirement appears to have been a pure political maneuver, ensuring that no Windows C compiler could be formally conforming, a general ISO programming language standard that applied only to Unix-land. Unfortunately, since C++11 was based on C99 we now have that also in C++11, ensuring that no Windows C++ compiler can be fully conforming. Pure idiocy. If you ask me.
Errata, re deleted text above: according to Wikipedia’s article about it the wording about a single wchar_t being sufficient for any character in the “extended character set” was there already in C90. Which makes the incompatibility between Windows and the C and C++ standards the fault of Microsoft, not the fault of the C committee. It still appears to be political and fairly idiotic, but (enlightened) with others to blame than I maintained at first…
One way to work with wide dynamic strings is to use std::wstring, from the <string> header.
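The point is that the L prefix only applies to literals in the source code; a string built at run time is simply a std::wstring, and nothing needs to be "added" to it. A minimal sketch (make_greeting is a made-up name, just for illustration):

```cpp
#include <string>

// Hypothetical helper: builds a wide string at run time.
// Only the literals carry the L prefix; `name` is already wide.
std::wstring make_greeting(const std::wstring& name) {
    std::wstring text = L"Welcome to Windows C++, ";
    text += name;   // dynamic part, no prefix involved
    text += L'!';   // wide character literal
    return text;
}
```

The result can be sent to std::wcout like any wide literal.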
With Visual C++ you can use a wmain function instead of standard main, as an easy way to get wide command line arguments.
wmain is also supported by MinGW64 (IIRC) g++, although not yet by ordinary MinGW g++, as of g++ 4.8.something. It is however easy to implement in terms of the Windows API. Unless you require strict standard-conforming code that provides the special main function features such as ability to declare it with or without arguments, but hey, let's be practical about things.
Example that compiles fine with both Visual C++ 12.0 and g++ 4.8.2:
// Source encoding: UTF-8 with BOM.
#include <io.h> // _setmode
#include <fcntl.h> // _O_WTEXT
#include <iostream> // std::wcout, std::endl
#include <string> // std::wstring
using namespace std;
auto main()
    -> int
{
    _setmode( _fileno( stdin ), _O_WTEXT );
    _setmode( _fileno( stdout ), _O_WTEXT );
    wcout << L"Hi, what’s your name? ";
    wstring username;
    getline( wcin, username );
    wcout << L"Welcome to Windows C++, " << username << "!" << endl;
}
Note that with Windows ANSI source this won’t compile with g++ unless you specify the source encoding with the appropriate compiler option.
Upvotes: 0
Reputation: 909
L is an indication to the compiler that the string is composed of "wide characters". In Windows, these would be UTF-16 - each code unit that ends up in the string is 16 bits, or two bytes, wide, and most characters take exactly one unit:
L"This is a wide string"
In contrast, a UTF-8 string is always a string composed of bytes. ASCII characters (A-Z 0-9 etc) are encoded the way they have always been - in the range 0x00 to 0x7F (or 0 to 127). International characters (like ř) are encoded using multiple bytes in the range 0x80 to 0xFF - there is a very good explanation on Wikipedia. The advantage is that it can be represented using ordinary C strings.
"This is an ordinary string, but also a UTF-8 string"
"This is a C cedilla in UTF-8: \xc3\x87"
However, if you are typing these international characters into actual code, your editor needs to know that you are typing in UTF-8 so it can encode the characters correctly - like the C cedilla above. Then the string will be passed correctly to your function.
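To see the "one character, several bytes" effect in code, here is a small sketch that counts characters (code points) in a UTF-8 string by skipping continuation bytes. utf8_length is a made-up name, and the sketch assumes the input is well-formed UTF-8:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a well-formed UTF-8 string.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        // Continuation bytes have the form 10xxxxxx;
        // every other byte starts a new character.
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the C cedilla above, std::string("\xc3\x87").size() is 2 (bytes) while utf8_length("\xc3\x87") is 1 (character).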
In your case, your comment indicates that you are using UTF-16. In which case there are two other issues:
The console will, by default, not output Unicode characters correctly. You need to change the font to a TrueType font like Lucida Console.
You also need to change the output mode to a Unicode UTF-16 one. You can do this with:
_setmode(_fileno(stdout), _O_U16TEXT);
Code example:
#include <iostream>
#include <io.h>
#include <fcntl.h>
int wmain(int argc, wchar_t* argv[])
{
_setmode(_fileno(stdout), _O_U16TEXT);
std::wcout << L"Přetečení zásobníku is Stack overflow in Czech." << std::endl;
}
Upvotes: 1
Reputation: 179917
L""
is a WIDE string. That is to say, it's a wchar_t[1]. UTF-8 strings can't be wide, since they are multi-byte (variable length). VC++ is slightly wrong and made wide strings variable length, UTF-16 to be precise. But usually they're UTF-32.
The problem with multi-byte strings is that there are many different encodings, and UTF-8 is only one of them. Windows does not in fact natively support UTF-8 encodings. MessageBoxA() for instance can use any encoding but UTF-8. There's just one exception to that, which is MultiByteToWideChar(CP_UTF8, ...), which is what you'd need here.
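A rough sketch of how that conversion could look for the dynamic strings in the question. The name widen_utf8 is made up; on Windows it calls MultiByteToWideChar twice (first to get the size, then to convert). The non-Windows branch is only my own fallback so the sketch compiles elsewhere; it assumes well-formed UTF-8 and a 32-bit wchar_t:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: convert a UTF-8 byte string to a wide string.
std::wstring widen_utf8(const std::string& utf8) {
    std::wstring wide;
    if (utf8.empty()) return wide;
#ifdef _WIN32
    const int bytes = static_cast<int>(utf8.size());
    // First call: ask how many wchar_t units the result needs.
    const int units = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), bytes, nullptr, 0);
    assert(units > 0);  // 0 signals failure
    wide.resize(static_cast<std::size_t>(units));
    // Second call: convert into the buffer.
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), bytes, &wide[0], units);
#else
    // Fallback (assumption: well-formed input, wchar_t holds a full code point).
    for (std::size_t i = 0; i < utf8.size();) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        // Number of continuation bytes implied by the lead byte.
        const std::size_t extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
        assert(i + extra < utf8.size());  // sequence must not be truncated
        char32_t cp = extra == 0 ? lead : (lead & (0x3F >> extra));
        for (std::size_t k = 1; k <= extra; ++k) {
            // Each continuation byte contributes its low 6 bits.
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        }
        wide.push_back(static_cast<wchar_t>(cp));
        i += extra + 1;
    }
#endif
    return wide;
}
```

The printUTF8 function from the question could then do std::wcout << widen_utf8(str), provided the console output mode has been set up as shown in the other answers.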
Upvotes: 1