Reputation: 81
So basically, I need to be able to create a text file in Unicode, but whatever I do it keeps saving in ANSI.
Here's my code:
wchar_t name[] = L"中國哲學書電子化計劃";
FILE * pFile;
pFile = fopen("chineseLetters.txt", "w");
fwrite(name, sizeof(wchar_t), sizeof(name), pFile);
fclose(pFile);
And here is the output of my "chineseLetters.txt":
-NWòTx[øfû–P[SŠƒR õ2123
Also, the application is in MBCS and cannot be changed into Unicode, because it needs to work with both Unicode and ANSI.
I'd really appreciate some help here. Thanks.
Thanks for all the quick replies! It works!
Simply adding L"\uFFFE中國哲學書電子化計劃" still didn't work (the text editor still recognized the file as CP1252), so I did two fwrite calls instead of one: one for the BOM and one for the characters. Here's my code now:
wchar_t name[] = L"中國哲學書電子化計劃";
unsigned char bom[] = { 0xFF, 0xFE };
FILE * pFile;
pFile = fopen("chineseLetters.txt", "w");
fwrite(bom, sizeof(unsigned char), sizeof(bom), pFile);
fwrite(name, sizeof(wchar_t), wcslen(name), pFile);
fclose(pFile);
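For completeness, here's the same approach as a self-contained sketch (assuming Windows/MSVC, where wchar_t is 2 bytes). Opening the file in binary mode ("wb") avoids the CRT's text-mode newline translation, which could otherwise mangle any UTF-16 code unit that happens to contain a 0x0A byte:
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    wchar_t name[] = L"中國哲學書電子化計劃";
    unsigned char bom[] = { 0xFF, 0xFE }; /* UTF-16LE byte order mark */

    /* "wb" = binary mode: the UTF-16 bytes are written exactly as they are in memory */
    FILE * pFile = fopen("chineseLetters.txt", "wb");
    if (pFile == NULL)
        return 1;

    fwrite(bom, sizeof(unsigned char), sizeof(bom), pFile);
    fwrite(name, sizeof(wchar_t), wcslen(name), pFile);
    fclose(pFile);
    return 0;
}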
Upvotes: 5
Views: 8290
Reputation: 42974
"Unicode" is a generic term, and you may want to clarify which kind of Unicode encoding you plan to use in your file.
Unicode UTF-8 is a common choice: it's particularly well suited to exchanging text data across different platforms, since it has no concept of "endianness" and thus none of the little-endian/big-endian confusion of UTF-16, and it's widely used across the Internet. But there are also other options, like e.g. UTF-16 on Windows, which directly maps to wchar_t strings in Visual C++.
If you are using Visual C++, you can specify a ccs attribute in the second parameter of fopen() (or _wfopen()), choosing your desired encoding, e.g. "ccs=UTF-8" for the UTF-8 encoding.
You can read more details about that in the MSDN documentation of fopen(), e.g.:
fopen supports Unicode file streams. To open a Unicode file, pass a ccs flag that specifies the desired encoding to fopen, as follows:
fp = fopen("newfile.txt", "rt+, ccs=encoding");
Allowed values of encoding are UNICODE, UTF-8, and UTF-16LE.
I think UNICODE here refers to the CRT's default wide-character (UTF-16) text mode; the other two options are self-explanatory.
EDIT
I tried with this code, and it works fine in saving the Chinese text using Unicode UTF-8 (I used Visual Studio 2013):
wchar_t name[] = L"中國哲學書電子化計劃";
FILE * file = fopen("C:\\TEMP\\ChineseLetters.txt", "wt, ccs=UTF-8");
if (file == NULL) { /* check for error, e.g. report and bail out */ }
fwrite(name, sizeof(wchar_t), _countof(name)-1, file);
fclose(file);
Note that, after pasting the Chinese text into the source file and saving it, the Visual Studio editor figured out that it needed to save the source file in Unicode to not lose text information, and showed a dialog box asking for confirmation.
So, consider saving the source file in Unicode if you have some "hard-coded" Unicode text in it (in production-quality Windows/C++ code, you may want to save text in resource files).
Note also that I used _countof() instead of sizeof() in the fwrite() call.
You had:
fwrite(name, sizeof(wchar_t), sizeof(name), file);
but that is wrong, since the third argument must be the count of wchar_ts, not the total size in bytes (note that in MSVC, sizeof(wchar_t) == 2, i.e. a wchar_t is two chars, i.e. two bytes).
Moreover, you have to subtract 1 from the total buffer length in wchar_ts, since you don't want to write the NUL-terminating wchar_t of the Unicode string buffer.
(In the case of a Unicode UTF-16 wchar_t string whose size is not known statically, you can simply use wcslen() to get the count of wchar_ts excluding the terminating NUL.)
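To make the size arithmetic concrete, here is a quick illustration (assuming MSVC, where sizeof(wchar_t) == 2; it reuses the name and file variables from the snippet above):
wchar_t name[] = L"中國哲學書電子化計劃"; /* 10 characters + terminating NUL */
/* sizeof(name)       == 22 : total bytes, including the NUL wchar_t       */
/* _countof(name)     == 11 : wchar_t elements, including the NUL          */
/* _countof(name) - 1 == 10 : wchar_t elements you actually want to write  */
/* wcslen(name)       == 10 : the same count, computed at run time         */
fwrite(name, sizeof(wchar_t), _countof(name) - 1, file);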
(Screenshot omitted: the UTF-8 file written above opens correctly in Word.)
Upvotes: 3
Reputation: 536489
I need to be able to create a text file in Unicode
Unicode is not an encoding; do you mean UTF-16LE? That is the two-byte-code-unit encoding Windows x86/x64 uses for internal string storage in memory, and some Windows applications like Notepad misleadingly describe UTF-16LE as “Unicode” in their UI.
fwrite(name, sizeof(wchar_t), sizeof(name), pFile);
You've copied the memory storage of the string directly to a file. If you compile this under Windows/MSVCRT then because the internal storage encoding is UTF-16LE, the file you have produced is encoded as UTF-16LE. If you compile this in other environments you will get different results.
And here is the output of my "chineseLetters.txt": -NWòTx[øfû–P[SŠƒR õ2123
That's what the UTF-16LE-encoded data would look like if you misinterpreted the file as Windows Code Page 1252 (Western European).
If you have loaded the file into a Windows application such as Notepad, it probably doesn't know that the file contains UTF-16LE-encoded data, and so defaults to reading the file using your default locale-specific (ANSI, mbcs) code page as the encoding, resulting in the above mojibake.
When you are making a UTF-16 file you should put a Byte Order Mark character U+FEFF at the start of it to let the consumer know whether it's UTF-16LE or UTF-16BE. This also gives apps like Notepad a hint that the file contains UTF-16 at all, and not ANSI. So you would probably find that writing L"\uFEFF中國哲學書電子化計劃"
would make the output file display better in Notepad.
But it's probably better to convert the wchar_ts into char bytes in a particular desired encoding stated explicitly (eg UTF-8), rather than relying on whatever in-memory storage format the C library happens to use. On Win32 you can do this using the WideCharToMultiByte API, or with fopen's ccs attribute as described by Mr.C64. If you choose to write a UTF-16LE file with ccs, it will put the BOM in for you.
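A minimal sketch of the WideCharToMultiByte route (assuming Windows, and that a UTF-8 file without a BOM is acceptable; the file name is just an example):
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const wchar_t name[] = L"中國哲學書電子化計劃";

    /* First call: ask how many UTF-8 bytes the conversion needs (including the NUL). */
    int bytes = WideCharToMultiByte(CP_UTF8, 0, name, -1, NULL, 0, NULL, NULL);
    if (bytes <= 0)
        return 1;

    char * utf8 = (char *)malloc(bytes);
    if (utf8 == NULL)
        return 1;

    /* Second call: do the actual UTF-16 -> UTF-8 conversion. */
    WideCharToMultiByte(CP_UTF8, 0, name, -1, utf8, bytes, NULL, NULL);

    FILE * f = fopen("chineseLetters.txt", "wb");
    if (f != NULL) {
        fwrite(utf8, 1, bytes - 1, f); /* 'bytes' counts the NUL terminator; don't write it */
        fclose(f);
    }
    free(utf8);
    return 0;
}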
Upvotes: 4