Kelv

Reputation: 81

C++ Text file won't save in Unicode, it keeps saving in ANSI

So basically, I need to be able to create a text file in Unicode, but whatever I do it keeps saving in ANSI.

Here's my code:

    wchar_t name[] = L"‎中國哲學書電子化計劃";
    FILE * pFile;
    pFile = fopen("chineseLetters.txt", "w");

    fwrite(name, sizeof(wchar_t), sizeof(name), pFile);
    fclose(pFile);

And here is the output of my "chineseLetters.txt":

     -NWòTx[øfû–P[SŠƒR  õ2123

Also, the application is built as MBCS and cannot be changed to Unicode, because it needs to work with both Unicode and ANSI.

I'd really appreciate some help here. Thanks.

Thanks for all the quick replies! It works!

Simply adding L"\uFFFE‎中國哲學書電子化計劃" still didn't work (the text editor still read the file as CP1252), so I did two fwrite calls instead of one: one for the BOM and one for the characters. Here's my code now:

    wchar_t name[] = L"‎中國哲學書電子化計劃";
    unsigned char bom[] = { 0xFF, 0xFE };
    FILE * pFile;
    pFile = fopen("chineseLetters.txt", "w");
    fwrite(bom, sizeof(unsigned char), sizeof(bom), pFile);
    fwrite(name, sizeof(wchar_t), wcslen(name), pFile);
    fclose(pFile);

Upvotes: 5

Views: 8290

Answers (2)

Mr.C64

Reputation: 42974

"Unicode" is a generic term, and you may want to clarify which kind of Unicode encoding you plan to use in your file.

Unicode UTF-8 is a common choice: it's particularly well suited to exchanging text data across different platforms, since it has no concept of "endianness" (so there's none of the little-endian/big-endian confusion you get with UTF-16), and it's widely used across the Internet. But there are other options too, e.g. UTF-16 on Windows, which maps directly to wchar_t strings in Visual C++.

If you are using Visual C++, you can specify a ccs attribute in the second parameter of fopen() (or _wfopen()), choosing your desired encoding, e.g. "ccs=UTF-8" for the UTF-8 encoding.
You can read more details about that on the MSDN documentation of fopen(), e.g.:

fopen supports Unicode file streams. To open a Unicode file, pass a ccs flag that specifies the desired encoding to fopen, as follows.

fp = fopen("newfile.txt", "rt+, ccs= encoding ");

Allowed values of encoding are UNICODE, UTF-8, and UTF-16LE.

The documentation doesn't spell out what UNICODE means here; in practice it selects the CRT's default wide text mode (_O_WTEXT), which reads and writes UTF-16LE. The other two options are clear.


EDIT

I tried with this code, and it works fine in saving the Chinese text using Unicode UTF-8 (I used Visual Studio 2013):

    wchar_t name[] = L"中國哲學書電子化計劃";
    FILE * file = fopen("C:\\TEMP\\ChineseLetters.txt", "wt, ccs=UTF-8");
    if (file == NULL) {
        // ...check for error...
    }

    fwrite(name, sizeof(wchar_t), _countof(name)-1, file);
    fclose(file);

Note that, after I pasted the Chinese text into the source file and saved it, the Visual Studio editor figured out it needed to save the source file in Unicode to avoid losing text information, and showed a dialog box asking for confirmation.
So, consider saving the source file in Unicode if you have some "hard-coded" Unicode text in it (in production-quality Windows/C++ code, you may want to save text in resource files).

Note also that I used _countof() instead of sizeof() in the fwrite() call.
You had:

fwrite(name, sizeof(wchar_t), sizeof(name), file);

but that is wrong, since the third argument should be the count of wchar_t elements, not the total size in bytes (note that in MSVC, sizeof(wchar_t) == 2, i.e. a wchar_t is two bytes, so sizeof(name) is twice the element count).

Moreover, you have to subtract 1 from the total buffer length in wchar_ts, since you don't want to write the terminating NUL wchar_t of the string buffer to the file.
(For a UTF-16 wchar_t string whose length isn't known at compile time, you can simply use wcslen() to get the count of wchar_ts, excluding the terminating NUL.)
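The three counts involved can be sketched in isolation (a standalone illustration, not code from the question; the byte size depends on the platform's wchar_t, while the element and character counts do not):

```cpp
#include <cstddef>
#include <cwchar>

// Standalone illustration of the distinction drawn above: byte size,
// element count (including the NUL), character count (excluding it).
struct WideCounts {
    std::size_t bytes;        // sizeof the array
    std::size_t elements;     // sizeof / sizeof(wchar_t), includes the NUL
    std::size_t characters;   // wcslen, excludes the NUL
};

WideCounts count_abc() {
    wchar_t name[] = L"abc";
    return { sizeof name,
             sizeof name / sizeof(wchar_t),   // 4
             wcslen(name) };                  // 3
}
```

fwrite's third argument should be the characters (or elements) figure, never the bytes figure, when the second argument is already sizeof(wchar_t).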

This is how the UTF-8 file written above is correctly opened in Word:

[Screenshot: Chinese text from the UTF-8 file displayed in MS Word]

Upvotes: 3

bobince

Reputation: 536489

I need to be able to create a text file in Unicode

Unicode is not an encoding; do you mean UTF-16LE? That is the two-byte-code-unit encoding Windows x86/x64 uses for internal string storage in memory, and some Windows applications like Notepad misleadingly describe UTF-16LE as "Unicode" in their UI.

fwrite(name, sizeof(wchar_t), sizeof(name), pFile);

You've copied the memory storage of the string directly to a file. If you compile this under Windows/MSVCRT then because the internal storage encoding is UTF-16LE, the file you have produced is encoded as UTF-16LE. If you compile this in other environments you will get different results.

And here is the output of my "chineseLetters.txt": -NWòTx[øfû–P[SŠƒR õ2123

That's what the UTF-16LE-encoded data would look like if you misinterpreted the file as Windows Code Page 1252 (Western European).

If you have loaded the file into a Windows application such as Notepad, it probably doesn't know that the file contains UTF-16LE-encoded data, and so defaults to reading the file using your default locale-specific (ANSI, mbcs) code page as the encoding, resulting in the above mojibake.

When you are making a UTF-16 file you should put a Byte Order Mark character U+FEFF at the start of it to let the consumer know whether it's UTF-16LE or UTF-16BE. This also gives apps like Notepad a hint that the file contains UTF-16 at all, and not ANSI. So you would probably find that writing L"\uFEFF‎中國哲學書電子化計劃" would make the output file display better in Notepad.
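As a concrete sketch of this (the helper name and file name are hypothetical; the bytes are spelled out explicitly so the output is UTF-16LE regardless of the platform's wchar_t size):

```cpp
#include <cstdio>

// Hypothetical helper: write "中國" to a file as UTF-16LE with a BOM.
// Binary mode ("wb") matters on Windows: in text mode, any 0x0A byte
// inside the UTF-16 data would be expanded to 0x0D 0x0A and corrupt
// the file.
bool write_utf16le_sample(const char* path) {
    static const unsigned char bytes[] = {
        0xFF, 0xFE,   // U+FEFF byte order mark, little-endian
        0x2D, 0x4E,   // 中  U+4E2D
        0x0B, 0x57,   // 國  U+570B
    };
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    bool ok = std::fwrite(bytes, 1, sizeof bytes, f) == sizeof bytes;
    std::fclose(f);
    return ok;
}
```

A consumer that sees the leading FF FE bytes knows both that the file is UTF-16 and that it is little-endian.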

But it's probably better to convert the wchar_ts into char bytes in an explicitly stated target encoding (e.g. UTF-8), rather than relying on whatever in-memory storage format the C library happens to use. On Win32 you can do this with the WideCharToMultiByte API, or by opening the file with a ccs flag as described by Mr.C64. If you choose to write a UTF-16LE file with ccs, it will put the BOM in for you.
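To see what such a conversion involves, here is a portable sketch of the wchar_t-to-UTF-8 step. The helper name is hypothetical, not a library function; on Windows you would normally call WideCharToMultiByte instead:

```cpp
#include <string>

// Hypothetical helper: encode a wchar_t string as UTF-8 bytes.
// Handles both 16-bit wchar_t (UTF-16, surrogate pairs combined)
// and 32-bit wchar_t (UTF-32).
std::string to_utf8(const wchar_t* s) {
    std::string out;
    while (*s) {
        unsigned long cp = (unsigned long)*s++;
        // Combine a UTF-16 surrogate pair into one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF &&
            (unsigned long)*s >= 0xDC00 && (unsigned long)*s <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10)
                         + ((unsigned long)*s++ - 0xDC00);
        }
        if (cp < 0x80) {                      // 1 byte: ASCII
            out += (char)cp;
        } else if (cp < 0x800) {              // 2 bytes
            out += (char)(0xC0 | (cp >> 6));
            out += (char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {            // 3 bytes (CJK lives here)
            out += (char)(0xE0 | (cp >> 12));
            out += (char)(0x80 | ((cp >> 6) & 0x3F));
            out += (char)(0x80 | (cp & 0x3F));
        } else {                              // 4 bytes
            out += (char)(0xF0 | (cp >> 18));
            out += (char)(0x80 | ((cp >> 12) & 0x3F));
            out += (char)(0x80 | ((cp >> 6) & 0x3F));
            out += (char)(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The resulting char bytes can then be written with plain fwrite on a file opened in binary mode; no BOM is required for UTF-8 (and many tools prefer it absent).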

Upvotes: 4
