Reputation: 1626
I have some UTF-8 strings in memory (this is part of a bigger system) which are basically names of places in European countries. What I'm trying to do is write them to a text file. I'm on my Linux machine (Fedora). So when I write these name strings (char pointers) to file, the file is getting saved in extended ASCII format.
Now I copy this file to my Windows machine where I need to load these names into a MySQL DB. When I open the text file in Notepad++, it again defaults the encoding to ANSI. But I can select UTF-8 as the encoding, and almost all the characters look as expected except the following 3 characters: Ő, ő and ű. They are displayed within the text as &#336;, &#337; and &#369;.
Does anyone have any thoughts on what might be wrong? I know that these are not part of the extended ASCII symbols. But the way I'm writing this to the file is something like:
// create output file stream
std::ofstream fs("sample.txt");
// loop through the UTF-8 formatted string list
if (fs.is_open()) {
    for (int i = 0; i < num_strs; i++) {
        fs << str_names[i]; // unsigned char pointer representing a name in UTF-8 format
        fs << "\n";
    }
}
fs.close();
Everything looks good even with characters like ú and ö and ß. The issue is with the above 3 characters alone. Any thoughts/suggestions/comments on this? Thanks!
As an example, a string like "Gyömrő" shows up as "Gyömr&#337;".
Upvotes: 5
Views: 2165
Reputation: 1123
If, when you open the file in Notepad++ and choose UTF-8, your characters aren't showing up properly, then they are not encoded as UTF-8. You also mention "extended ASCII", which has very little to do with Unicode encodings. My belief is that you are in fact writing your characters in some codepage, for instance "ISO-8859-1".
Try taking a look at the byte count of those trouble strings inside your program: if the byte count is the same as the character count, then you are in fact not encoding them as UTF-8.
Any character that lies outside the 128-character ASCII table will be encoded with at least two bytes in UTF-8.
To properly handle unicode within your C++ application, take a look at ICU: http://site.icu-project.org/
Upvotes: 1
Reputation: 153802
The default std::codecvt<char, char, mbstate_t> doesn't do you any good: it is defined to do no conversion at all. You'd need to imbue() a std::locale with a UTF-8 aware code conversion facet. That said, char can't really represent Unicode values; you'd need a bigger type, although the values you are looking at actually do fit into a char in Unicode, but not in any encoding which allows for all values.
The C++ 2011 standard defines a UTF-8 conversion facet, std::codecvt_utf8<...>. However, it isn't specialized for the internal type char but only for wchar_t, uint16_t, and uint32_t. Using clang together with libc++, I could get the following to do the right thing:
#include <fstream>
#include <locale>
#include <codecvt>

int main()
{
    std::wofstream out("utf8.txt");
    std::locale utf8(std::locale(), new std::codecvt_utf8<wchar_t>());
    out.imbue(utf8);
    out << L"\xd6\xf6\xfc\n";
    out << L"Ööü\n";
}
Note that this code uses wchar_t rather than char. It might look reasonable to use char16_t or char32_t because these are meant to be UCS-2 and UCS-4 encoded, respectively (if I understand the standard correctly), but there are no stream types defined for them. Setting up stream types for a new character type is somewhat of a pain.
Upvotes: -1
Reputation: 466
You need to identify at which stage the unexpected &#336; HTML entities are introduced. My best guess is that they are already in the string you are writing to the file. Use a debugger, or add testing code that counts the &s in the string.
That would mean your source of information does not strictly use UTF-8 for non-ASCII characters, but occasionally uses HTML entities. This is odd, but possible if your data source is an HTML file (or something like that).
Also, you might want to look at your output file in hex mode. (There's a nice plugin for Notepad++.) This will hopefully help you understand what UTF-8 really means at the byte level: the 128 ASCII symbols use one byte with a value of 0-127. Other symbols use 2 to 4 bytes, where the first byte must be >127. HTML entities are not really an encoding, more an escape sequence like '\n' or '\r'.
Upvotes: 3