Tebe

Reputation: 3214

C++ encodings: creating a file with a correctly displayed name

All I want is to create a file whose name is displayed correctly on both Linux and Windows.

On Linux this code works fine, and I think that is because UTF-8 is handled properly there.

On Windows there are some problems. I have two languages installed, English and Russian. If I use the System encoding in my programming environment (Qt Creator), the created file gets an almost correct name, but unfortunately letters from other languages such as German and French (and I suspect Japanese and Chinese are no exception) can't be used in the file name; as one can see, they are truncated. So this is a bad approach, because names can come from any language.

That is, I wanted the name to look like this:
string s="тдöüлотдFILE";

But instead it looks like this:

(screenshot not available)

I changed the encoding in Qt Creator to UTF-8 in the hope that it would work correctly.

But now I get this:

string s="тдöüлотдFILE"; - expected name

Actual name: (screenshot not available). It looks even worse.

I tried changing the encoding in Qt Creator to UTF-16 (I heard Windows uses it), but then the compiler refuses to compile the code in this encoding (the same happens with UTF-16LE, UTF-16BE and UTF-32).

The whole situation: (screenshot not available)

I suspect the problem lies in how Windows interprets names. But how can I tell it to display the name as it ought to be, while the same code keeps working on Linux?

Upvotes: 0

Views: 219

Answers (2)

Christian Stieber

Reputation: 12496

Well, this doesn't describe how to fix it, but I "need" more than 500 chars :-)

Before I try to explain (in a confusing way...) what the problem is that you are looking at: you might want to try to conditionalize the filename for the platforms (the code uses the macros that compilers commonly predefine, __linux__ and _WIN32; adjust them if your toolchain differs):

#if defined(__linux__)
// Linux: narrow literal, normally stored as UTF-8
const char* Filename="тдöüлотдFILE";
#elif defined(_WIN32)
// Windows: wide literal (a wchar_t filename for fstream is a Microsoft extension)
const wchar_t* Filename=L"тдöüлотдFILE";
#endif

fstream f(Filename,...);

This still requires that your source code is in whatever encoding your compiler expects. If that happens to be the system code page, you might never be able to get these characters into a string literal. However, if the wchar_t version works, you can also construct the filename from the integer codes of the characters: less readable, but independent of the source file encoding.
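To illustrate the "integer codes" idea, here is a minimal sketch. The helper names wide_from_code_points and example_filename are made up for this example, and it assumes every character is in the Basic Multilingual Plane, so that each one fits into a single wchar_t even on Windows, where wchar_t is 16 bits wide.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: builds a wide string from Unicode code points.
// Assumption: all code points are in the BMP and are not surrogates,
// so each one maps to exactly one wchar_t on Windows and on Linux.
std::wstring wide_from_code_points(const unsigned* codes, std::size_t count) {
    std::wstring out;
    for (std::size_t i = 0; i < count; ++i) {
        assert(codes[i] <= 0xFFFFu);
        assert(codes[i] < 0xD800u || codes[i] > 0xDFFFu);
        out.push_back(static_cast<wchar_t>(codes[i]));
    }
    return out;
}

// The filename from the question, written without any non-ASCII
// characters in the source file.
std::wstring example_filename() {
    static const unsigned codes[] = {
        0x0442, 0x0434,      // Cyrillic te, de
        0x00F6, 0x00FC,      // o and u with diaeresis
        0x043B, 0x043E,      // Cyrillic el, o
        0x0442, 0x0434,      // Cyrillic te, de
        'F', 'I', 'L', 'E'
    };
    return wide_from_code_points(codes, sizeof(codes) / sizeof(codes[0]));
}
```

The same string can also be written with universal character names, e.g. L"\u0442\u0434\u00F6\u00FC\u043B\u043E\u0442\u0434FILE", which likewise keeps the source file pure ASCII.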

The problem you are dealing with is quite complex, and might be impossible to solve in an easy way.

Windows has used UTF-16 internally since Windows 2000 (earlier NT versions used UCS-2; 9x and 3.x used code pages). Linux users have pretty much moved to UTF-8, although there are still developers who haven't heard about that. But it's improving.

Now, while UTF-8 has a codepage value, it can't actually be a system codepage. The codepage value is just for the functions that convert between codepages and UTF-16, but each system still has a legacy-codepage that is NOT UTF-8. The legacy or "ANSI" API on Windows takes strings encoded in the system codepage, whereas the Unicode API takes them in UTF-16. There is no other option.

So, obviously, Windows programs like to use UTF-16. Linux, however, doesn't like it very much at all and prefers UTF-8. I use a framework of my own to help smooth over such differences (and other things, of course) between Windows, Linux and MacOS; existing frameworks such as Qt do it too. Without such help, the safest option is to stick to string literals in ASCII.

Your IDE setting can only affect how the source code is stored; it can't affect how the runtime treats literals, or what APIs are eventually used by the runtime.

You can try to cook something up, such as using Microsoft's "TCHAR" setup, which was meant to allow programs to be compiled as "ANSI" (no, I have no idea why they chose that name) or as Unicode with a simple switch. I'm not particularly familiar with or interested in it, but it defines types (such as TCHAR for a single char) and macros for string literals, and it maps Windows API functions appropriately (calls to 'CreateFile' turn into calls to CreateFileW or CreateFileA). One option that comes to mind is to compile everything as Unicode for Windows, and to typedef/define the appropriate names for Linux so that they produce the "char"-based variant of the code. You might also have to use std::basic_string with that character type instead of std::string.
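A minimal sketch of such a TCHAR-like setup might look as follows. The names native_char, NATIVE_TEXT, native_string and native_from_ascii are invented for this illustration and are not part of any Microsoft or standard header; the sketch only covers the types and literals, not the mapping of API calls.

```cpp
#include <cassert>
#include <string>

#if defined(_WIN32)
// Windows build: wide characters, literals get the L prefix.
typedef wchar_t native_char;
#define NATIVE_TEXT(s) L##s
#else
// Linux build: plain char, literals stay as they are.
typedef char native_char;
#define NATIVE_TEXT(s) s
#endif

// String type that follows the platform's character type.
typedef std::basic_string<native_char> native_string;

// Hypothetical helper: converts a pure-ASCII std::string to the native
// string type. ASCII maps one-to-one onto both char and wchar_t.
native_string native_from_ascii(const std::string& ascii) {
    native_string out;
    for (std::string::size_type i = 0; i < ascii.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(ascii[i]);
        assert(c < 0x80);  // reject anything that is not ASCII
        out.push_back(static_cast<native_char>(c));
    }
    return out;
}
```

Code written against native_string and NATIVE_TEXT then compiles to the wchar_t variant on Windows and to the char variant elsewhere, for example native_from_ascii("data") + NATIVE_TEXT(".txt").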

As a side note, Visual C++ 2012, to my knowledge, accepts source code in UTF-8 and UTF-16. I do not know, however, what it puts into "char*" literals (in my code, I only allow ASCII in such literals to be on the safe side. 'Obscure' characters come from string files anyway; I only need literals for filenames, registry keys, internal keys etc.).

Upvotes: 2

BigBoss

Reputation: 6914

As a general rule, it is not a good idea to write Unicode (non-ASCII) text in narrow ("ANSI") string literals. These strings use one-byte chars, so they can't represent Unicode characters directly, and your compiler will either encode them as UTF-8 (the default in most POSIX compilers, since UTF-8 is the native encoding of the OS; but remember that this depends on the compiler, not on the C++ standard) or use the default encoding of the system (on Windows this is configurable in the Control Panel, so your code may work on one system and fail on another). The correct way is to use C++ wide string literals such as L"тдöüлотдFILE"; in that case the compiler emits the Unicode representation of your string, which works on all machines with all settings.

The remaining problem is that file systems on POSIX work with UTF-8, while on Windows they work with UTF-16. If using Boost is an option, you can use the lovely boost::filesystem::path, which does everything for you; otherwise you can implement it yourself using conditional compilation for Windows and POSIX.
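A hand-rolled version of the conditional-compilation approach could look roughly like this. It is only a sketch: open_utf8 and utf8_example_name are hypothetical names, the filename is assumed to be held as UTF-8 bytes everywhere in the program, and the Windows branch relies on the Win32 function MultiByteToWideChar and the Microsoft CRT function _wfopen.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>
#if defined(_WIN32)
#include <wchar.h>
#include <windows.h>
#endif

// Hypothetical helper: opens a file whose name is given as UTF-8 bytes.
std::FILE* open_utf8(const std::string& utf8_name, const char* mode) {
    assert(mode != NULL);
#if defined(_WIN32)
    // Ask for the required length (in wchar_t units, including the
    // terminating zero), then convert the name from UTF-8 to UTF-16.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8_name.c_str(), -1, NULL, 0);
    if (len <= 0) return NULL;
    std::wstring wide_name(static_cast<std::size_t>(len), L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8_name.c_str(), -1, &wide_name[0], len);
    // The mode string is plain ASCII, so widening it char by char is safe.
    std::wstring wide_mode(mode, mode + std::strlen(mode));
    return _wfopen(wide_name.c_str(), wide_mode.c_str());
#else
    // POSIX: file names are byte strings, conventionally UTF-8.
    return std::fopen(utf8_name.c_str(), mode);
#endif
}

// The filename from the question as explicit UTF-8 bytes, so the result
// does not depend on the encoding of this source file. The literal is
// split before "FILE" because 'F' would otherwise extend the hex escape.
std::string utf8_example_name() {
    return "\xD1\x82\xD0\xB4\xC3\xB6\xC3\xBC\xD0\xBB\xD0\xBE\xD1\x82\xD0\xB4" "FILE";
}
```

Keeping UTF-8 inside the program and converting only at the API boundary means the Linux code path needs no conversion at all; the conversion cost is paid on Windows only.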

Upvotes: 0
