user0000001

Reputation: 2233

Determining best practice when using wide characters with non-compatible char APIs

Alright, so I've recently dipped back into C++. It's been 13 years since I've even looked at any C/C++ code.

I am designing a piece of software for Windows, and what I am struggling with is integrating third-party code (such as libssh2) that is strictly UTF-8 and provides no wide-character API. Coming back to Windows, every API I've seen uses UTF-16 (wchar_t).

So my question is: Am I forced to do string conversions every time I use a non-Windows library (libssh2, for example)? I have a variable that is returned as a wchar_t, but the libssh2 APIs only provide a char implementation.

Should I stick to using char rather than wchar_t? If I do that, then I am forced once again to convert to wchar_t to use the Windows API. I am using several third-party sources and several Windows APIs in my code. My head hurts.

What is the best practice here?

Upvotes: 1

Views: 333

Answers (2)

IInspectable

Reputation: 51506

What is the best practice here?

You know the answer already. If an API requires character strings with a particular encoding, you must supply character strings with that character encoding.

If you are dealing with several APIs that expect strings in different character encodings, you have to convert between the encodings.

Windows uses UTF-16 throughout (with very few exceptions). To convert from UTF-8 to UTF-16, call MultiByteToWideChar; to convert from UTF-16 back to UTF-8, call WideCharToMultiByte.
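A minimal sketch of a pair of conversion helpers built on those two functions. The function names (utf8_to_wide, wide_to_utf8) are my own; on non-Windows builds, a deprecated-but-standard std::wstring_convert stand-in is used purely so the sketch stays self-contained:

```cpp
#include <string>
#ifdef _WIN32
#include <windows.h>
#else
#include <codecvt>
#include <locale>
#endif

// Convert a UTF-8 std::string to a std::wstring (UTF-16 on Windows).
std::wstring utf8_to_wide(const std::string& utf8) {
#ifdef _WIN32
    if (utf8.empty()) return std::wstring();
    // First call computes the required length; second call converts.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                  (int)utf8.size(), nullptr, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(),
                        &wide[0], len);
    return wide;
#else
    // Portable stand-in for non-Windows builds (deprecated since C++17,
    // adequate for illustration only).
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.from_bytes(utf8);
#endif
}

// Convert a std::wstring (UTF-16 on Windows) back to a UTF-8 std::string.
std::string wide_to_utf8(const std::wstring& wide) {
#ifdef _WIN32
    if (wide.empty()) return std::string();
    int len = WideCharToMultiByte(CP_UTF8, 0, wide.data(),
                                  (int)wide.size(), nullptr, 0,
                                  nullptr, nullptr);
    std::string utf8(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.data(), (int)wide.size(),
                        &utf8[0], len, nullptr, nullptr);
    return utf8;
#else
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.to_bytes(wide);
#endif
}
```

Error handling (checking GetLastError, rejecting invalid input with MB_ERR_INVALID_CHARS) is omitted for brevity; production code should add it.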


If you need to decide what 'native' character encoding to use in your application, you can use the following list to make an educated decision:

  • Frequency: How often does your application call functions that use one encoding or the other? Pick the character encoding your application uses more often.
  • Pattern: Will your application predominantly exhibit patterns where character data is passed to several API functions using the same character encoding? If so, that encoding is a good candidate.
  • Data integrity: One feature of Unicode is that certain abstract characters can be encoded with different code unit sequences. If you need to preserve the exact code unit sequence between calls, keeping the data in its original encoding is the safe choice.
  • Safety: One core difference between char and wchar_t (on Windows) is that wchar_t unambiguously designates UTF-16LE encoded characters, whereas char can be ASCII, ANSI, UTF-8, or some other encoding. If none of the other factors have produced a decision, going with wchar_t/UTF-16 on Windows provides additional safety: it allows the compiler to report an error when (potentially) passing a non-Unicode character string to an API expecting wchar_t/UTF-16.

Upvotes: 2

andlabs

Reputation: 11588

Your best bet is to use the encoding you use most often everywhere, and convert at each endpoint. In this case, it sounds like you want UTF-8 strings everywhere, converting to UTF-16 and back at each Windows API call point (or set of calls, if they're consecutive), since you have far more external calls than Windows API calls. This limits the number of conversions you actually have to do, and should perform reasonably well. If you find that converting like this is too slow, use instrumentation to be sure, and then see if there are other APIs you can use for conversion (refer to Raymond Chen's "Loading the dictionary" sub-series for a good read on the latter, but remember Knuth's maxim on premature optimization).
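The "convert at the boundary" pattern described above can be sketched as follows. This is an illustration, not the poster's code: boundary_call and do_wide_work are hypothetical names, do_wide_work stands in for a consecutive batch of wide-string Windows API calls, and std::wstring_convert (deprecated since C++17) stands in for MultiByteToWideChar/WideCharToMultiByte so the sketch runs anywhere:

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Placeholder for a consecutive run of wide-string API calls
// (SetWindowTextW, CreateFileW, ...).
std::wstring do_wide_work(const std::wstring& w) {
    return w + L"!";
}

// The application keeps UTF-8 everywhere; this wrapper converts exactly
// once on entry and once on exit, no matter how many wide calls it makes.
std::string boundary_call(const std::string& utf8_in) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    std::wstring wide = conv.from_bytes(utf8_in); // convert once, on entry
    wide = do_wide_work(wide);                    // any number of wide calls
    return conv.to_bytes(wide);                   // convert once, on exit
}
```

The point is that conversions scale with the number of boundary crossings, not with the number of API calls inside the boundary.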

Upvotes: 2
