Reputation: 311
I am using gSoap with C++ to send and receive web service calls from Java. The difference between what Java considers a character and what C/C++ considers a character seems to be wreaking havoc, as are the different code pages. This question has a couple of different parts.

I have read that Java stores strings in memory as UTF-16. When I have a Java string being sent to the C++ client, should I assume that it is UTF-8 or UTF-16?

When I receive a string from Java and just insert it into a std::wstring without any sort of conversion, the C++ client uses the Windows-1252 code page. Is the correct function to receive and convert the Unicode string MultiByteToWideChar?

Although the Windows function that I am calling (SetComputerNameExW) is meant to accept Unicode, when I pass in the string that is received from Java via the SOAP request (I specifically re-encode the string as UTF-8 while debugging) and decode it as UTF-8 on the C++ side, after passing the value to SetComputerNameExW the system initiates a reboot but only renames the machine to the first character (i.e., if my string is ThisIsATëst, then the machine will rename to T). Is there a specific Unicode format that has to be used for these Windows API calls?

Any assistance is greatly appreciated! Thanks!
Upvotes: 1
Views: 250
Reputation: 597051
I have read that Java stores strings in memory as UTF-16
It used to, but that is changing. Per JEP 254: Compact Strings, the in-memory storage may soon use ISO-8859-1 instead, but ONLY WHEN it results in a more compact storage than UTF-16 without losing data. However, Java strings expose a public interface that is based on UTF-16, regardless of whether their in-memory storage uses ISO-8859-1 or not. So just pretend they are always UTF-16.
When I have a Java string being sent to the C++ client, should I assume that it is UTF-8 or UTF-16?
You can't assume either encoding. You have to look at the actual SOAP data. SOAP uses XML, and XML can use any character encoding the creator wants, as long as it declares the encoding in the XML prolog (if it is something other than UTF-8, which is the most commonly used encoding in XML). Don't assume, know what you are working with. If you are using a SOAP library, you are limited by whatever encoding it chooses to use for its in-memory strings.
When I receive a string from Java and just insert it into a std::wstring, without any sort of conversion the C++ client uses the Windows-1252 code page.
That is very unlikely, since std::wstring uses UTF-16 on Windows, and Java strings are also UTF-16 (for all intents and purposes). You must be converting your strings incorrectly. Please edit your question to show your actual code.
Is the correct function to receive and convert the Unicode string MultiByteToWideChar?
IF you have an 8-bit ANSI string to begin with (char* or std::string), then yes. But that should not be the case when interacting directly with Java (via JNI/JNA) or with std::wstring. So it makes me wonder if you are using a SOAP implementation on the C++ side that is based on 8-bit ANSI strings instead of 16-bit Unicode strings.
Although the Windows function that I am calling (SetComputerNameExW) is meant to accept Unicode, when I pass in the string that is received from Java via the SOAP request (I specifically re-encode the string as UTF-8 while debugging) and decode it as UTF-8 on the C++ side, after passing the value to SetComputerNameExW the system initiates a reboot
You can't pass a UTF-8 string to SetComputerNameExW(); the code will not even compile unless you are using an invalid typecast to force it. You must pass a UTF-16 string instead.
but only renames the machine to the first character (i.e., if my string is ThisIsATëst, then the machine will rename to T).
That implies something went very wrong with your conversions. Whatever you think you are passing to SetComputerNameExW() is not what is actually being passed; it is not formatted correctly, which is why SetComputerNameExW() only picks up the first character.
But again, this is a situation where you have not shown your actual SOAP data or code, so no one can tell you why the string is not being formatted correctly.
Is there a specific Unicode format that has to be used for these Windows API calls?
The Win32 API only supports two types of strings:
localized ANSI strings
UTF-16 strings
You can't use UTF-8 at all (except in a VERY few cases), so you have to convert any UTF-8 data to one of the other formats (preferably UTF-16, since conversions between UTFs are lossless, and the Windows core is based on UTF-16 anyway).
Upvotes: 2