Ian Dallas

Reputation: 12741

Comparing Unicode Strings in C Returns Different Values Than C#

So I am attempting to write a compare function in C that takes UTF-8 encoded Unicode strings and compares them using the Windows CompareStringEx() function, and I expect it to behave just like .NET's CultureInfo.CompareInfo.Compare().

Now the function I have written in C works some of the time, but not in all cases and I'm trying to figure out why. Here is a case that fails (passes in C#, not in C):

CultureInfo cultureInfo = new CultureInfo("en-US");
CompareOptions compareOptions = CompareOptions.IgnoreCase | CompareOptions.IgnoreKanaType | CompareOptions.IgnoreWidth;

string stringA = "คนอ้วน ๆ";
string stringB = "はじめまして";
//Result is -1 which is expected
int result = cultureInfo.CompareInfo.Compare(stringA, stringB);

And here is what I have written in C. Keep in mind this is supposed to take UTF-8 encoded strings and use the Windows CompareStringEx() function, so a conversion to UTF-16 is necessary.

// Compare flags for the string comparison
#define COMPARE_STRING_FLAGS (NORM_IGNORECASE | NORM_IGNOREKANATYPE | NORM_IGNOREWIDTH)

int CompareStrings(int lenA, const void *strA, int lenB, const void *strB) 
{
    LCID ENGLISH_LCID = MAKELCID(MAKELANGID(LANG_ENGLISH, SUBLANG_ENGLISH_US), SORT_DEFAULT);
    int compareValue = -1;

    // Get the size of the strings as UTF-16 encoded Unicode strings.
    // Note: Passing 0 as the last parameter forces the MultiByteToWideChar function
    // to give us the required buffer size to convert the given string to UTF-16
    int strAWStrBufferSize = MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)strA, lenA, NULL, 0);
    int strBWStrBufferSize = MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)strB, lenB, NULL, 0);

    // Malloc the strings to store the converted UTF-16 values
    LPWSTR utf16StrA = (LPWSTR) GlobalAlloc(GMEM_FIXED, strAWStrBufferSize * sizeof(WCHAR));
    LPWSTR utf16StrB = (LPWSTR) GlobalAlloc(GMEM_FIXED, strBWStrBufferSize * sizeof(WCHAR));

    // Convert the UTF-8 strings (SQLite will pass them as UTF-8 to us) to standard
    // Windows WCHAR (UTF-16/UCS-2) encoding for Unicode so they can be used in the
    // Windows CompareStringEx() function.
    if(strAWStrBufferSize != 0)
    {
        MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)strA, lenA, utf16StrA, strAWStrBufferSize);
    }
    if(strBWStrBufferSize != 0)
    {
        MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)strB, lenB, utf16StrB, strBWStrBufferSize);
    }

    // Compare the strings using the Windows compare function.
    // Note: We subtract 1 from the size since we don't want to include the null termination character
    if(NULL != utf16StrA && NULL != utf16StrB)
    {
        compareValue = CompareStringEx(L"en-US", COMPARE_STRING_FLAGS, utf16StrA, strAWStrBufferSize - 1, utf16StrB, strBWStrBufferSize - 1, NULL, NULL, 0);
    }

    // In the Windows CompareStringEx() function, 0 indicates an error, 1 indicates less than,
    // 2 indicates equal to, and 3 indicates greater than, so subtract 2 to maintain the C convention
    if(compareValue > 0)
    {
        compareValue -= 2;
    }

    // Free the temporary UTF-16 buffers
    if(NULL != utf16StrA) GlobalFree(utf16StrA);
    if(NULL != utf16StrB) GlobalFree(utf16StrB);

    return compareValue;
}

Now if I run the following code, I expect the result to be -1 based on the .NET implementation (see above), but instead I get 1, indicating that strA compares greater than strB:

char strA[50] = "คนอ้วน ๆ";
char strB[50] = "はじめまして";

// Will be 1 when we expect it to be -1
int result = CompareStrings(strlen(strA), strA, strlen(strB), strB);
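One thing worth ruling out is what bytes actually end up in those char arrays, since that depends on how the compiler treats the source file's encoding. A sketch of a way to sidestep that is to spell the strings out as explicit UTF-8 byte escapes; the byte values below are a hand transcription of the two test strings and should be double-checked:

// Explicit UTF-8 byte escapes take the compiler's source code page out of the picture
char strA[50] = "\xE0\xB8\x84\xE0\xB8\x99\xE0\xB8\xAD\xE0\xB9\x89"
                "\xE0\xB8\xA7\xE0\xB8\x99 \xE0\xB9\x86";   // Thai test string
char strB[50] = "\xE3\x81\xAF\xE3\x81\x98\xE3\x82\x81"
                "\xE3\x81\xBE\xE3\x81\x97\xE3\x81\xA6";     // Japanese test string

// Should still be -1 (less than) if the comparison itself behaves like .NET's
int result = CompareStrings((int)strlen(strA), strA, (int)strlen(strB), strB);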

Any ideas on why the results I'm getting are different? I'm using the same LCID/cultureInfo and compareOptions in both implementations and the conversions are successful as far as I can tell.

FYI: This function will be used as a custom collation in SQLite. That's not relevant to the question itself, but it explains why the function signature is the way it is.
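For reference, a minimal sketch of how a function with this signature gets registered as a SQLite collation; the collation name "unicode_nocase" and the thunk are just illustrative, and the wrapper exists because SQLite's callback also takes a leading user-data pointer:

#include <sqlite3.h>

// SQLite's collation callback takes a user-data pointer as its first argument,
// so a thin wrapper adapts CompareStrings() to that shape.
static int CollationThunk(void *pUserData, int lenA, const void *strA,
                          int lenB, const void *strB)
{
    (void)pUserData; // unused
    return CompareStrings(lenA, strA, lenB, strB);
}

// After opening the database handle 'db':
// sqlite3_create_collation(db, "unicode_nocase", SQLITE_UTF8, NULL, CollationThunk);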

UPDATE: I also determined that when running the same code in .NET 4 I would see the behavior I saw in the native code. As a result there was now a discrepancy between .NET versions. See my answer below for the reasons behind this.

Upvotes: 3

Views: 2074

Answers (3)

Ian Dallas

Reputation: 12741

So I ended up figuring out the issue after contacting Microsoft support. Here is what they had to say:

The reason for the issue you are seeing, namely, running CompareInfo.Compare against the same string with the same compare options but getting different return values when run under different versions of the .NET Framework, is that the sorting rules are tied to the Unicode spec, which evolves over time. Historically .NET has snapped data for side by side releases to correspond to the newest version of Windows and the corresponding version of Unicode implemented at that time so 2.0, 3.0 and 3.5 correspond to the version for Windows XP or Server 2003, whereas v4.0 matched the Vista sorting rules. As a result the sorting rules for the various versions of the .NET Framework have changed over time.

This also means that when I ran the native code I was calling sort methods that adhered to the Vista sorting rules, and when I ran in .NET 3.5 I was calling sort methods that used the Windows XP sorting rules. It seems odd to me that the Unicode spec would change in a way that causes such a dramatic difference, but apparently that's the case here. Changing the spec that drastically seems like a fantastic way to break backwards compatibility.
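For anyone trying to pin down which set of rules is in play on the native side, Windows can report the NLS sorting version it will use for comparisons. A minimal sketch, assuming Vista or later (GetNLSVersionEx is the API that exposes this):

#include <windows.h>
#include <stdio.h>

// Query the NLS sorting version used for string comparisons under en-US
// (requires Vista or later, i.e. _WIN32_WINNT >= 0x0600).
void PrintSortVersion(void)
{
    NLSVERSIONINFOEX info = { 0 };
    info.dwNLSVersionInfoSize = sizeof(info);

    if (GetNLSVersionEx(COMPARE_STRING, L"en-US", &info))
    {
        printf("NLS version: 0x%08X, defined version: 0x%08X\n",
               (unsigned)info.dwNLSVersion, (unsigned)info.dwDefinedVersion);
    }
}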

Upvotes: 0

Hans Passant

Reputation: 942000

What you are hoping for here is that your text editor will save the source code file in UTF-8 format, and that the compiler will then pass those bytes straight through into the char[] literals rather than re-encoding them. That's too much to hope for, at least on my compiler:

warning C4566: character represented by universal-character-name '\u0E04' cannot be represented in the current code page (1252)

Fix:

const wchar_t* strA = L"คนอ้วน ๆ";
const wchar_t* strB = L"はじめまして";

And remove the conversion code.
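If that route is taken, the comparison itself might then look something like this sketch; passing -1 as the length arguments tells CompareStringEx the strings are null-terminated:

#include <windows.h>

const wchar_t* strA = L"คนอ้วน ๆ";
const wchar_t* strB = L"はじめまして";

// -1 lengths mean "null-terminated"; 1 = less than, 2 = equal, 3 = greater than, 0 = error
int result = CompareStringEx(L"en-US",
                             NORM_IGNORECASE | NORM_IGNOREKANATYPE | NORM_IGNOREWIDTH,
                             strA, -1, strB, -1, NULL, NULL, 0);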

Upvotes: 2

Jon Skeet

Reputation: 1502056

Well, your code performs several steps here - it's not clear whether it's the compare step which is failing or not.

As a first step, I would write out - in both the .NET code and the C code - the exact UTF-16 code units which you've got in utf16StrA, utf16StrB, stringA and stringB. I wouldn't be at all surprised to find that there's a problem in the input data you're using in the C code.
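A sketch of a small diagnostic helper for the C side (on the .NET side the equivalent is printing each char of the string as a 4-digit hex value):

#include <stdio.h>
#include <windows.h>

// Print each UTF-16 code unit as hex so the converted buffers can be compared
// unit-for-unit against what the .NET strings contain.
static void DumpUtf16(const WCHAR *s, int len)
{
    for (int i = 0; i < len; i++)
    {
        printf("%04X ", (unsigned)s[i]);
    }
    printf("\n");
}

// e.g. DumpUtf16(utf16StrA, strAWStrBufferSize);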

Upvotes: 3
