Reputation: 46549
I thought strcmp was supposed to return a positive number if the first string was larger than the second string. But this program
#include <stdio.h>
#include <string.h>

int main()
{
    char A[] = "A";
    char Aumlaut[] = "Ä";
    printf("%i\n", A[0]);
    printf("%i\n", Aumlaut[0]);
    printf("%i\n", strcmp(A, Aumlaut));
    return 0;
}
prints 65, -61 and -1.
Why? Is there something I'm overlooking?
I thought that saving the file as UTF-8 might influence things, because the Ä consists of 2 chars there. But saving as an 8-bit encoding and making sure that both strings have length 1 doesn't help; the end result is the same.
What am I doing wrong?
Using GCC 4.3 under 32 bit Linux here, in case that matters.
Upvotes: 3
Views: 4422
Reputation: 183858
strcmp and similar comparison functions treat the bytes in the strings as unsigned chars, as specified by the standard in section 7.24.4, point 1 (was 7.21.4 in C99):
The sign of a nonzero value returned by the comparison functions memcmp, strcmp, and strncmp is determined by the sign of the difference between the values of the first pair of characters (both interpreted as unsigned char) that differ in the objects being compared.
(emphasis mine).
The reason is probably that such an interpretation maintains the ordering between code points in the common encodings, while interpreting them as signed chars doesn't.
Upvotes: 1
Reputation: 61202
To do string handling correctly in C when the input character set extends beyond ASCII, you should use the standard library's wide-character facilities for strings and I/O. Your program should be:
#include <wchar.h>
#include <stdio.h>

int main()
{
    wchar_t A[] = L"A";
    wchar_t Aumlaut[] = L"Ä";
    /* Cast to int so that %i matches the argument type on every platform. */
    wprintf(L"%i\n", (int)A[0]);
    wprintf(L"%i\n", (int)Aumlaut[0]);
    wprintf(L"%i\n", wcscmp(A, Aumlaut));
    return 0;
}
and then it will give the correct results (GCC 4.6.3). You don't need a special library.
Upvotes: -1
Reputation: 5917
When saving in an 8-bit encoding that extends ASCII, 'A' == 65, and 'Ä' equals whatever -61 becomes if you consider it to be an unsigned char. Either way, the byte for 'Ä' is strictly positive and greater than 2^7-1; you're just printing it as if it were signed.
If you consider 'Ä' to be an unsigned char (which is how strcmp looks at it), its value is 195 in your charset. Hence, comparing 65 with 195, strcmp correctly reports a negative value (here -1).
Upvotes: 1
Reputation: 5702
Check the strcmp manpage:
The strcmp() function compares the two strings s1 and s2. It returns
an integer less than, equal to, or greater than zero if s1 is found,
respectively, to be less than, to match, or be greater than s2.
Upvotes: 0
Reputation: 3431
strcmp() compares chars as unsigned values. So, your A-with-double-dots isn't char -61, it's char 195 (256 - 61).
Upvotes: 1
Reputation: 29519
strcmp and the other string functions aren't actually UTF-aware. On most POSIX machines, C/C++ char is internally UTF-8, which makes most things "just work" with regard to reading and writing, and provides the option of a library understanding and manipulating the UTF code points. But the default string.h functions are not culture-sensitive and do not know anything about comparing UTF strings. You can look at the source code for strcmp and see for yourself: it's about as naïve an implementation as possible (which means it's also faster than an internationalization-aware compare function).
I just answered this in another question - you need to use a UTF-aware string library such as IBM's excellent ICU - International Components for Unicode.
Upvotes: 2