Reputation: 21318
According to the standard:
The values of the members of the execution character set are implementation-defined.
(ISO/IEC 9899:1999 5.2.1/1)
Further in the standard:
...the value of each character after
0
in the above list of decimal digits shall be one greater than the value of the previous.
(ISO/IEC 9899:1999 5.2.1/3)
It appears that the standard requires that the execution character set includes the 26 uppercase and 26 lowercase letters of the Latin alphabet, but I see no requirement that these characters be ordered in any way. I only see an order stipulation for the decimal digits.
This would seem to imply that, strictly speaking, there is no guarantee that 'a' < 'b'
. Now, the letters of the alphabet are in order in each of ASCII, UTF-8, and EBCDIC. But for ASCII and UTF-8 we have 'A' < 'a'
, while for EBCDIC we have 'a' < 'A'
.
It might be nice to have a function in ctype.h
that compares alphabetic characters portably. Short of this or something similar, it seems to me that one must look in the locale to find the value of CODESET
and proceed accordingly, but this doesn't seem simple.
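For reference, here is a minimal sketch of what "looking in the locale" could involve. It assumes a POSIX system, since `nl_langinfo()` and `CODESET` come from `<langinfo.h>` and are not part of ISO C. The helper name `current_codeset` is hypothetical.

```c
#include <langinfo.h>  /* POSIX, not ISO C: nl_langinfo(), CODESET */
#include <locale.h>

/* Hypothetical helper: returns the codeset name of the user's locale.
 * The string is owned by the library and may be overwritten by later
 * calls to setlocale() or nl_langinfo(). */
static const char *current_codeset(void) {
    setlocale(LC_ALL, "");        /* adopt the environment's locale */
    return nl_langinfo(CODESET);  /* the exact name is platform-specific */
}
```

The caller would still have to map the returned name to a known letter ordering, which is the part that does not seem simple.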
My gut tells me that this is almost never an issue; for most cases alphabetical characters can be handled by converting to lowercase, because for the most commonly used character sets the letters are in order.
The question: given two chars
char c1;
char c2;
is there a simple, portable way to determine if c1
precedes c2
alphabetically? Or do we assume that the lowercase and uppercase characters always occur in sequence, even though this does not appear to be guaranteed by the standard?
To clarify any confusion, I am really just interested in the 52 letters of the Latin alphabet that are guaranteed by the standard to be in the execution character set. I realize that other sets of letters are important, but it seems that we can't even know about the ordering of this small subset of letters.
I think that I need to clarify a bit more. The issue, as I see it, is that we commonly think of the 26 lowercase letters of the Latin alphabet as being ordered. I would like to be able to assert that 'a' comes before 'b', and we have a convenient way of expressing this in code as 'a' < 'b'
, when we give 'a' and 'b' integral values. But the standard gives no assurances that the above code will evaluate as expected. Why not? The standard does guarantee this behavior for the digits 0-9, and this seems sensible. If I want to determine if one letter-char precedes another, say for sorting purposes, and if I want this code to be truly portable, it seems like the standard offers no help. Now I have to rely on the convention that ASCII, UTF-8, EBCDIC, etc. have adopted that 'a' < 'b'
should be true. But this isn't really portable unless the only character sets used rely on this convention; this may be true.
This question originated for me in another question thread: Check if a letter is before or after another letter in C. Here, a few people suggested that you could determine the order of two letters stored in char
s using inequalities. But one commenter pointed out that this behavior is not guaranteed by the standard.
Upvotes: 12
Views: 1102
Reputation: 153358
With C11, code could use _Static_assert()
to ensure, at compile time, that characters have the desired ordering.
An advantage of this approach: since the overwhelming majority of character encodings already meet the desired A-Z requirement, a novel or esoteric platform that uses something different may require coding or customization that is not foreseeable. The best the code can do in that case is fail to compile.
Example use
#include <ctype.h>  // toupper()

// Sample case-insensitive string comparison routine that ensures
// 1) 'A' < 'B' < 'C' < ... < 'Z'
// 2) 'a' < 'b' < 'c' < ... < 'z'
int compare_string_case_insensitive(const void *a, const void *b) {
_Static_assert('A' < 'B', "A-Z order unexpected");
_Static_assert('B' < 'C', "A-Z order unexpected");
_Static_assert('C' < 'D', "A-Z order unexpected");
// Other 21 _Static_assert() omitted for brevity
_Static_assert('Y' < 'Z', "A-Z order unexpected");
_Static_assert('a' < 'b', "a-z order unexpected");
_Static_assert('b' < 'c', "a-z order unexpected");
_Static_assert('c' < 'd', "a-z order unexpected");
// Other 21 _Static_assert() omitted for brevity
_Static_assert('y' < 'z', "a-z order unexpected");
const char *sa = (const char *)a;
const char *sb = (const char *)b;
int cha, chb;
do {
cha = toupper((unsigned char) *sa++);
chb = toupper((unsigned char) *sb++);
} while (cha && cha == chb);
return (cha > chb) - (cha < chb);
}
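Where C11 is unavailable, or where writing out all 50 assertions is unappealing, a runtime check can stand in for them. The following is only a sketch of that alternative, not part of the approach above; the names `strictly_increasing` and `latin_letters_are_ordered` are made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if each character in `letters` has a strictly greater
 * value than the one before it in the execution character set. */
static bool strictly_increasing(const char *letters) {
    if (letters[0] == '\0') {
        return true;  /* an empty sequence is trivially ordered */
    }
    for (size_t i = 1; letters[i] != '\0'; i++) {
        if (letters[i - 1] >= letters[i]) {
            return false;
        }
    }
    return true;
}

/* Hypothetical helper: checks both cases of the basic Latin alphabet. */
static bool latin_letters_are_ordered(void) {
    return strictly_increasing("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
        && strictly_increasing("abcdefghijklmnopqrstuvwxyz");
}
```

A program could call `latin_letters_are_ordered()` once at startup and refuse to run if it returns false. Unlike `_Static_assert()`, this only detects the problem at run time.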
Upvotes: 3
Reputation: 2321
You could probably just make a table that maps the characters the standard guarantees will exist to their ASCII character numbers. E.g.,
#include <limits.h>
static char mytable[] = {
['a'] = 0x61,
['b'] = 0x62,
// ...
['A'] = 0x41,
['B'] = 0x42,
// ...
};
The compiler will map every character in the current character set (which may be any crazy character set) to its ASCII code, and the characters that are not guaranteed to exist will be mapped to zero. Then you can use this table for ordering whenever needed.
As you said,
char c1;
char c2;
Could portably be verified to be alphabetically ordered by checking
(c1 < sizeof(mytable) && c2 < sizeof(mytable) ? mytable[c1] < mytable[c2] : 0)
I've actually used this on a research project that runs on ASCII and EBCDIC for predictable ordering, but it's portable enough to work on any character set. Edit: I've deliberately left the size of the table unspecified, so that the compiler computes the minimum needed, because of the DeathStation 9000, on which a byte might have 32 bits and hence CHAR_MAX
be up to 4294967295 or greater.
Upvotes: 6
Reputation: 153358
For A-Z,a-z
in a case-insensitive manner (and using compound literals):
char ch = foo();
long az_rank = strtol((char []){ch, 0}, NULL, 36);  // needs <stdlib.h>; 'a'/'A' -> 10 ... 'z'/'Z' -> 35
For 2 char
that are known to be A-Z,a-z but may be ASCII or EBCDIC.
int compare2alpha(char c1, char c2) {
int mask = 'A' ^ 'a'; // Only 1 bit is different between upper/lower
return (c1 | mask) - (c2 | mask);
}
Alternatively, if limited to 256 different char
, could use a look-up table that maps the char
to its rank. Of course the table is platform dependent.
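One way to avoid hand-writing a platform-dependent table is to fill it at startup from string literals, which the compiler translates into the execution character set. This is a sketch under the assumption that `char` has a small range (for example 8 bits), so that a table of `UCHAR_MAX + 1` entries is practical. The names `rank_of`, `init_rank_table`, and `compare_alpha_rank` are hypothetical.

```c
#include <limits.h>
#include <string.h>

/* rank_of[c] is 1..26 for letters (either case) and 0 for everything else.
 * Assumes UCHAR_MAX is small enough for this array to be reasonable. */
static unsigned char rank_of[UCHAR_MAX + 1];

/* Must be called once before compare_alpha_rank(). */
static void init_rank_table(void) {
    static const char lower[] = "abcdefghijklmnopqrstuvwxyz";
    static const char upper[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    memset(rank_of, 0, sizeof rank_of);
    for (int i = 0; lower[i] != '\0'; i++) {
        rank_of[(unsigned char)lower[i]] = (unsigned char)(i + 1);
        rank_of[(unsigned char)upper[i]] = (unsigned char)(i + 1);
    }
}

/* Negative, zero, or positive as c1 sorts before, equal to, or after c2,
 * ignoring case. Non-letters all share rank 0. */
static int compare_alpha_rank(char c1, char c2) {
    return rank_of[(unsigned char)c1] - rank_of[(unsigned char)c2];
}
```

Because the ranks come from positions in the literal rather than from character codes, the result does not depend on the letters being contiguous or even ordered in the underlying encoding.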
Upvotes: 4
Reputation: 6404
strcoll is designed for this purpose. Simply set up two strings of one character each. (Normally you want to compare strings, not characters.)
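A minimal sketch of that suggestion follows; the wrapper name `compare_chars_collated` is hypothetical. Note that the result depends on the current `LC_COLLATE` setting. In the default "C" locale the comparison typically reduces to comparing character codes, so a program that wants locale-aware ordering would call `setlocale(LC_COLLATE, "")` first.

```c
#include <string.h>

/* Compares two characters using the current locale's collation order.
 * Returns a negative, zero, or positive value, like strcoll() itself. */
static int compare_chars_collated(char c1, char c2) {
    const char s1[2] = { c1, '\0' };  /* one-character strings */
    const char s2[2] = { c2, '\0' };
    return strcoll(s1, s2);
}
```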
Upvotes: 10
Reputation: 13171
There are historically used codes that don't simply order the alphabet. Baudot, for example, puts vowels before consonants, so 'A' < 'B', but 'U' < 'B' as well.
There are also codes like EBCDIC that are ordered, but with gaps. So in EBCDIC, 'I' < 'J', but 'I' + 1 != 'J'.
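Because of such gaps, code that steps through the alphabet with `c + 1` is not portable either. A sketch of one alternative is to look the character up in a string literal and take its neighbour. The helper names `next_in` and `next_letter` are made up for illustration.

```c
#include <string.h>

/* Returns the character that follows c in `alphabet`,
 * or '\0' if c is the last character or is not present. */
static char next_in(const char *alphabet, char c) {
    const char *p = (c != '\0') ? strchr(alphabet, c) : NULL;
    return p ? p[1] : '\0';
}

/* Returns the next letter in the same case as c,
 * or '\0' for 'z', 'Z', and non-letters. */
static char next_letter(char c) {
    char n = next_in("abcdefghijklmnopqrstuvwxyz", c);
    return n ? n : next_in("ABCDEFGHIJKLMNOPQRSTUVWXYZ", c);
}
```

With this, `next_letter('I')` yields `'J'` regardless of whether the two letters are adjacent in the encoding.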
Upvotes: 6