Igor Liferenko

Reputation: 1569

How to fix locale?

Add the ru_RU.CP1251 locale (on Debian, uncomment ru_RU.CP1251 in /etc/locale.gen and run sudo locale-gen), then compile the following program with gcc -fexec-charset=cp1251 test.c (the source file is in UTF-8). The program prints the bit pattern of the character but neither "lowercase" nor "uppercase". Only the letter 'я' is affected; every other letter is correctly classified as lowercase or uppercase.

#include <locale.h>
#include <ctype.h>
#include <stdio.h>
int main (void)
{
  setlocale(LC_ALL, "ru_RU.CP1251");
  char c = 'я';
  int i;
  char z;
  for (i = 7; i >= 0; i--) {
    z = 1 << i;
    if ((z & c) == z) printf("1"); else printf("0");
  }
  printf("\n");

  if (islower(c))
    printf("lowercase\n");
  if (isupper(c))
    printf("uppercase\n");
  return 0;
}

Why does neither islower() nor isupper() work on the letter я?

Upvotes: 1

Views: 1513

Answers (3)

Igor Liferenko

Reputation: 1569

Jonathan Leffler's first comment on the question is correct. The isxxx() (and iswxxx()) functions are required to accept EOF (WEOF) as an argument (probably to be fool-proof), which is why int was chosen as the argument type. When we pass a value of type char, it is promoted to int with its sign preserved; a character literal such as 'я' gets its value the same way. Because char is signed where this program was built (it is for gcc on x86), the byte 0xFF becomes -1, which by unhappy coincidence is the value of EOF.
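The promotion described above can be seen in isolation. This is only a minimal sketch: the helper names promote_signed and promote_unsigned are hypothetical, and signed char is spelled out so that the result does not depend on whether plain char is signed on your platform.

```c
#include <stdio.h>   /* EOF */

/* Hypothetical helper: the int value a signed char has after promotion. */
static int promote_signed(signed char c)
{
    return c;                   /* sign is preserved: byte 0xFF arrives as -1 */
}

/* Hypothetical helper: the same byte converted to unsigned char first. */
static int promote_unsigned(signed char c)
{
    return (unsigned char)c;    /* always in 0..UCHAR_MAX, never negative */
}
```

With glibc, where EOF is defined as -1, promote_signed(-1) compares equal to EOF, while promote_unsigned(-1) yields 255 and can never be mistaken for it.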

Therefore always cast explicitly when passing a value of type char (or a character literal with code 0xFF) to a function that takes the character as an int argument. Do not count on char being unsigned, because its signedness is implementation-defined. The cast can be written either as (unsigned char) or as (uint8_t), which is less to type (you must include stdint.h).
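As an illustration of casting at the call site, here is a small sketch that upper-cases a string in place. The function name upcase_in_place is hypothetical, and the code assumes a single-byte encoding such as CP1251 or ASCII (it is not suitable for UTF-8 text).

```c
#include <ctype.h>

/* Hypothetical helper: upper-case a NUL-terminated string in place.
 * Each byte is cast to unsigned char before it reaches toupper(),
 * so a byte such as 0xFF is passed as 255 rather than as -1. */
static void upcase_in_place(char *s)
{
    for (; *s != '\0'; s++)
        *s = (char)toupper((unsigned char)*s);
}
```

Which bytes count as letters still depends on the locale selected with setlocale(); the cast only guarantees that the argument is in the range the function is defined for.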

See also https://sourceware.org/bugzilla/show_bug.cgi?id=20792 and Why passing char as parameter to islower() does not work correctly?

Upvotes: 1

david.pfx

Reputation: 10868

The answer is that the lowercase version of that character is encoded in CP1251 as decimal 255, and your implementation's islower() and isupper() do not classify that value as a letter when it is passed as a char (it is then often interpreted as EOF).

You need to track down the source code for the runtime library to see what it does and why.

The solution is to write your own implementations, or wrap the ones you have. Personally, I never use these functions directly because of the many gotchas.
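One possible shape for such wrappers is sketched below. The names safe_islower and safe_isupper are hypothetical; they simply take a char and forward it to the library function as an unsigned char, so that callers cannot pass a negative value by accident.

```c
#include <ctype.h>

/* Hypothetical wrapper: returns 1 if c is lowercase in the current locale. */
static inline int safe_islower(char c)
{
    return islower((unsigned char)c) != 0;
}

/* Hypothetical wrapper: returns 1 if c is uppercase in the current locale. */
static inline int safe_isupper(char c)
{
    return isupper((unsigned char)c) != 0;
}
```

The wrappers also normalize the result to 0 or 1, since the library functions only promise a nonzero value for a match.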

Upvotes: 1

Luis Colorado

Reputation: 12635

Igor, if your file is UTF-8, it makes no sense to use code page 1251, as it has nothing in common with the UTF-8 encoding. Just use the ru_RU.UTF-8 locale and you'll be able to display your file without any problem. Or, if you insist on using ru_RU.CP1251, you'll first need to convert your file from UTF-8 to CP1251 (you can use the iconv(1) utility for that):

iconv --from-code=utf-8 --to-code=cp1251 your_file.txt > your_converted_file.txt

On the other hand, -fexec-charset=cp1251 only affects the characters used in the executable; you have not specified the input charset for the character and string literals in your source code (gcc has the -finput-charset option for that). The compiler is probably determining it from the environment (which you have set in your LANG or LC_CTYPE environment variables).

Only once you control exactly which locale is used at each stage will you get coherent results.

The main reason for the effort to move every country to a common charset (Unicode, usually encoded as UTF-8) is precisely to avoid having to deal with all these locale settings at each stage.

If you always deal with documents encoded in CP1251, you'll need to use that encoding for everything on your computer, and when you receive a document encoded in UTF-8 you'll have to convert it to see it correctly.

My main recommendation is to switch to UTF-8, as it is an encoding that supports every country's character set, but for now that decision is yours alone.

NOTE

On debian linux:

$ sed 's/^/    /' pru-$$.c 
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <locale.h>

#define P(f,v) printf(#f"(%d /* '%c' */) => %d\n", (v), (v), f(v))
#define Q(v) do{P(isupper,(v));P(islower,(v));}while(0)

int main()
{
    setlocale(LC_ALL, "");
    Q(0xff);
}

Compiled with

$ make pru-$$
cc    pru-1342.c   -o pru-1342

execution with ru_RU.CP1251 locale

$ locale | sed 's/^/    /'
LANG=ru_RU.CP1251
LANGUAGE=
LC_CTYPE="ru_RU.CP1251"
LC_NUMERIC="ru_RU.CP1251"
LC_TIME="ru_RU.CP1251"
LC_COLLATE="ru_RU.CP1251"
LC_MONETARY="ru_RU.CP1251"
LC_MESSAGES="ru_RU.CP1251"
LC_PAPER="ru_RU.CP1251"
LC_NAME="ru_RU.CP1251"
LC_ADDRESS="ru_RU.CP1251"
LC_TELEPHONE="ru_RU.CP1251"
LC_MEASUREMENT="ru_RU.CP1251"
LC_IDENTIFICATION="ru_RU.CP1251"
LC_ALL=

$ pru-$$
isupper(255 /* 'я' */) => 0
islower(255 /* 'я' */) => 512

So, glibc is not faulty, the fault is in your code.

Upvotes: 1
