Decimal values of Extended ASCII characters

I wrote a function to test if a string consists only of letters, and it works well:

bool is_all_letters(const char* src) {
  while (*src) {
    // A-Z, a-z
    if ((*src >= 'A' && *src <= 'Z') || (*src >= 'a' && *src <= 'z')) {
      ++src;  // advance to the next character
    }
    else {
      return false;
    }
  }
  return true;
}

My next step was to include “Extended ASCII Codes”. I thought it was going to be really easy, but that’s where I ran into trouble. For example:

std::cout << (unsigned int)'A'; // 65         <-- decimal ascii value
std::cout << (unsigned int)'ñ'; // 4294967281 <-- what?

I thought that the decimal value for ‘ñ’ was going to be 164 as listed on the ASCII chart at www.asciitable.com.

My goal is to restrict user input to only letters in ISO 8859-1 (latin 1). I’ve only worked with single byte characters and would like to avoid multi-byte characters if possible.

I am guessing that I could compare against the unsigned int values above (i.e. 4294967281), but that does not feel right to me. Besides, I don’t know whether that large integer is specific to VC 8.0's representation of 'ñ' and would change from compiler to compiler.

Please advise

UPDATE - Per some suggestions made by Christophe, I ran the following code:

locale loc("spanish") ;
cout<<loc.name() << endl;                   // Spanish_Spain.1252
for (int i = 0; i < 255; i++) {
  cout << i << " " << isalpha(i, loc)<< " " << (isprint(i,loc) ? (char)(i):'?') << endl; 
}

It does return Spanish_Spain.1252 but unfortunately, the loop iterations print the same data as the default C locale (using VC++ 8 / VS 2005).

Christophe shows different (desired) results as you can see in his screen shots below, but he uses a much newer version of VC++.

Upvotes: 2

Answers (4)

Christophe

Reputation: 73376

There is already plenty of information here. However, I'd like to propose some ideas to address your initial problem, namely the categorisation of characters in an extended character set.

For this, I suggest using <locale> (which covers country-specific topics), and especially the locale-aware forms of isalpha(), isspace(), isprint(), and so on.

Here is a little piece of code to help you find out which chars count as letters in your local alphabet:

std::locale::global(std::locale(""));               // make the user's environment locale the global locale
std::cout << std::locale().name() << std::endl;     // display the name of the current locale

std::locale loc;                                    // a copy of the active global locale (you could use another)
for (int i = 0; i < 256; i++) {
    char c = static_cast<char>(i);                  // the locale-aware overloads take a character type
    std::cout << i << " " << std::isalpha(c, loc) << " " << (std::isprint(c, loc) ? c : '?') << std::endl;
}

This prints each character code from 0 to 255, followed by a flag indicating whether it is a letter according to the locale settings, and the character itself if it is printable.

For example, on my PC, I get:
[screenshot of the output, posted as an image because of character encoding differences]

All the accented characters, as well as ñ and the Greek letters, are considered alphabetic, whereas £ and the mathematical symbols are considered printable but non-alphabetic.
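The same locale-aware check can be wrapped into a replacement for the question's function. This is only a minimal sketch: the helper name is_all_letters and its signature are made up for illustration, and which characters count as letters depends entirely on the locale object passed in.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: true if every character of `text` is alphabetic
// according to `loc`. An empty string counts as "all letters", matching
// the behavior of the loop in the question.
bool is_all_letters(const std::string& text, const std::locale& loc) {
    for (char c : text) {
        // The two-argument std::isalpha from <locale> takes a character
        // type, so pass the char itself rather than an int.
        if (!std::isalpha(c, loc)) {
            return false;
        }
    }
    return true;
}
```

With std::locale::classic() only A-Z and a-z are accepted. Passing a locale that uses a Latin-1 or Windows-1252 code page should also accept characters such as ñ, but locale names are platform-specific and, as the update in the question shows, results can differ between library versions.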

Upvotes: 0

dan04

Reputation: 91015

I thought that the decimal value for ‘ñ’ was going to be 164 as listed on the ASCII chart at www.asciitable.com.

Asciitable.com appears to give the code for the old IBM437 DOS character set (still used in the Windows command prompt), in which ñ is indeed 164. But that's just one of hundreds of “extended ASCII” variants.

The value 4294967281 = 0xFFFFFFF1 you got is a sign-extension of the (signed) char value 0xF1, which is how ñ is encoded in ISO-8859-1 and close variants like Windows-1252.

Upvotes: 3

rici

Reputation: 241761

The code chart you found on the internet is actually Windows OEM code page 437, which was never endorsed as a standard. Although it is sometimes called "extended ASCII", that description is highly misleading. (See the Wikipedia article Extended ASCII: "The use of the term is sometimes criticized, because it can be mistakenly interpreted that the ASCII standard has been updated to include more than 128 characters or that the term unambiguously identifies a single encoding, both of which are untrue.")

You can find the history of OEM437 on Wikipedia, in various versions.

What was endorsed as a standard 8-bit encoding is ISO-8859-1, which later became the first 256 code points in Unicode. (It's one of a series of 8-bit encodings designed for use in different parts of the world; ISO-8859-1 is intended for the Americas and Western Europe.) So that's what you will find in most computers produced in this century in those regions, although more and more operating systems are now moving to full Unicode support.

The value you see for (unsigned int)'ñ' is the result of casting the ISO-8859-1 code 0xF1 from a (signed) char (that is, -15) to an unsigned int. Had you cast it to an int, you would have seen -15.
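The two conversions can be compared side by side. The sketch below uses hypothetical function names and assumes a 32-bit unsigned int; it takes a signed char explicitly so the result does not depend on whether plain char is signed on your compiler.

```cpp
// Widening a negative signed char to unsigned int sign-extends it:
// -15 becomes 2^32 - 15 when unsigned int is 32 bits wide.
unsigned int widen_signed(signed char c) {
    return static_cast<unsigned int>(c);
}

// Going through unsigned char first keeps the value in the range 0..255,
// regardless of whether plain char is signed.
unsigned int widen_byte(char c) {
    return static_cast<unsigned char>(c);
}
```

Under those assumptions, widen_signed(-15) gives 4294967281, while widen_byte('\xF1') gives 241, the ISO-8859-1 code for ñ.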

Upvotes: 3

MSalters

Reputation: 179887

To start with, you're trying to reinvent std::isalpha. IIRC you'll need to pass it the ISO-8859-1 locale, though, because by default it only checks ASCII.

The behavior you see is because char is signed: VC++ defaults to signed char, and you didn't compile with /J (which makes char unsigned and is the smart thing to do when you use more than just ASCII).
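If you would rather not depend on locale support or on the /J switch, converting each char to unsigned char before comparing sidesteps the signedness problem. The following is a minimal sketch, not a definitive implementation: the function names are made up, and it assumes that "letter" means A-Z, a-z, and the ISO-8859-1 range 0xC0-0xFF minus the multiplication sign (0xD7) and the division sign (0xF7). It deliberately leaves out ª (0xAA), µ (0xB5) and º (0xBA); add them if you want them to count.

```cpp
// Hypothetical helper: is this byte a letter in ISO-8859-1?
// Assumption: letters are A-Z, a-z, and 0xC0-0xFF except the
// multiplication sign (0xD7) and the division sign (0xF7).
bool is_latin1_letter(unsigned char c) {
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')) {
        return true;
    }
    return c >= 0xC0 && c != 0xD7 && c != 0xF7;
}

// Hypothetical variant of the question's function for Latin-1 input.
bool is_all_latin1_letters(const char* src) {
    for (; *src; ++src) {
        // Convert to unsigned char first so 0xF1 compares as 241, not -15.
        if (!is_latin1_letter(static_cast<unsigned char>(*src))) {
            return false;
        }
    }
    return true;
}
```

This only makes sense when the input really is encoded as ISO-8859-1 (or Windows-1252, which agrees with it in the 0xC0-0xFF range); it will misclassify multi-byte encodings such as UTF-8.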

Upvotes: 2
