Siddharth Joshi

Reputation: 157

Differentiation between integer and character

I have just started learning C++ and have come across its various data types. I also learnt how the computer stores values when the data type is specified. One doubt that occurred to me while learning about the char data type was how did the computer differentiate between integers and characters.

I learnt that the character data type uses 8 bits to store a character and that the computer can store a character in its memory location by following ASCII encoding rules. However, I don't understand how the computer knows whether the byte 01000001 represents the letter 'A' or the integer 65. Is there a special bit assigned for this purpose?

Upvotes: 1

Views: 2975

Answers (5)

jgreve

Reputation: 1253

You have asked a simple yet profound question. :-)

Answers and an example or two are below. (See edit 2, at the bottom, for a longer example that tries to illustrate what happens when you interpret a single memory location's bit patterns in different ways.)

The "profound" aspect of it lies in the astounding variety of character encodings that exist. There are many - I wager more than you believe there could possibly be. :-)

This is a worthwhile read: http://www.joelonsoftware.com/articles/Unicode.html full title: "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"

As for your first question: "how did the computer differentiate between integers and characters": The computer doesn't (for better or worse). The meaning of bit patterns is interpreted by whatever reads them.

Consider this example bit pattern (8 bits, or one byte):

01000001b = 41x = 65d (binary, hex & decimal respectively).

If that bit pattern is based on ASCII it will represent an uppercase A.

If that bit pattern is EBCDIC it will represent a "non-breaking space" character (at least according to the EBCDIC chart at Wikipedia; most of the others I looked at don't say what 65d means in EBCDIC).

(Just for trivia's sake, in EBCDIC, 'A' would be represented with a different bit pattern entirely: C1x or 193d.)

If you read that bit pattern as an integer (perhaps a short), it may indicate you have 65 dollars in a bank account (or euros, or something else). Just as with the character set, the bit pattern has nothing in it to tell you which currency it is.

If that bit pattern is part of a 24-bit pixel encoding for your display (3 bytes for RGB), perhaps the 'blue' byte, it may indicate your pixel is roughly 25% blue (65/255 is about 25.5%); 0% would be black, 100% would be as blue as possible.

So, yeah, there are lots of variations on how bits can be interpreted. It is up to your program to keep track of that. Edit: it is common to add metadata to track that, so if you are dealing with currencies you may have one byte for the currency type and other bytes for the quantity of that currency. The currency type would have to be encoded as well; there are different ways to do that, and it is something a C++ enum attempts to solve in a space-efficient way: http://www.cprogramming.com/tutorial/enum.html
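
A minimal sketch of that metadata idea. The names Currency, Money and describe() are made up for illustration (they are not from any library): one byte tags the currency, and the remaining bytes hold the quantity.

```cpp
#include <cstdint>
#include <string>

// Hypothetical tag: a single byte that says how to read the amount.
enum class Currency : std::uint8_t { Dollar, Euro };

// Hypothetical record: the tag travels together with the raw quantity.
struct Money {
    Currency      type;
    std::uint32_t amount;
};

// The same amount is described differently depending on the tag.
std::string describe(const Money& m) {
    switch (m.type) {
        case Currency::Dollar: return std::to_string(m.amount) + " dollars";
        case Currency::Euro:   return std::to_string(m.amount) + " euros";
    }
    return "unknown currency";
}
```

With this sketch, describe(Money{Currency::Dollar, 65}) and describe(Money{Currency::Euro, 65}) hold the same amount bits, and only the tag decides what they mean.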

As for 8 bits (one byte) per character, that is a fair assumption when you're starting out. But it isn't always true. Lots of languages will use 2+ bytes for each character when you get into Unicode.

However... ASCII is very common and it fits into a single byte (8 bits). If you are handling simple English text (A-Z, 0-9 and so on), that may be enough for you.

Spend some time browsing here and look at ASCII, EBCDIC and others: http://www.lookuptables.com/

If you're running on Linux or something similar, hexdump can be your friend. Try the following:
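
To see the "more than one byte per character" case in practice, here is a small sketch that assumes the text is valid UTF-8 (one common Unicode encoding). The helper name count_code_points is made up for this example. It counts characters by skipping UTF-8 continuation bytes, which always look like 10xxxxxx.

```cpp
#include <cstddef>
#include <string>

// Count the code points in a UTF-8 string by skipping continuation
// bytes, which always have the bit pattern 10xxxxxx.
// Assumes the input is valid UTF-8; no validation is done here.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;   // not a continuation byte
    }
    return n;
}
```

For the five-byte string "caf\xC3\xA9" (the word café in UTF-8), size() reports 5 while count_code_points() reports 4.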

$ hexdump -C myfile.dat 

Whatever operating system you're using, you will want to find a hexdump utility you can use to see what is really in your data files.

You mentioned C++; I think it would be an interesting exercise to write a byte-dumper utility: just a short program that takes a void* pointer and a byte count, then prints out that many bytes' worth of values.
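
One possible shape for such a byte dumper, as a rough sketch rather than a finished tool. The name dump_bytes and the choice to return a string (instead of printing directly) are my own assumptions.

```cpp
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Format `count` bytes starting at `data` as two-digit hex values
// separated by spaces, in the order they sit in memory.
std::string dump_bytes(const void* data, std::size_t count) {
    const unsigned char* bytes = static_cast<const unsigned char*>(data);
    std::ostringstream out;
    out << std::hex << std::setfill('0');
    for (std::size_t i = 0; i < count; ++i) {
        if (i != 0) out << ' ';
        // Widen to int so the stream prints a number, not a character.
        out << std::setw(2) << static_cast<int>(bytes[i]);
    }
    return out.str();
}
```

Given any variable x, std::cout << dump_bytes(&x, sizeof x) shows the raw bytes behind it, whatever type x was declared with.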

Good luck with your studies! :-)

Edit 2: I added a small research program... I don't know how to illustrate the idea more concisely (it seems easier in C than C++). Anyway...

In this example program, I have two character pointers that are referencing memory used by an integer. The actual code (see 'example program', way below) is messier with casting, but this illustrates the basic idea:

unsigned short a;  // reserve 2 bytes of memory to store our 'unsigned short' integer.
char *c1 = &a;     // point to first byte at a's memory location.
char *c2 = c1 + 1; // point to next byte at a's memory location.

Note how 'c1' and 'c2' both share the memory that is also used by 'a'.

Walking through the output...
The sizeof lines tell you how many bytes each type uses. The lines that look like

===== Message Here =====

are comments printed out by the dump() function.

The important thing about the dump() function is that it is using the bit patterns in the memory location for 'a'. dump() doesn't change those bit patterns, it just retrieves them and displays them via cout.

In the first run, before calling dump(), I assign the following bit pattern to a: a = (0x41<<8) + 0x42; This left-shifts 0x41 by 8 bits and adds 0x42 to it. The resulting bit pattern is 0x4142 (which is 16706 decimal, or 01000001 01000010 binary). One of the bytes will hold 0x41, the other will hold 0x42. Next it calls the dump() function:

dump( "In ASCII, 0x41 is 'A' and 0x42 is 'B'" );

Note that in this run, on my VirtualBox Ubuntu machine, the address of a was 0x6021b8, which nicely matches the expected addresses held by both c1 and c2.

Then I modify the bit pattern in 'a':

a += 1; dump(); // why did this find a 'C' instead of 'B'?

a += 5;  dump(); // why did this find an 'H' instead of 'C' ?

As you dig deeper into C++ (and maybe C) you will want to be able to draw memory maps like this (more or less):

=== begin memory map ===

                   +-------+-------+
unsigned short   a : byte0 : byte1 :                  holds 2 bytes worth of bit patterns.
                   +-------+-------+-------+-------+
char *          c1 : byte0 : byte1 : byte2 : byte3 :  holds address of a
                   +-------+-------+-------+-------+
char *          c2 : byte0 : byte1 : byte2 : byte3 :  holds address of a + 1
                   +-------+-------+-------+-------+

=== end memory map ===

Here is what it looks like when it runs; I encourage you to walk through the C++ code in one window and tie each piece of output back to the C++ expression that generated it.

Note how sometimes we do simple math to add a number to a (e.g. "a +=1" followed by "a += 5"). Note the impact that has on the characters that dump() extracts from memory location 'a'.

=== begin run ===

$ clear; g++ memfun.cpp
$ ./a.out
sizeof char =1, unsigned char =1
sizeof short=2, unsigned short=2
sizeof int  =4, unsigned int  =4
sizeof long =8, unsigned long =8
===== In ASCII, 0x41 is 'A' and 0x42 is 'B' =====
a=16706(dec), 0x4142 (address of a: 0x6021b8)
c1=0x6021b8 (should be the same as 'address of a')
c2=0x6021b9 (should be just 1 more than 'address of a')
c1=B
c2=A
in hex, c1=42
in hex, c2=41
===== after a+= 1 =====
a=16707(dec), 0x4143 (address of a: 0x6021b8)
c1=0x6021b8 (should be the same as 'address of a')
c2=0x6021b9 (should be just 1 more than 'address of a')
c1=C
c2=A
in hex, c1=43
in hex, c2=41
===== after a+= 5 =====
a=16712(dec), 0x4148 (address of a: 0x6021b8)
c1=0x6021b8 (should be the same as 'address of a')
c2=0x6021b9 (should be just 1 more than 'address of a')
c1=H
c2=A
in hex, c1=48
in hex, c2=41
===== In ASCII, 0x58 is 'X' and 0x42 is 'Y' =====
a=22617(dec), 0x5859 (address of a: 0x6021b8)
c1=0x6021b8 (should be the same as 'address of a')
c2=0x6021b9 (should be just 1 more than 'address of a')
c1=Y
c2=X
in hex, c1=59
in hex, c2=58
===== In ASCII, 0x59 is 'Y' and 0x5A is 'Z' =====
a=22874(dec), 0x595a (address of a: 0x6021b8)
c1=0x6021b8 (should be the same as 'address of a')
c2=0x6021b9 (should be just 1 more than 'address of a')
c1=Z
c2=Y
in hex, c1=5a
in hex, c2=59
Done.
$ 

=== end run ===

=== begin example program ===

#include <iostream>
#include <string>
using namespace std;

// define some global variables
unsigned short a; // declare 2 bytes in memory, as per sizeof()s below.
char *c1 = (char *)&a; // point c1 to start of memory belonging to a (1st byte).
char * c2 = c1 + 1; // point c2 to next piece of memory belonging to a (2nd byte).

void dump(const char *msg) {
   // so the important thing about dump() is that
   // we are working with bit patterns in memory we
   // do not own, and it is memory we did not set (at least
   // not here in dump(), the caller is manipulating the bit
   // patterns for the 2 bytes in location 'a').
   cout << "===== " << msg << " =====\n";
   cout << "a=" << dec << a << "(dec), 0x" << hex << a << dec << " (address of a: " << &a << ")\n";
   cout << "c1=" << (void *)c1 << " (should be the same as 'address of a')\n";
   cout << "c2=" << (void *)c2 << " (should be just 1 more than 'address of a')\n";
   cout << "c1=" << (char)(*c1) << "\n";
   cout << "c2=" << (char)(*c2) << "\n";
   cout << "in hex, c1=" << hex << ((int)(*c1)) << dec << "\n";
   cout << "in hex, c2=" << hex << (int)(*c2) << dec << "\n";
}

int main() {
   cout << "sizeof char =" << sizeof( char  ) << ", unsigned char =" << sizeof( unsigned char  ) << "\n";
   cout << "sizeof short=" << sizeof( short ) << ", unsigned short=" << sizeof( unsigned short ) << "\n";
   cout << "sizeof int  =" << sizeof( int   ) << ", unsigned int  =" << sizeof( unsigned int   ) << "\n";
   cout << "sizeof long =" << sizeof( long  ) << ", unsigned long =" << sizeof( unsigned long  ) << "\n";

   // this logic changes the bit pattern in a then calls dump() to interpret that bit pattern.
   a = (0x41<<8) + 0x42; dump( "In ASCII, 0x41 is 'A' and 0x42 is 'B'" );
   a+= 1;                dump( "after a+= 1" );
   a+= 5;                dump( "after a+= 5" );
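   // note: the label says 0x42, but 'Y' is really 0x59; it is left as-is so it matches the run output.
   a = (0x58<<8) + 0x59; dump( "In ASCII, 0x58 is 'X' and 0x42 is 'Y'" );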
   a = (0x59<<8) + 0x5A; dump( "In ASCII, 0x59 is 'Y' and 0x5A is 'Z'" );

   cout << "Done.\n";

}

=== end example program ===
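
As for why the run prints c1=B and c2=A even though 0x41 was shifted into the high byte: that is what you would expect on a little-endian machine, which stores the low-order byte at the lowest address. Here is a small sketch (not part of the program above; the helper names are made up, and it assumes a 2-byte unsigned short) that checks the byte order of whatever machine it runs on.

```cpp
#include <cstring>

// Return the byte stored at the lowest address of `value`.
unsigned char first_byte(unsigned short value) {
    unsigned char b = 0;
    std::memcpy(&b, &value, 1);   // copy only the first byte in memory
    return b;
}

// True when the low-order byte comes first in memory (little-endian).
// Uses the same 0x4142 pattern as the run above.
bool is_little_endian() {
    return first_byte(0x4142) == 0x42;
}
```

On a little-endian machine, adding 1 to a changes the byte that c1 points to, which is why the 'B' became a 'C' and later an 'H' while c2 stayed at 'A'.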

Upvotes: 1

Pushpendre

Reputation: 11

The computer itself does not remember or set any bits to distinguish chars from ints. Instead it's the compiler which maintains that information and generates proper machine code which operates on data appropriately.

You can even override and 'mislead' the compiler if you want. For example, you can cast a char pointer to a void pointer and then to an int pointer, and then try to read the referenced location as an int. In C++ this kind of reinterpretation is spelled reinterpret_cast (not dynamic_cast, which is meant for class hierarchies). If an actual bit were used to mark the type, such operations would not be possible.

Adding more details in response to a comment: really, the question to ask is who will retrieve the values. Imagine that you write the contents of memory to a file and send it over the Internet. If the receiver "knows" that it is receiving chars, then there is no need to encode the identity of chars. But if the receiver could receive either chars or ints, then it would need identifying bits. In the same way, when you compile a program and the compiler knows what is stored where, there is no need to 'figure out' anything, since it is already known. How a value is encoded as bits is decided by standards, for example ASCII for characters and IEEE 754 for floating-point numbers.
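
A hedged sketch of reading the same bits under a different type. Note that it uses std::memcpy rather than the pointer cast described above, because reading through a mismatched pointer is formally undefined behavior in C++. The helper name bits_of is made up, and the sketch assumes 32-bit IEEE 754 floats (checked by the static_asserts).

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "sketch assumes a 32-bit float");
static_assert(std::numeric_limits<float>::is_iec559,
              "sketch assumes IEEE 754 floats");

// Copy the bits of a float into an integer of the same size.
// No bit is converted; only the type used to read them changes.
std::uint32_t bits_of(float f) {
    std::uint32_t u = 0;
    std::memcpy(&u, &f, sizeof u);
    return u;
}
```

Under those assumptions bits_of(1.0f) is 0x3F800000: nothing in those 32 bits says "float", the program simply chose to read them as an integer.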

Upvotes: 1

shakhawat

Reputation: 2727

When we do

int a = 65;

or

char ch = 'A';

and then inspect the memory, we will see the same bit pattern, 01000001, in both cases (for the int, in its low-order byte).

At the application layer we choose whether to treat the value as a character or as an integer:

printf("%d", ch);

will print 65.
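
The C++ stream equivalent, as a small sketch (show_both is a made-up helper, and the result shown assumes an ASCII-compatible character set): the stored value never changes, only the type handed to operator<< does.

```cpp
#include <sstream>
#include <string>

// Format the same stored value twice: once as a character, once as a number.
// Which form is printed depends only on the type handed to operator<<.
std::string show_both(char ch) {
    std::ostringstream out;
    out << ch << ' ' << static_cast<int>(ch);
    return out.str();
}
```

On an ASCII system show_both('A') yields "A 65".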

Upvotes: 2

PandaSN

Reputation: 13

int is an integer, a number that has no digits after the decimal point. It can be positive or negative. Internally, integers are stored as binary numbers. On most computers, integers are 32-bit binary numbers, but this size can vary from one computer to another. When calculations are done with integers, anything after the decimal point is lost. So if you divide 2 by 3, the result is 0, not 0.6666.

char is a data type that is intended for holding characters, as in alphanumeric strings. Whether plain char is signed or unsigned depends on the compiler, even though most character data for which it is used is unsigned. In C++ a char is always exactly one byte (sizeof(char) is 1); that byte is eight bits on nearly every current machine, although the standard only guarantees at least eight. The plot thickens considerably on machines that support wide characters (e.g., Unicode) or multiple-byte encoding schemes for strings. But in general char is one byte.
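
These properties can be checked on your own compiler. A minimal sketch (the names plain_char_is_signed and two_thirds are made up for this example):

```cpp
#include <climits>
#include <limits>

// sizeof(char) is 1 by definition; what may vary is the width of a byte.
static_assert(sizeof(char) == 1, "guaranteed by the C++ standard");
static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");

// Whether plain char is signed is left to the implementation.
constexpr bool plain_char_is_signed() {
    return std::numeric_limits<char>::is_signed;
}

// Integer division discards the fractional part.
constexpr int two_thirds = 2 / 3;   // 0, not 0.6666
```

If either static_assert failed, the compiler would refuse to build the file; the signedness of char, by contrast, can legitimately differ between platforms.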

Upvotes: 0

BASEER ULHASSAN

Reputation: 461

Characters are represented as integers inside the computer. Hence the data type "char" is simply a small integer type, and on typical systems every value it can hold also fits in an "int".

Refer to the following page; it should clear up any remaining ambiguity: Data Types Detail

Upvotes: 1
