Mike Perrenoud

Reputation: 67898

Reading a String as an int pointer

Alright, so this all started with my interest in hash codes. After doing some reading from a Jon Skeet post I asked this question. That got me really interested in pointer arithmetic, something I have almost no experience in. So, after reading through this page, I began experimenting with the rudimentary understanding I got from there and from my other fantastic peers here on SO!

Now I'm doing some more experimenting, and I believe I've accurately duplicated the hash code loop that's in the string implementation below (I reserve the right to be wrong about that):

Console.WriteLine("Iterating STRING (2) as INT ({0})", sizeof(int));
Console.WriteLine();

var val = "Hello World!";
unsafe
{
    fixed (char* src = val)
    {
        var ptr = (int*)src;
        var len = val.Length;
        while (len > 2)
        {
            Console.WriteLine((char)*ptr);
            Console.WriteLine((char)ptr[1]);

            ptr += 2;
            len -= sizeof(int);
        }

        if (len > 0)
        {
            Console.WriteLine((char)*ptr);
        }
    }
}

But, the results are a bit perplexing to me; kind of. Here are the results:

Iterating STRING (2) as INT (4)

H
l
o
W
r
d

I thought, originally, the value at ptr[1] would be the second letter that is read (or squished together) with the first. However, it's clearly not. Is that because ptr[1] is technically byte 4 on the first iteration and byte 12 on the second iteration?

Upvotes: 4

Views: 446

Answers (6)

Austin Salonen

Reputation: 50225

@Simon Whitehead's answer is a great explanation.

Breaking the values down to their bytes as they would reside in memory will help you understand this better. Hopefully the code and comments below will help you see why you were only ever writing the characters at the int* indices.

var val = "Hello World!";
/*
           Hello World!
char idx = 012345678911
                     01

           Hello World!
int idx =  0 1 2 3 4 5

-> this is why len should be 6 below    

*/
unsafe
{
    fixed (char* src = val)
    {
        var ptr = (int*)src;

        //explicit definition of what val.Length / 2 would actually mean
        // -> there are actually 6 integers here but 12 chars
        var len = val.Length * sizeof(char) / sizeof(int);  
        while (len > 0)
        {
            //char pointer to the first "char" of the int
            var word = (char*) ptr;         
            Console.WriteLine(*word);
            /* types matter here.  ptr[1] is the next _integer_ 
               not the next character like it would for a char* */
            Console.WriteLine(word[1]);   //next char of the int @ ptr

            ptr++; // next integer / word[2]
            len--;
        }
    }
}

Upvotes: 3

UpQuark

Reputation: 801

Here's your problem as I see it:

In C#, chars are represented by 2 bytes (16 bits). Integers, on the other hand, are 4 bytes (32 bits). Integer values from 0 up to 2^16 - 1 can be cast to char without losing anything, because the same 16 bits that express that integer can be reinterpreted as a UTF-16 code unit. The underlying bits are exactly the same, but they are read as a different value.

However, what's messing you up is the size difference. An int is 4 bytes to a char's 2, so your int pointer advances in units of sizeof(int) (4 bytes) rather than sizeof(char) (2 bytes). Each step moves forward 32 bits, but the cast to char keeps only 16 of them, so you skip every other char. Hence the H l o W r d.
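To make the cast behaviour concrete, here is a minimal sketch that needs no unsafe code. It packs two chars into one int by hand and then casts back. The class and method names (CastDemo, LowChar) are made up for illustration; the only assumption is the default unchecked conversion, which silently drops the upper 16 bits.

```csharp
using System;

public static class CastDemo
{
    // Hypothetical helper: an int-to-char cast in an unchecked context
    // keeps only the low 16 bits of the int and discards the rest.
    public static char LowChar(int value) => unchecked((char)value);

    public static void Main()
    {
        // 'H' is 0x0048 and 'e' is 0x0065, so packed is 0x00650048.
        int packed = 'H' | ('e' << 16);

        Console.WriteLine(LowChar(packed));        // prints H
        Console.WriteLine(LowChar(packed >> 16));  // prints e
    }
}
```

In a checked context the same cast would throw an OverflowException for values above 65535, so the silent truncation shown here depends on the code being unchecked.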

If you want to learn more about pointer arithmetic and bitwise operations, learning some basic C is a cool and pretty fun (subject to debate) way.

Upvotes: 0

Simon Whitehead

Reputation: 65079

Your problem is that you're casting the pointer to an int*.. which points at 32-bit values.. not 16-bit ones like the char*.

Therefore, each increment is 32 bits. Here's a picture (praise my artwork if you must):

[Images: the string's memory as seen through a char* and through an int*]

Sorry about the dodgy arrows.. I think my mouse batteries are dying

When you're reading via a char pointer.. you're reading character by character at 16 bits.

When you cast it to an int pointer.. you're reading at 32-bit increments. That means, ptr[0] is both H and e (but points at the base of the H). ptr[1] is both l's..

That is why you are essentially skipping a character in your output.

When you cast it back to a char here:

Console.WriteLine((char)*ptr);

..only the low 16 bits survive that conversion, which (on a little-endian machine such as x86/x64) is the first character in each pair.
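As a rough illustration of the "both H and e" point, the sketch below reads the string's UTF-16 bytes four at a time and splits each 32-bit value into its two halves. It uses BitConverter instead of pointers so that it compiles without the unsafe switch. PairDemo and CharsAt are hypothetical names, and the first/second ordering assumes a little-endian machine.

```csharp
using System;
using System.Text;

public static class PairDemo
{
    // Hypothetical helper: returns the two chars packed into the 32-bit
    // value at the given int index. Assumes a little-endian machine, where
    // the low 16 bits hold the first char of the pair.
    public static (char first, char second) CharsAt(string s, int intIndex)
    {
        byte[] bytes = Encoding.Unicode.GetBytes(s);  // UTF-16LE, 2 bytes per char
        int value = BitConverter.ToInt32(bytes, intIndex * sizeof(int));

        char low = (char)(value & 0xFFFF);           // what (char)*ptr keeps
        char high = (char)((value >> 16) & 0xFFFF);  // what (char)*ptr throws away
        return (low, high);
    }

    public static void Main()
    {
        var val = "Hello World!";
        int ints = val.Length * sizeof(char) / sizeof(int);  // 6 ints for 12 chars

        for (int i = 0; i < ints; i++)
        {
            var pair = CharsAt(val, i);
            Console.WriteLine("ptr[{0}] = '{1}' + '{2}'", i, pair.first, pair.second);
        }
    }
}
```

On a little-endian machine the first line pairs H with e and the second pairs the two l's, matching the description above.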

Upvotes: 12

Konstantin

Reputation: 3294

A char is 16 bits while an int is 32, so after the cast you read 32 bits at a time. You can easily see it if, instead of int, you use short (16 bits). Then you'll get your Hello:

var ptr = (short*)src;
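The same width comparison can be sketched without pointers. This version is only an approximation of the pointer code: it assumes a runtime where MemoryMarshal.Cast and string.AsSpan are available (.NET Core 2.1 or later, or the System.Memory package), and the names WidthDemo, ReadAsShorts and ReadAsInts are invented for the example.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

public static class WidthDemo
{
    // View the chars as 16-bit values: one element per char, nothing skipped.
    public static string ReadAsShorts(string s)
    {
        ReadOnlySpan<short> shorts = MemoryMarshal.Cast<char, short>(s.AsSpan());
        var sb = new StringBuilder();
        foreach (short v in shorts) sb.Append(unchecked((char)v));
        return sb.ToString();
    }

    // View the chars as 32-bit values: one element per TWO chars, and the
    // cast back to char keeps only the low 16 bits of each element.
    public static string ReadAsInts(string s)
    {
        ReadOnlySpan<int> ints = MemoryMarshal.Cast<char, int>(s.AsSpan());
        var sb = new StringBuilder();
        foreach (int v in ints) sb.Append(unchecked((char)v));
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(ReadAsShorts("Hello World!"));  // every char survives
        Console.WriteLine(ReadAsInts("Hello World!"));    // every other char (little-endian)
    }
}
```

On a little-endian machine the int view yields the same H l o W r d sequence as the question's output, while the short view gives back the whole string.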

Upvotes: 1

http://msdn.microsoft.com/en-us/library/vstudio/x9h8tsay.aspx

A char is 16 bits, and an int is 32 bits. Every time you add 1 to your int pointer, you're moving it forward by two chars' worth.

That's why you're only seeing every other char (the 1st, 3rd, 5th, and so on).

Upvotes: 3

Gavin

Reputation: 516

Characters in C# strings are 2 bytes long because strings are encoded in UTF-16.

Upvotes: 2
