user626998

Translating a C binary data read function to Python

(I've edited this for clarity and changed the actual question a bit based on EOL's answer.) I'm trying to translate the following C function to Python but failing miserably (see the C code below, taken from the Treaty of Babel). As I understand it, it takes four 1-byte chars starting at the memory location pointed to by from, casts each one to an unsigned long int so that it has room to be shifted, and bit-shifts them into place to form a big-endian 32-bit integer. That integer is then used in an algorithm that checks file validity.

static int32 read_alan_int(unsigned char *from)
{
    return ((unsigned long int) from[3]) | ((unsigned long int) from[2] << 8) |
           ((unsigned long int) from[1] << 16) | ((unsigned long int) from[0] << 24);
}
/*
  The claim algorithm for Alan files is:
   * For Alan 3, check for the magic word
   * load the file length in blocks
   * check that the file length is correct
   * For alan 2, each word between byte address 24 and 81 is a
      word address within the file, so check that they're all within
      the file
   * Locate the checksum and verify that it is correct
*/
static int32 claim_story_file(void *story_file, int32 extent)
{
    unsigned char *sf = (unsigned char *) story_file;
    int32 bf, i, crc = 0;
    if (extent < 160) return INVALID_STORY_FILE_RV;
    if (memcmp(sf, "ALAN", 4))
    { /* Identify Alan 2.x */
        bf = read_alan_int(sf + 4);
        if (bf > extent / 4) return INVALID_STORY_FILE_RV;
        for (i = 24; i < 81; i += 4)
            if (read_alan_int(sf + i) > extent / 4) return INVALID_STORY_FILE_RV;
        for (i = 160; i < (bf * 4); i++)
            crc += sf[i];
        if (crc != read_alan_int(sf + 152)) return INVALID_STORY_FILE_RV;
        return VALID_STORY_FILE_RV;
    }
    else
    { /* Identify Alan 3 */
        bf = read_alan_int(sf + 12);
        if (bf > (extent / 4)) return INVALID_STORY_FILE_RV;
        for (i = 184; i < (bf * 4); i++)
            crc += sf[i];
        if (crc != read_alan_int(sf + 176)) return INVALID_STORY_FILE_RV;

    }
    return INVALID_STORY_FILE_RV;
}

I'm trying to reimplement this in Python. For implementing the read_alan_int function, I would think that importing struct and doing struct.unpack_from('>L', data, offset) would work. However, on valid files, this always returns 24 for the value bf, which means that the for loop is skipped.
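
For reference, here is a minimal sketch of the struct approach next to the shift-based arithmetic from the C function. It assumes Python 3 style bytes literals (bytearray is used so that indexing yields integers on Python 2 as well); the name read_alan_int_shifts and the sample buffer are made up for illustration. Note that struct.unpack_from returns a tuple, so the value has to be taken from index 0.

```python
import struct

def read_alan_int(data, offset):
    """Read a big-endian unsigned 32-bit integer from data at offset."""
    # '>' selects big-endian byte order; 'L' is a 4-byte unsigned integer
    # when an explicit byte order is given.
    return struct.unpack_from('>L', data, offset)[0]

def read_alan_int_shifts(data, offset):
    """Same computation, written with the shifts used in the C original."""
    b = bytearray(data[offset:offset + 4])  # indexing a bytearray yields ints
    return (b[0] << 24) | (b[1] << 16) | (b[2] << 8) | b[3]

# Illustrative buffer, not taken from a real Alan file.
sample = b'\x00\x00\x00\x18' + b'\x01\x02\x03\x04'
print(read_alan_int(sample, 0))         # 24
print(read_alan_int_shifts(sample, 0))  # 24
print(read_alan_int(sample, 4))         # 16909060 (0x01020304)
```

If both functions agree on a real file, the integer reading is probably not where the problem lies.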

def read_alan_int(file_buffer, i):
    i0 = ord(file_buffer[i]) * (2 ** 24)
    i1 = ord(file_buffer[i + 1]) * (2 ** 16)
    i2 = ord(file_buffer[i + 2]) * (2 ** 8)
    i3 = ord(file_buffer[i + 3])
    return i0 + i1 + i2 + i3

def is_a(file_buffer):
    crc = 0
    if len(file_buffer) < 160:
        return False
    if file_buffer[0:4] == 'ALAN':
        # Identify Alan 2.x
        bf = read_alan_int(file_buffer, 4)
        if bf > len(file_buffer)/4:
            return False
        for i in range(24, 81, 4):
            if read_alan_int(file_buffer, i) > len(file_buffer)/4:
                return False
        for i in range(160, bf * 4):
            crc += ord(file_buffer[i])
        if crc != read_alan_int(file_buffer, 152):
            return False
        return True
    else:
        # Identify Alan 3.x
        #bf = read_long(file_buffer, 12, '>')
        bf = read_alan_int(file_buffer, 12)
        print bf
        if bf > len(file_buffer)/4:
            return False
        for i in range(184, bf * 4):
            crc += ord(file_buffer[i])
        if crc != read_alan_int(file_buffer, 176):
            return False
        return True
    return False


if __name__ == '__main__':
    import sys, struct
    data = open(sys.argv[1], 'rb').read()
    print is_a(data)

...but the damn thing still returns 24. Unfortunately, my C skills are non-existent so I'm having trouble getting the original program to print some debug output so I can know what bf is supposed to be.

What am I doing wrong?


Ok, so I'm apparently doing read_alan_int correctly. However, what's failing for me is the check that the first 4 characters are "ALAN". All of my test files fail this test. I've changed the code to remove this if/else statement and to instead just take advantage of early returns, and now all of my unit tests pass. So, on a practical level, I'm done. However, I'll keep the question open to address the new problem: how can I possibly wrangle the bits to get "ALAN" out of the first 4 chars?

def is_a(file_buffer):
    crc = 0
    if len(file_buffer) < 160:
        return False
    #if file_buffer.startswith('ALAN'):
        # Identify Alan 2.x
    bf = read_long(file_buffer, 4)
    if bf > len(file_buffer)/4:
        return False
    for i in range(24, 81, 4):
        if read_long(file_buffer, i) > len(file_buffer)/4:
            return False
    for i in range(160, bf * 4):
        crc += ord(file_buffer[i])
    if crc == read_long(file_buffer, 152):
        return True
    # Identify Alan 3.x
    crc = 0
    bf = read_long(file_buffer, 12)
    if bf > len(file_buffer)/4:
        return False
    for i in range(184, bf * 4):
        crc += ord(file_buffer[i])
    if crc == read_long(file_buffer, 176):
        return True
    return False

Upvotes: 4

Views: 671

Answers (3)

Eric O. Lebigot

Reputation: 94485

Your Python version looks fine to me.

PS: I missed the "memcmp() catch" that DSM found, so the Python code for if memcmp(…)… should actually be if file_buffer[0:4] != 'ALAN'.

As far as I can see from the C code and from the sample file you give in the comments to the original question, the sample file is indeed invalid; here are the values:

read_alan_int(sf+12) == 24  # 0, 0, 0, 24 in file sf, big endian
crc = 0
read_alan_int(sf+176) == 46  # 0, 0, 0, 46 in file sf, big endian

So, crc != read_alan_int(sf+176), indeed.

Are you sure that the sample file is a valid file? Or is part of the crc calculation missing from the original post?
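
To see why crc stays at 0 with these values, here is a small sketch of the checksum loop, assuming Python 3 style bytes literals. The helper name byte_sum and the filler buffer are hypothetical. With bf == 24 the loop would run over range(184, 96), which is empty, so nothing is ever added.

```python
def byte_sum(data, start, block_count):
    """Sum the bytes from start up to (not including) block_count * 4."""
    buf = bytearray(data)  # indexing a bytearray yields ints
    crc = 0
    for i in range(start, block_count * 4):
        crc += buf[i]
    return crc

# Illustrative buffer of 400 bytes, each with value 1 (not a real Alan file).
filler = b'\x01' * 400

print(len(range(184, 24 * 4)))    # 0: the start is already past the stop
print(byte_sum(filler, 184, 24))  # 0: the loop body never runs
print(byte_sum(filler, 184, 50))  # 16: bytes 184..199 are summed
```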

Upvotes: 0

John Machin

Reputation: 82934

Hypothesis 1: You are running on Windows, and you haven't opened your file in binary mode.
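
A minimal way to rule this out, sketched with a temporary file and a made-up payload: write bytes that text mode could alter (a CR LF pair and a 0x1A byte), read them back with 'rb', and check that nothing changed.

```python
import os
import tempfile

# Hypothetical payload containing bytes that text-mode reads may alter.
payload = b'\x00\x00\x00\x18\r\n\x1a\xff'

fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'wb') as f:
        f.write(payload)
    # 'rb' returns the bytes exactly as stored on disk.
    with open(path, 'rb') as f:
        data = f.read()
finally:
    os.remove(path)

print(data == payload)  # True
```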

Upvotes: 0

DSM

Reputation: 353059

Ah, I think I've got it. Note that the description says

/*
  The claim algorithm for Alan files is:
   * For Alan 3, check for the magic word
   * load the file length in blocks
   * check that the file length is correct
   * For alan 2, each word between byte address 24 and 81 is a
      word address within the file, so check that they're all within
      the file
   * Locate the checksum and verify that it is correct
*/

which I read as saying that there's a magic word in Alan 3, but not in Alan 2. However, your code goes the other way: it looks for 'ALAN' in the Alan 2 branch, even though the C code only expects the ALAN magic word in Alan 3 files.

Why? Because you don't speak C, so you guessed -- naturally enough! -- that memcmp would return (the equivalent of a Python) True if the first four characters of sf and "ALAN" are equal, but it doesn't. memcmp returns 0 if the contents are equal, and nonzero if they differ, so the first branch of the C if runs when the magic word is absent.

And that seems to be the way it works:

>>> import urllib2
>>> 
>>> alan2 = urllib2.urlopen("http://ifarchive.plover.net/if-archive/games/competition2001/alan/chasing/chasing.acd").read(4)
>>> alan3 = urllib2.urlopen("http://mirror.ifarchive.org/if-archive/games/competition2006/alan/enterthedark/EnterTheDark.a3c").read(4)
>>> 
>>> alan2
'\x02\x08\x01\x00'
>>> alan3
'ALAN'
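
The memcmp point can be sketched in Python as follows, assuming Python 3 style bytes literals. memcmp_like and branch_taken are hypothetical helpers written only to mirror the C condition; in real code, data[:4] != b'ALAN' says the same thing.

```python
def memcmp_like(a, b, n):
    """Return a negative, zero or positive int like C's memcmp over n bytes."""
    left, right = bytearray(a[:n]), bytearray(b[:n])
    return (left > right) - (left < right)

def branch_taken(data):
    """Report which branch of the C if/else a buffer would enter."""
    # C: if (memcmp(sf, "ALAN", 4)) { Alan 2.x } else { Alan 3 }
    # Nonzero is truthy in C, so the first branch runs when the bytes DIFFER.
    if memcmp_like(data, b'ALAN', 4) != 0:  # same as: data[:4] != b'ALAN'
        return 'Alan 2.x'
    return 'Alan 3'

# First four bytes as shown in the session above, padded with zeros.
print(branch_taken(b'\x02\x08\x01\x00' + b'\x00' * 4))  # Alan 2.x
print(branch_taken(b'ALAN' + b'\x00' * 4))              # Alan 3
```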

Upvotes: 1
