ebarr

Reputation: 7842

Python/ctypes file handle difference between Mac OS X and Ubuntu

I am currently attempting to port some code between my Linux machine (Ubuntu 12.04.1 LTS) and my new Mac (OS X 10.7.4), and I have come across some confusing behavior when using Python's ctypes module to access the C standard library on the Mac.

To illustrate the problem, the following is a minimal example:

import ctypes as C
import numpy as np

libc = C.CDLL("/usr/lib/libc.dylib")   #/usr/lib/libc.so.6 on ubuntu

np.arange(10,dtype="ubyte").tofile("test.bin") # create some test data

buffer_array = np.empty(10,dtype="ubyte") # create a reading buffer

buffer_array_c = np.ctypeslib.as_ctypes(buffer_array) # get the ctypes version of the buffer 

c_file = libc.fopen("test.bin","r") # open the file through libc   

libc.fread(buffer_array_c, 1, 10, c_file) # read from the file

libc.fclose(c_file)

print "Desired output:"
print np.fromfile("test.bin",dtype="ubyte")
print
print "Actual output:"
print buffer_array

On Linux, this works as expected, producing the following:

Desired output:
[0 1 2 3 4 5 6 7 8 9]

Actual output:
[0 1 2 3 4 5 6 7 8 9]

On the Mac, however, I just get `Segmentation fault: 11`.

I have experimented with this a bit, swapping out the fopen call with:

py_file = open("test.bin","r")

c_file = C.pythonapi.PyFile_AsFile(C.py_object(py_file))

This also works on Linux but not on the Mac.

I think the problem comes from calling fread with c_file: if I write a minimal C function that opens the file and then calls fread with the previously allocated buffer, the code performs as expected.

I am not normally a Mac user, so the problem may be obvious, but any help would be very useful.

For reference, I am using:

Python 2.7.3, NumPy 1.4.0, and ctypes 1.1.0

Edit:

To give this some context, I am experimenting with fast methods for reading very large binary files (~40-200 GB) into Python piece by piece. As a commenter points out below, there is not really any performance increase to be had from calling the standard library's fread and fwrite directly. This is true, but I am confused as to why. If I were to use numpy.fromfile to read a large file in chunks, wouldn't I be creating a new memory allocation with each read?
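To make that question concrete, here is a minimal sketch (the chunk size is just an illustrative value) contrasting the two approaches: np.fromfile allocates a fresh array on every call, while readinto on an ordinary Python 2 file object fills a buffer that is allocated once up front.

import numpy as np

CHUNK = 1024 * 1024  # bytes per read; illustrative value

# New allocation on every call:
f = open("test.bin", "rb")
chunk = np.fromfile(f, dtype="ubyte", count=CHUNK)  # fresh array created here
f.close()

# Single buffer reused across reads:
buf = np.empty(CHUNK, dtype="ubyte")
f = open("test.bin", "rb")
n = f.readinto(buf)  # fills the existing buffer in place; buf[:n] holds the data
f.close()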

Solution:

The problem seems to stem from the 32-bit/64-bit difference in how the file handle (a FILE* pointer) is stored. The solution is simply to set the restype and argtypes of each C function explicitly before use.

i.e. on a 64-bit machine we put this after the C.CDLL call:

libc.fopen.restype = C.c_long
libc.fread.argtypes = [C.c_void_p, C.c_size_t, C.c_size_t, C.c_long]
libc.fclose.argtypes = [C.c_long]

While on a 32-bit machine:

libc.fopen.restype = C.c_int
libc.fread.argtypes = [C.c_void_p, C.c_size_t, C.c_size_t, C.c_int]
libc.fclose.argtypes = [C.c_int]
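As a side note (a sketch, not part of the fix above): since fopen() actually returns a FILE*, declaring the handle as c_void_p gives the right width on both 32-bit and 64-bit builds and avoids maintaining two branches; the pointer size can also be checked at runtime if you do want to branch.

libc.fopen.restype = C.c_void_p
libc.fread.argtypes = [C.c_void_p, C.c_size_t, C.c_size_t, C.c_void_p]
libc.fclose.argtypes = [C.c_void_p]

print C.sizeof(C.c_void_p) * 8  # 32 or 64, if you prefer to branch explicitly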

Upvotes: 2

Views: 7916

Answers (1)

Armin Rigo

Reputation: 12900

Are you trying on a 32-bit Ubuntu versus a 64-bit OS/X? I think the issue is that your version of libc.fopen() returns a C "int", which is almost always a 32-bit value --- but the real fopen() returns a pointer. So on a 64-bit operating system, the c_file that you get is truncated to a 32-bit integer. On a 32-bit operating system, it works anyway because the 32-bit integer can be passed back to the fread() and fclose(), which will interpret it again as a pointer. To fix it, you need to declare the restype of libc.fopen().
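In code, that failure mode looks roughly like this (a sketch reusing the names from the question):

c_file = libc.fopen("test.bin", "r")  # default restype is c_int, so on a 64-bit
                                      # build the FILE* loses its upper 32 bits here
libc.fread(buffer_array_c, 1, 10, c_file)  # the truncated value is reinterpreted as
                                           # a pointer, hence the segmentation fault

libc.fopen.restype = C.c_void_p  # declaring the restype before calling fopen()
                                 # preserves the full pointer value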

(I can only recommend CFFI as an alternative to ctypes with saner defaults, but of course I'm partial there, being one of the authors :-)
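For comparison, the same thing in CFFI might look roughly like this (a sketch in ABI mode; ffi.dlopen(None) loads the C standard library on most Unix systems). The FILE* type is declared up front in the cdef, so there is no default-int surprise to run into:

from cffi import FFI

ffi = FFI()
ffi.cdef("""
    typedef ... FILE;
    FILE *fopen(const char *path, const char *mode);
    size_t fread(void *ptr, size_t size, size_t nmemb, FILE *stream);
    int fclose(FILE *stream);
""")
libc = ffi.dlopen(None)  # the C standard library

buf = ffi.new("unsigned char[]", 10)
f = libc.fopen("test.bin", "r")
libc.fread(buf, 1, 10, f)
libc.fclose(f)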

Upvotes: 6
