Reputation: 8597
I originally asked a related question here, but didn't really seem to get anywhere. Perhaps if I rephrase part of it more specifically it might help....
I have files stored using Matlab's sparse format (HDF5, csc I believe), and I'm trying to use Pytables to operate on them directly, but haven't succeeded yet. Using h5py I can do the following:
# Method 1: uses h5py (WORKS)
f1 = h5py.File(fname)
data = f1['M']['data']
ir = f1['M']['ir']
jc = f1['M']['jc']
M = scipy.sparse.csc_matrix((data, ir, jc))
but if I try to do the equivalent in Pytables:
# Method 2: uses pyTables (DOESN'T WORK)
f2 = tables.openFile(fname)
data = f2.root.M.data
ir = f2.root.M.ir
jc = f2.root.M.jc
M = scipy.sparse.csc_matrix( (data,ir,jc) )
this fails (after a long wait) with the error:
TypeError Traceback (most recent call last)
/home/tdiethe/BMJ/<ipython console> in <module>()
/usr/lib/python2.6/dist-packages/scipy/sparse/compressed.pyc in __init__(self, arg1, shape, dtype, copy, dims, nzmax)
56 self.indices = np.array(indices, copy=copy)
57 self.indptr = np.array(indptr, copy=copy)
---> 58 self.data = np.array(data, copy=copy, dtype=getdtype(dtype, data))
59 else:
60 raise ValueError, "unrecognized %s_matrix constructor usage" %\
/usr/lib/python2.6/dist-packages/scipy/sparse/sputils.pyc in getdtype(dtype, a, default)
69 canCast = False
70 else:
---> 71 raise TypeError, "could not interpret data type"
72 else:
73 newdtype = np.dtype(dtype)
TypeError: could not interpret data type
Looking at f2:
In [63]: f2.root.M.data
Out[63]:
/M/data (CArray(4753606,), zlib(3)) ''
atom := Float64Atom(shape=(), dflt=0.0)
maindim := 0
flavor := 'numpy'
byteorder := 'little'
chunkshape := (8181,)
In [64]: f2.root.M.ir
Out[64]:
/M/ir (CArray(4753606,), zlib(3)) ''
atom := UInt64Atom(shape=(), dflt=0)
maindim := 0
flavor := 'numpy'
byteorder := 'little'
chunkshape := (8181,)
In [65]: f2.root.M.jc
Out[65]:
/M/jc (CArray(133339,), zlib(3)) ''
atom := UInt64Atom(shape=(), dflt=0)
maindim := 0
flavor := 'numpy'
byteorder := 'little'
chunkshape := (7843,)
I have two questions:
Upvotes: 3
Views: 2351
Reputation: 3119
I missed seeing this in your original post but I think your issue is in the design of PyTables, which provides an extra level of abstraction on top of the underlying data.
Consider the following:
>>> import tables
>>> import numpy as np
>>> h5_file = tables.openFile(fname)
>>> data = h5_file.root.M.data
At this point data is not a numpy array:
>>> type(data)
tables.array.Array
>>> isinstance(data, np.ndarray)
False
The tables.array.Array does not immediately load the underlying array, nor does it immediately expose array-like functionality. This is what led to the error when you tried to use these objects to create a sparse array in scipy.
Instead, the data object produced by PyTables is intended to provide access to the data through additional commands (e.g. the fancy indexing [...] you used before). In this approach you can access parts of the data, or all of it, by doing data[:] or data.read(). It is only at this point that the familiar numpy array is produced.
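The approach above can be sketched end to end. This is a minimal, self-contained example, not your actual file: it first writes a tiny stand-in HDF5 file with the same M/data, M/ir, M/jc layout as in the question, then reads it back, materializing each PyTables array with [:] before handing it to scipy (newer PyTables spells openFile as open_file):

```python
import numpy as np
import scipy.sparse
import tables

fname = 'sparse_demo.h5'  # illustrative stand-in for the MATLAB-produced file

# Write a tiny file with the question's layout: a group M holding the
# CSC components data (values), ir (row indices), jc (column pointers).
dense = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
csc = scipy.sparse.csc_matrix(dense)
with tables.open_file(fname, 'w') as f:
    g = f.create_group('/', 'M')
    f.create_array(g, 'data', csc.data)
    f.create_array(g, 'ir', csc.indices.astype(np.uint64))
    f.create_array(g, 'jc', csc.indptr.astype(np.uint64))

# The fix: slice each tables.Array with [:] (or call .read()) so scipy
# receives real numpy arrays rather than PyTables wrapper objects.
with tables.open_file(fname) as f:
    data = f.root.M.data[:]   # numpy.ndarray, not tables.Array
    ir = f.root.M.ir[:]
    jc = f.root.M.jc[:]
    M = scipy.sparse.csc_matrix((data, ir, jc))
```

Note that [:] reads the whole dataset into memory; for the 4.7M-element arrays above that is fine, but you can also slice a sub-range (e.g. data[:100]) to read only part of the data from disk.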
For more information on the tables.array.Array class, see http://pytables.github.com/usersguide/libref.html#the-array-class, or the Getting actual data section at http://www.pytables.org/moin/HowToUse for examples of accessing underlying data.
In comparison, h5py produces much more array-like objects, although still not numpy arrays. Consider:
>>> import h5py
>>> f1 = h5py.File(fname)
>>> data = f1['M']['data']
>>> type(data)
h5py._hl.dataset.Dataset
>>> isinstance(data, np.ndarray)
False
However, you can immediately do numpy operations on data, like your call to scipy, or simpler operations like np.cos(data) or data + np.arange(len(data)). The data object also has some familiar numpy-like attributes (e.g. shape), and the underlying data (a numpy.ndarray) is available at data.value. However, I'm not that familiar with h5py, as I haven't used it myself, so I'm not sure what its limitations are in this regard.
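A short sketch of that array-like behaviour, using a throwaway file (the file name and contents are illustrative, not from the question):

```python
import numpy as np
import h5py

fname = 'h5py_demo.h5'  # illustrative throwaway file

# Write a small dataset under the group/name layout used in the question.
with h5py.File(fname, 'w') as f:
    f.create_dataset('M/data', data=np.arange(5, dtype=np.float64))

with h5py.File(fname, 'r') as f:
    data = f['M']['data']          # an h5py Dataset, not an ndarray
    y = np.cos(data)               # numpy ufuncs accept it directly
    z = data[...] + np.arange(len(data))  # [...] reads the full array
```

Here np.cos works because the Dataset converts itself to an ndarray when numpy asks for one; slicing with [...] (or [:]) does the same conversion explicitly.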
In general it appears that PyTables and h5py have different design goals and should therefore be used in different ways: h5py offers a more numpy-like interface to HDF5 files, while PyTables offers much more complex, database-like operations. See the discussion of the differences in the h5py and PyTables docs and on the Enthought mailing list:
Upvotes: 3